Natural Language Processing: Language Model & Recurrent Neural Network (RNN)
A language model is an important subcomponent of many Natural Language Processing (NLP) tasks, covering areas such as handwriting recognition, speech recognition, spelling correction and machine translation. One familiar example is the "auto-complete" function that we use every day while sending messages. In this article, we introduce the statistical language model (n-gram) and the neural language model built with a vanilla RNN.
Language Model
A language model is a system that assigns a probability to a piece of text. Given some text x(1), …, x(T), the probability of the text is:
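Applying the chain rule of probability, this is:

\[
P\left(x^{(1)}, \ldots, x^{(T)}\right) = P\left(x^{(1)}\right) \, P\left(x^{(2)} \mid x^{(1)}\right) \cdots P\left(x^{(T)} \mid x^{(T-1)}, \ldots, x^{(1)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)
\]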
The above equation is useful for translation systems when deciding whether a word sequence is a correct translation of the input. For each sentence to be translated, various alternative word sequences are generated, and each candidate is scored by running it through the probability function. The model then chooses the sequence with the highest score as the output of the translation.
Language modelling is a form of self-supervised learning, where parts of the data themselves are used as labels: each next word serves as the label for the words that precede it.
So, how do we calculate what the next word should be, given a sequence of words?
Statistical Language Model
The basic idea of a statistical language model is to build a table of probabilities of next words given a sequence of words, as shown in Figure 1. Such a table has a huge number of entries (rows), and it is very hard to control its size in advance. That is where the n-gram language model comes in.
N-gram
In an n-gram model, we make a simplifying assumption that the next word depends only on the preceding n−1 words. The idea is to collect statistics about how frequent different n-grams are in a corpus, and use these counts to predict the next word.
We estimate the probability of the next word given the previous n−1 words using the following equation:
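In symbols, this is the standard definition of the conditional probability:

\[
P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}\right) = \frac{P\left(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\right)}{P\left(x^{(t)}, \ldots, x^{(t-n+2)}\right)}
\]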
By the definition of conditional probability, this is the ratio of the probability of the next word occurring together with the previous n−1 words to the probability of the previous n−1 words alone.
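These probabilities are, in turn, estimated by counting how often the word sequences occur in a large corpus:

\[
P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}\right) \approx \frac{\operatorname{count}\left(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\right)}{\operatorname{count}\left(x^{(t)}, \ldots, x^{(t-n+2)}\right)}
\]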
Suppose we have this sentence: "A language model is a function that puts a probability measure over strings drawn from some vocabulary", and we want to calculate the probability of "vocabulary" using a 5-gram model. Only the four preceding words, "strings drawn from some", will be used to calculate the probability in this case.
Problems with n-gram
There are two problems with the n-gram language model: the sparsity problem and the storage problem.
Sparsity Problem
The numerator of the n-gram probability equation, the count of the next word w together with its preceding context words ("strings drawn from some w"), can be zero if that combination never occurred in the corpus. This happens because the corpus cannot cover all possible combinations, and some combinations simply do not make sense (for example, "strings drawn from some planet"). As a result, many entries get zero probability, and we have a sparsity problem.
Partial solution: smoothing addresses this by adding a small value δ to the count of every word w in the vocabulary V.
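One common form of this is add-δ (Lidstone) smoothing, where the estimate becomes:

\[
P\left(w \mid x^{(t)}, \ldots, x^{(t-n+2)}\right) \approx \frac{\operatorname{count}\left(w, x^{(t)}, \ldots, x^{(t-n+2)}\right) + \delta}{\operatorname{count}\left(x^{(t)}, \ldots, x^{(t-n+2)}\right) + \delta\,|V|}
\]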
Similarly, the denominator of the n-gram probability equation can be zero if the context itself is never observed in the corpus. This is even worse, because a zero denominator means the conditional probability is undefined, so we cannot compute a probability for any next word in that context.
Partial solution: the backoff method is used, where if no occurrence of "strings drawn from some" is found, the shorter context "drawn from some" is used instead.
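To make the counting, smoothing and backoff ideas concrete, here is a minimal count-based sketch in Python. The toy corpus, the value of δ, the trigram order and the helper names are illustrative assumptions, not part of the original article.

```python
from collections import Counter

# Toy corpus (the article's example sentence); a real model would use a large corpus.
corpus = ("a language model is a function that puts a probability "
          "measure over strings drawn from some vocabulary").split()

N = 3          # highest order: condition on the previous N-1 = 2 words (a trigram model)
DELTA = 0.1    # add-delta smoothing constant
vocab = set(corpus)

# n-gram counts for every order 1..N, so that backoff can fall back to shorter contexts.
counts = {k: Counter(tuple(corpus[i:i + k]) for i in range(len(corpus) - k + 1))
          for k in range(1, N + 1)}

def prob(word, context):
    """P(word | context) with add-delta smoothing and a simple backoff."""
    context = tuple(context)[-(N - 1):]        # keep at most the last N-1 context words
    # Backoff: if this context was never seen, drop its earliest word and retry.
    while context and counts[len(context)][context] == 0:
        context = context[1:]
    numerator = counts[len(context) + 1][context + (word,)] + DELTA
    if context:
        denominator = counts[len(context)][context] + DELTA * len(vocab)
    else:
        denominator = len(corpus) + DELTA * len(vocab)
    return numerator / denominator

print(prob("model", ("a", "language")))        # seen trigram: relatively high probability
print(prob("vocabulary", ("a", "language")))   # unseen combination: small but non-zero
print(prob("some", ("language", "from")))      # unseen context: backs off to ("from",)
```

Without the δ term the second call would return exactly zero, and without the backoff loop the third call would divide by an unseen context count.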
Storage Problem
We need a large amount of storage to keep counts for all the n-grams that appear in the corpus in order to compute these probabilities.
Both the sparsity and the storage problems get worse as n increases.
On the other hand, when using the language model to generate text, a small n does not provide enough context to produce coherent output; we need to increase n to make the language model work well. This puts us in a contradictory situation.
To tackle these problems, often described as the curse of dimensionality, the neural language model was introduced.
Fixed-window Neural Language Model
A neural language model learns a distributed representation of words, along with a probability function for word sequences expressed in terms of these representations. The general steps of a fixed-window neural language model are shown in Figure 2.
First, the input words are fed into a lookup table to get the corresponding word embeddings, which are then concatenated. The concatenated embeddings are fed into a neural network with a hidden layer:
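One common way to write these two steps, with word embeddings e(1), …, e(n), a weight matrix W, a bias b1 and a non-linearity f (the exact symbols are an assumption, since the original figure is not reproduced here), is:

\[
e = \left[e^{(1)}; e^{(2)}; \ldots; e^{(n)}\right], \qquad h = f\left(W e + b_1\right)
\]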
The hidden layer output is then multiplied by a matrix U to produce a score vector of size |V|, the vocabulary size. Lastly, a softmax function is applied over these scores to obtain the output distribution over the vocabulary.
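In the same notation, with an output weight matrix U and bias b2:

\[
\hat{y} = \operatorname{softmax}\left(U h + b_2\right) \in \mathbb{R}^{|V|}
\]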
With the fixed-window neural language model, we no longer have the sparsity problem, and the model does not need to store all observed n-grams. However, some problems remain: the fixed window is often too small, enlarging the window enlarges the weight matrix W of the hidden layer, and there is no symmetry in how the inputs are processed (each position in the window is multiplied by different columns of W).
We need a neural architecture that can process any length input and this is when Recurrent Neural Network (RNN) comes into play.
Recurrent Neural Network (RNN)
Recurrent Neural Networks (RNNs) are capable of conditioning the model on all previous words in the corpus, unlike the models above, where only a finite window of previous words is considered.
The core idea of an RNN is to apply the same weight matrix Wh repeatedly at every time step. As a result, the number of parameters the model has to learn is small and independent of the length of the input sequence, which overcomes the growth problems described above.
Figure 3 illustrates the basic RNN language model. Each RNN box represents a hidden layer at time-step t. Each layer holds a number of neurons, which perform a linear matrix operation on their inputs followed by a non-linear operation (e.g. tanh, ReLU), as per below:
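One standard form of this computation, with a non-linearity σ and a bias term b1 (the bias is a standard addition not spelled out in the text), is:

\[
h^{(t)} = \sigma\left(W^{(h)} h^{(t-1)} + W^{(e)} x^{(t)} + b_1\right)
\]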
At each time-step there are two inputs to the hidden layer: the output of the previous hidden layer, h(t−1), and the input word embedding at that time-step, x(t). The input x(t) is multiplied by a weight matrix W(e), and the previous hidden layer output is multiplied by a weight matrix W(h); together they produce the output features h(t). This recurrence is how the hidden layer output is computed at every time-step t.
The output features h(t) are then multiplied by a weight matrix U and run through a softmax function over the vocabulary to obtain a predicted distribution ŷ over the next word.
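In symbols, with b2 again a standard bias term:

\[
\hat{y}^{(t)} = \operatorname{softmax}\left(U h^{(t)} + b_2\right) \in \mathbb{R}^{|V|}
\]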
Training a RNN Language Model
Given a large corpus of text, we feed the sequence of words into the RNN language model and predict the output distribution ŷ(t) at every word position.
The cross-entropy between the predicted probability distribution ŷ(t) and the true next word y(t) (a one-hot vector) is used as the loss function; the formula for a single step is below:
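Since y(t) is a one-hot vector whose only non-zero entry marks the true next word x(t+1), the per-step loss reduces to the negative log-probability the model assigns to that word:

\[
J^{(t)}(\theta) = \operatorname{CE}\left(y^{(t)}, \hat{y}^{(t)}\right) = -\sum_{w \in V} y^{(t)}_{w} \log \hat{y}^{(t)}_{w} = -\log \hat{y}^{(t)}_{x_{t+1}}
\]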
To get the overall loss for the entire training set, we average the per-step losses:
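\[
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{y}^{(t)}_{x_{t+1}}
\]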
To compute the gradients needed to minimize this loss, the backpropagation through time algorithm is used, in which the gradient with respect to a repeated weight is the sum of its gradients at each time step where it appears. Stochastic Gradient Descent (SGD) computes the gradients on small chunks of data at a time, rather than on the whole corpus, to reduce the computational expense.
So, training proceeds by repeating three steps: compute the loss for a sentence (or batch of sentences), compute the gradients using backpropagation through time, and update the weights; a minimal sketch of this loop is shown below.
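The sketch below uses PyTorch to illustrate the loop. The vocabulary size, embedding and hidden dimensions, learning rate and the random token batch are illustrative assumptions, not values from the article.

```python
import torch
import torch.nn as nn

# Illustrative sizes; a real model uses a large corpus and tuned hyperparameters.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # word embeddings x(t)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)   # shared W(e), W(h)
        self.out = nn.Linear(hidden_dim, vocab_size)                  # matrix U over the vocabulary

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))   # hidden states h(t) for every time-step
        return self.out(h)                    # unnormalised scores; softmax is inside the loss

model = RNNLanguageModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()               # cross-entropy over the vocabulary

# Fake batch of token ids standing in for a real corpus: the model predicts each next word.
batch = torch.randint(0, vocab_size, (32, 20))        # (batch size, sequence length)
inputs, targets = batch[:, :-1], batch[:, 1:]         # shift by one: x(t) -> x(t+1)

for step in range(10):                                # repeat: loss, backprop, update
    logits = model(inputs)                            # (batch, T, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # backpropagation through time
    optimizer.step()
```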
Evaluation of Language Models
The standard evaluation metric for language models is perplexity. It is a measure of confusion: lower values mean the model is more confident in predicting the next word in the sequence.
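Formally, perplexity is the inverse probability of the corpus, normalised by the number of words, which equals the exponential of the average cross-entropy loss J(θ):

\[
\text{perplexity} = \prod_{t=1}^{T} \left(\frac{1}{\hat{y}^{(t)}_{x_{t+1}}}\right)^{1/T} = \exp\left(\frac{1}{T}\sum_{t=1}^{T} -\log \hat{y}^{(t)}_{x_{t+1}}\right) = \exp\left(J(\theta)\right)
\]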
Advantages & disadvantages of RNN
The vanilla RNN language model solves the problems encountered by the fixed-window neural language model. Its advantages include: 1) the model can process inputs of any length, 2) the model size does not increase with longer inputs, and 3) the same weights are applied at every step, so there is symmetry in how the inputs are processed.
Nevertheless, recurrent computation is slow (each step depends on the previous one, so it cannot be parallelised across time-steps), and a vanilla RNN finds it difficult to access information from many steps back.
In conclusion, language modelling is a benchmark task that measures our progress in understanding language. It is a subcomponent of many NLP applications such as predictive text generation, handwriting recognition, summarization, and machine translation.
If you are new to Natural Language Processing, you can refer to my previous posts on Word Vectors and Neural Networks for the background knowledge behind language models.