Natural Language Processing — Myanmar Study Group : Word Vectors & Evaluations

Cho Zin Tun
Dec 12, 2020



My friends and I are organising the Natural Language Processing (NLP) Myanmar Reading Group to discuss the latest research papers, NLP models, and benchmarking processes. We follow the CS224N course from Stanford University for lectures and materials. The study group runs in an interactive seminar style, where a randomly chosen participant presents and discusses an assigned topic every week. We believe this approach drives participants to learn proactively rather than passively listen to the presenter.

This article is a summary of what we covered during our Week 1 and Week 2 sessions on word vectors and evaluation methods. Only the ideas behind the evolution of the various methods for generating word vectors are discussed; articles covering the detailed mathematics can be found in the references section.

Word embedding

The way computers “see” data is different from how humans do. We can visualise the scene in our mind when someone says “I saw a cat”, but machines cannot. They need mathematical representations of words in vector form, also known as word embeddings.

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols and represent them by one-hot vectors: for the i-th word in the vocabulary, the vector has 1 in the i-th dimension and 0 everywhere else.

In web search, if the user searches for “Seattle hotel”, we would like the results to also match “Seattle motel”. However, one-hot vectors know nothing about the words they represent and cannot return such a result. There is no natural notion of similarity between one-hot vectors; for instance, they “think” that motel is as close to hotel as it is to burger.
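As a quick illustration (a minimal sketch with a made-up three-word vocabulary), the dot product between any two distinct one-hot vectors is always zero, so they carry no similarity information at all:

```python
import numpy as np

# A toy vocabulary; a real one would contain hundreds of thousands of words.
vocab = ["hotel", "motel", "burger"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word`: 1 at its index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

hotel, motel, burger = (one_hot(w, vocab) for w in vocab)

# Every pair of distinct one-hot vectors is orthogonal, so "hotel" looks
# exactly as (dis)similar to "motel" as it does to "burger".
print(hotel @ motel)   # 0.0
print(hotel @ burger)  # 0.0
```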

Thus, in order for machines to capture the meaning of words, distributional semantics was introduced.

Representing words by their context

As humans, once we have seen how an unknown word is used in different contexts, we can work out its meaning. We do this by looking for other words that can be used in the same contexts and concluding that the unknown word has a similar meaning to those known words. In short, words that frequently appear in similar contexts have similar meanings.

“You shall know a word by the company it keeps” — J. R. Firth

We can make machines learn “to capture meaning” by putting information about word contexts into the word representations.

There are two methods to create the word vectors:

  1. Count-based method
  2. Prediction-based method

1. Count-based method

In count-based methods, we put the information about contexts into word vectors manually based on global corpus statistics.

There are two general steps to creating word vectors with a count-based method:

1. Construct the word-context matrix

Below are some of the approaches to creating the co-occurrence matrix:

a. Window-based co-occurrence matrix (see the sketch after this list)

b. Latent Semantic Analysis
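A minimal sketch of approach (a), using a made-up three-sentence corpus and a window of one word on each side (both chosen purely to keep the example small):

```python
from collections import defaultdict

# Toy corpus and window size, chosen only for illustration.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1

counts = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count every context word within `window` positions of the center word.
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[(word, sentence[j])] += 1

vocab = sorted({w for sent in corpus for w in sent})
# Row i of this matrix is the raw count-based vector for vocab[i].
matrix = [[counts[(w, c)] for c in vocab] for w in vocab]
```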

However, constructing word vectors from raw co-occurrence counts alone leads to problems: the vectors grow with the vocabulary, are very high-dimensional (requiring a lot of storage), and yield less robust models.

2. Reduce its dimensionality

One common dimensionality reduction method is Singular Value Decomposition (SVD). With dimensionality reduction, if the resulting vectors preserve linear structure, it is possible to find syntactic as well as semantic patterns.
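Continuing the idea, a minimal sketch of step 2 using NumPy's SVD (the tiny stand-in co-occurrence matrix and the choice of k = 2 dimensions are arbitrary, purely for illustration):

```python
import numpy as np

# `matrix` stands in for the word-word co-occurrence matrix built in the
# previous sketch; a tiny hard-coded one is used so this snippet runs on its own.
matrix = [[0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
X = np.array(matrix, dtype=float)

# Full SVD: X = U @ diag(S) @ Vt, with singular values in decreasing order.
U, S, Vt = np.linalg.svd(X)

# Keep only the top-k singular dimensions to get dense, low-dimensional vectors.
k = 2
word_vectors = U[:, :k] * S[:k]   # one k-dimensional row per vocabulary word
```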

2. Prediction-based method

In prediction-based methods, we put the information about contexts into word vectors by teaching models to predict contexts.

Word2Vec

Word2Vec is a framework for learning word vectors. The basic idea is as follows:

  • We have a large corpus of text
  • Every word in a fixed vocabulary is represented by a vector
  • Initially, the word vectors (one for each word as a “center” word and one as an “outside” word) are randomly assigned
  • Go through each position t in the text, which has a center word c and context (“outside”) words o
  • Try to predict the surrounding words using the word vectors: use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa), and pass the scores through a softmax function to get a probability distribution over the context words (a small numerical sketch follows this list)
  • Keep adjusting the word vectors to maximize this probability with stochastic gradient descent
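A small numerical sketch of that probability calculation (the five-word vocabulary, embedding size, and random initialisation are all illustrative assumptions); training would then adjust the two sets of vectors by gradient descent so that observed (center, outside) pairs get higher probability:

```python
import numpy as np

np.random.seed(0)
vocab = ["i", "saw", "a", "cat", "today"]
dim = 4  # a tiny embedding size, just for illustration

# Two randomly initialised vectors per word: one as "center", one as "outside".
V = np.random.randn(len(vocab), dim)  # center-word vectors
U = np.random.randn(len(vocab), dim)  # outside-word vectors

def p_outside_given_center(o, c):
    """P(o | c): softmax over the vocabulary of the dot products u_w . v_c."""
    scores = U @ V[vocab.index(c)]                 # similarity of c to every word
    probs = np.exp(scores) / np.exp(scores).sum()  # normalise into a distribution
    return probs[vocab.index(o)]

print(p_outside_given_center("cat", "saw"))
```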

Two model variants for Word2Vec:

1. Skip-grams (SG)

  • Predict the context words given the center word

For training efficiency, negative sampling, in which only a small sample of the weights is updated at each step, can be used.
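A minimal sketch of the negative-sampling objective for one (center, outside) pair (the function name and arguments below are illustrative): instead of a softmax over the whole vocabulary, only the true outside word and k sampled “negative” words contribute to the loss, so only their vectors (plus the center vector) get updated.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, u_negs):
    """Skip-gram negative-sampling loss for one (center, outside) pair.

    v_c    : center-word vector
    u_o    : vector of the true outside word
    u_negs : vectors of k sampled "negative" words (not real context words)
    """
    pos = np.log(sigmoid(u_o @ v_c))                # true pair: push similarity up
    neg = np.sum(np.log(sigmoid(-(u_negs @ v_c))))  # negatives: push similarity down
    return -(pos + neg)
```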

2. Continuous Bag of Words (CBOW)

  • Predict center word from (bag of) context words

So far, we have gone through count-based and prediction-based methods, each with its own pros and cons. In the next section, we will see how the GloVe model combines the two to achieve better results.

GloVe model

The GloVe model is a combination of the count-based and prediction-based methods. GloVe stands for Global Vectors, which reflects its idea of using global information from the corpus to learn vectors.

GloVe has two vectors per word, just as Word2Vec does, and incorporates co-occurrence counts between words into its loss function. An additional weighting function is introduced in GloVe to control the influence of rare and frequent words.
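As a minimal sketch of one term of that loss (x_max = 100 and alpha = 0.75 are the values reported in the GloVe paper; the function names here are just illustrative): the weighting function f caps the influence of very frequent pairs, and the squared term pushes the dot product of the two word vectors towards the log of their co-occurrence count.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x) that damps the influence of rare and very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One nonzero co-occurrence count's contribution to the GloVe loss:
    f(x_ij) * (w_i . w_j + b_i + b_j - log(x_ij)) ** 2
    """
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2
```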

The GloVe model efficiently leverages global statistical information by training only on the nonzero elements of a word-word co-occurrence matrix, and it produces a vector space with meaningful sub-structure. Given the same corpus, vocabulary, window size, and training time, it consistently outperforms Word2Vec on the word analogy task. It achieves better results faster and also obtains the best results irrespective of speed.

We have covered methods such as Word2Vec and GloVe that train and discover latent vector representations of natural-language words in a semantic space. Next: how do we evaluate the quality of the word vectors these techniques generate?

Evaluation methods

There are two types of general evaluation in NLP: “Intrinsic” and “Extrinsic” evaluation methods.

Intrinsic Evaluation

Intrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique (such as Word2Vec or GloVe) on specific intermediate subtasks. These subtasks are typically simple and fast to compute, which helps us understand the system used to generate the word vectors. An intrinsic evaluation should typically return a number that indicates the performance of those word vectors on the evaluation subtask.

Types of intrinsic evaluation:

  • Word vector analogy (observations: a good dimensionality is around 300, and the more data, the better the results; a sketch of the analogy computation follows this list)
  • Correlation Evaluation
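A minimal sketch of the analogy subtask (the `vectors` dictionary of pre-trained embeddings is assumed to come from a model such as Word2Vec or GloVe): we answer questions like “man is to king as woman is to ?” by searching for the word whose vector is nearest, by cosine similarity, to vector(king) - vector(man) + vector(woman).

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Solve "a is to b as c is to ?" by finding the word whose vector is
    closest (by cosine similarity) to vectors[b] - vectors[a] + vectors[c]."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy("man", "king", "woman", vectors) should return "queen"
# when `vectors` holds good pre-trained embeddings.
```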

Extrinsic Evaluation

Extrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique on the real task at hand. Performance on these tasks is typically slow to compute. Moreover, when an extrinsic evaluation system underperforms, it is usually not possible to tell which specific subsystem is at fault, and this motivates the need for intrinsic evaluation.

I will be posting regularly about our study group's journey and the topics we discuss.

References & Further Readings

