Word embedding Projector
From Algolit
The projector from the Google Tensorflow package allows one to visualize a multi-dimensional space by projecting it into a 2- or 3-dimensional space. This lets us peek into the word space formed by the word embeddings of the datasets we use (in this example, the glove.42B dataset).
A word embedding associates each word with coordinates in a multi-dimensional space. The Glove.42B dataset uses 300 dimensions, so it contains a matrix or tensor of 1.9 million (the vocabulary size) by 300 (the coordinates of each word). For the computer, differences between words are thereby expressed as differences in 300 variables.
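A minimal sketch of what this association looks like in practice: GloVe is distributed as plain text, one word per line followed by its coordinates. The example line below is invented and truncated to 4 dimensions instead of 300, purely for readability.

```python
import numpy as np

# One line of a GloVe-style text file: the word, then its coordinates.
# (Hypothetical values, 4 dimensions instead of the real 300.)
line = "apple 0.52 -0.83 0.17 0.34"

parts = line.split()
word = parts[0]
vector = np.array([float(x) for x in parts[1:]])

# The word is now a point in a multi-dimensional space;
# the full glove.42B file yields 1.9 million such points.
print(word, vector.shape)
```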
Such high-dimensional spaces are impossible for a human to perceive visually. Mathematical techniques exist to make specific projections of such a space into lower-dimensional spaces (analogous to the use of perspective to visualize a 3-dimensional space on a 2-dimensional plane).
The Tensorflow projector uses Principal Component Analysis (PCA) to create a projection into the 2 or 3 dimensions in which the greatest variance of the dataset can be expressed. PCA does not change the word embeddings; it only changes the point of view, rotating the axes of the space so that the first dimensions show the largest variance (i.e. the largest differences between the words). These first 2 or 3 dimensions are then shown on the screen. The left panel indicates how much of the variance is expressed in this projection.
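The rotation PCA performs can be sketched with a singular value decomposition: centre the embedding matrix, rotate it onto the directions of greatest variance, and keep the first two. The toy matrix below stands in for the real 1.9 million x 300 one; this is an illustration of the technique, not the projector's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "embedding matrix": 100 words in 10 dimensions.
X = rng.normal(size=(100, 10))

# PCA: centre the data, then find the axes of greatest variance
# via the singular value decomposition.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project every word onto the first 2 principal components.
projection = Xc @ Vt[:2].T          # shape (100, 2)

# Fraction of the total variance kept by this 2-D view:
# the kind of figure the projector reports in its left panel.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
```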
The Tensorflow projector also provides a t-SNE projection. t-distributed stochastic neighbor embedding (t-SNE) does not show the original word space; it shows a probability distribution in 2 or 3 dimensions of words being similar or not. Words that are similar, i.e. near each other in the word embedding space, will be shown near each other in the projection, while words that are dissimilar are shown far apart. In other words, the t-SNE projection tries to preserve, in the 2- or 3-dimensional projection, the relative distances between the words in the 300-dimensional word embedding space.
Both projections give us a peek into what language means when it is perceived by the computer through algorithms that create word embeddings (like Glove or word2vec). (Dis)similarity between words is expressed by the distance between them. Associations between words, present in the original texts as co-occurrences, will be reflected in the distances in the word embedding space. They can be explored visually through these projections, or mathematically by calculating the distances in the word embedding space.
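The mathematical exploration mentioned above usually means cosine distance (one of the measures the projector offers). A small sketch with invented 3-dimensional vectors, real GloVe vectors have 300 dimensions:

```python
import numpy as np

def cosine_distance(a, b):
    """Distance between two word vectors: 0 means identical direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: words occurring in similar contexts
# end up with nearby vectors.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

# 'cat' is closer to 'dog' than to 'car'.
assert cosine_distance(cat, dog) < cosine_distance(cat, car)
```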
The projection does not show the whole dataset, but a selection of 10,000 words (or fewer). With the mouse you can rotate the view and zoom in or out. Hovering over a point reveals its word. Clicking on a word shows the word with its nearest neighbours in the word embedding space. The number of neighbours shown can be set in the right panel, as well as the distance measure used for this calculation. Clicking on 'Isolate the ... words' shows only the word with its nearest neighbours. It is also possible to search for a specific word.
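The nearest-neighbour lookup the projector performs can be sketched as a sort by cosine distance over the whole vocabulary. The vocabulary and 2-dimensional vectors below are invented for the example:

```python
import numpy as np

# Hypothetical vocabulary with 2-D vectors (GloVe uses 300 dimensions).
vocab = ["cat", "dog", "car", "truck", "apple"]
vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.5, 0.5],
])

def nearest_neighbours(word, k=2):
    """Return the k words closest to `word` by cosine distance."""
    i = vocab.index(word)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - normed @ normed[i]
    order = np.argsort(distances)
    # Skip the word itself (distance 0) and keep the k nearest.
    return [vocab[j] for j in order if j != i][:k]

print(nearest_neighbours("cat"))  # ['dog', 'apple']
```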
Further visual explorations of a word embedding dataset can be done by loading a different selection of words to be shown. Similar explorations can be done by loading different word embedding datasets.