GloVe dataset

From Algolit

Revision as of 13:28, 25 October 2017 by Cristina (talk | contribs)


The GloVe Reader shows one of the pre-trained word datasets that are used for machine learning modelling, for example in We are a Sentiment Thermometer. GloVe is an algorithm that looks for co-occurrences of words in large text files. It then creates a semantic map of words, in which similar words come together as little islands. This mapping is packaged as a text file of 5 GB with 1,917,494 lines, each holding one word followed by 300 numbers.
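To make the file layout concrete, here is a minimal Python sketch of how one line could be read. It assumes the plain-text layout of the published GloVe files (a word, then its numbers, separated by single spaces). The function names and the tiny three-number sample are hypothetical and are not taken from the Reader's own script.

```python
import io


def parse_glove_line(line, dim=300):
    """Split one line into (word, vector).

    Assumes the layout: word, then `dim` numbers, separated by single spaces.
    Splitting from the right keeps a word intact even if it contains a space.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector


def read_glove(handle, dim=300):
    """Yield (word, vector) pairs from an open text file, skipping blank lines."""
    for line in handle:
        if line.strip():
            yield parse_glove_line(line, dim)


# Hypothetical sample with 3 numbers per word instead of 300.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.4 -0.1\n")
vectors = dict(read_glove(sample, dim=3))
print(vectors["cat"])
```

Reading line by line, rather than loading everything at once, matters for a file of this size.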

The GloVe file is ordered by word frequency. For the purpose of the exhibition, we rearranged the words in alphabetical order. Even at 60 words per second, the Reader would need more than 8 hours to display the entire file, so we launch the script at the beginning of the day. The alphabetical order gives you a sense of where the Reader is in the file.
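The reordering and the time estimate can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the exhibition's actual script is not shown here, and sorting the full 5 GB file in memory this way would need a machine with enough RAM.

```python
def sort_alphabetically(lines):
    """Return the lines ordered by their first field, the word."""
    return sorted(lines, key=lambda line: line.split(" ", 1)[0])


def viewing_hours(n_words, words_per_second):
    """Hours needed to show every word once at a constant rate."""
    return n_words / words_per_second / 3600


# Hypothetical shortened lines, in frequency order as in the original file.
lines = ["the 0.1", "cat 0.5", "apple 0.2"]
print(sort_alphabetically(lines))

hours = viewing_hours(1_917_494, 60)
print(round(hours, 1))  # 8.9
```

At 60 words per second the full file takes a little under 9 hours, which is why the script has to start at the beginning of the day.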

GloVe was developed in 2014 by Jeffrey Pennington, Richard Socher and Christopher D. Manning, researchers at the Computer Science Department of Stanford University in California.

The dataset shown by the GloVe Reader is based on 75% of the existing web pages of the Internet. The content scrape was carried out by Common Crawl, an NGO based in California. The people of Common Crawl believe that anyone should be able to download the Internet.

Download GloVe datasets: https://nlp.stanford.edu/projects/glove/