Many many words
From Algolit
To put the size of the datasets we used into perspective, we counted the pages in this library. A small script read the whole catalogue of the Biblio de Saint-Gilles and counted the pages. The catalogue lists 43.673 items, of which 42.759 are printed text. The catalogue's search history reads: Search history = (Simple search: term * in All fields for all document types) And Document type = (printed text) - 42759 result(s)
For 28.163 of these books the number of pages was indicated and could be counted. Our small script did a nightly reading of the library catalogue. This gave a count of 6.409.431 pages for 28.163 books.
A book page generally contains between 200 and 600 words, with an estimated average of 450 words (Arial font size 12, single line spacing - source: https://wordcounter.net/words-per-page). This gives an estimate of 2.884.243.950 or approx. 2.9 billion words for these 6.409.431 pages in 28.163 books. On average this gives about 102.400 words or 228 pages per book. Extrapolated to the whole set of 42.759 books in this library, assuming the books without a page count have the same average length, this gives approximately 10 million pages and 4.4 milliard (billion) words.
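The estimate above can be retraced with a few lines of Python. This is only a sketch of the arithmetic, not the script that read the catalogue; the variable names are made up for illustration, and the 450 words per page is the assumed average quoted above.

```python
# Figures taken from the text; WORDS_PER_PAGE is an assumed average.
PAGES_COUNTED = 6_409_431   # pages in the books that list a page count
BOOKS_COUNTED = 28_163      # books that list a page count
BOOKS_TOTAL = 42_759        # all printed-text items in the catalogue
WORDS_PER_PAGE = 450

# Words and averages for the books that could be counted.
words_counted = PAGES_COUNTED * WORDS_PER_PAGE
pages_per_book = PAGES_COUNTED / BOOKS_COUNTED
words_per_book = words_counted / BOOKS_COUNTED

# Extrapolation: assume the uncounted books have the same average length.
pages_total = pages_per_book * BOOKS_TOTAL
words_total = pages_total * WORDS_PER_PAGE

print(f"words in counted books: {words_counted:,}")
print(f"average per book: {pages_per_book:.0f} pages, {words_per_book:,.0f} words")
print(f"whole library: {pages_total / 1e6:.1f} million pages, "
      f"{words_total / 1e9:.1f} billion words")
```

Changing WORDS_PER_PAGE to 200 or 600 shows how far the total moves across the range of page densities mentioned above.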
Many many words of GloVe
We mainly used the GloVe pretrained word-embedding datasets. These word embeddings are based on the Common Crawl text data. The large set was trained on 840B tokens, or words used in the texts read, which at 450 words per page corresponds to approx. 1.9 billion pages. The smaller set was trained on 42B tokens, or approx. 90 million pages. In other words, to learn the word embeddings in the glove.42B-dataset, the computer read about 9 times the amount of text in the Biblio de Saint-Gilles. For the larger glove.840B-dataset, which is 20 times bigger, the computer read about 190 times the Biblio de Saint-Gilles. Computers read fast but are slow learners.
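The comparison between the GloVe corpora and the library is a pair of divisions, sketched below. It assumes, as the text does, that one token counts as one word, that a page holds 450 words, and that the library holds roughly 4.4 billion words; the function names are hypothetical.

```python
WORDS_PER_PAGE = 450    # assumed average, as in the library estimate
LIBRARY_WORDS = 4.4e9   # estimated words in the Biblio de Saint-Gilles

# Training corpus sizes in tokens, as given in the dataset names.
GLOVE_TOKENS = {"glove.42B": 42e9, "glove.840B": 840e9}


def pages_equivalent(tokens, words_per_page=WORDS_PER_PAGE):
    """Number of book pages that would hold this many tokens."""
    return tokens / words_per_page


def libraries_equivalent(tokens, library_words=LIBRARY_WORDS):
    """How many times the library's text fits into the corpus."""
    return tokens / library_words


for name, tokens in GLOVE_TOKENS.items():
    print(f"{name}: {pages_equivalent(tokens) / 1e6:,.0f} million pages, "
          f"{libraries_equivalent(tokens):.1f} times the library")
```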
The GloVe training on the smaller crawl resulted in a vocabulary of 1.9 million distinct words, each with 300 associated values. The larger crawl resulted in a vocabulary of 2.2 million words. Printing one word with all 300 values on one page would result in 1.9 or 2.2 million pages, or about 20% of the Biblio de Saint-Gilles. Even if we went for small print and put 2 words with their values on one page, it would still take about a million pages, or 10% of the library. Printing all 1.9 million words, with each word on a line of 4 mm height, would result in a paper roll of 7600 m.
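The printing estimates can be checked in the same way. The sketch below uses the rounded figure of 10 million pages for the library and hypothetical helper names; it only restates the arithmetic of the paragraph above.

```python
import math

VOCAB_42B = 1_900_000        # distinct words in the smaller GloVe set
VOCAB_840B = 2_200_000       # distinct words in the larger GloVe set
LIBRARY_PAGES = 10_000_000   # rounded estimate for the whole library
LINE_HEIGHT_MM = 4           # height of one printed line on the roll


def pages_needed(vocab, words_per_page):
    """Pages required when each page holds words_per_page entries."""
    return math.ceil(vocab / words_per_page)


def share_of_library(pages, library_pages=LIBRARY_PAGES):
    """Fraction of the library's page count that these pages represent."""
    return pages / library_pages


def roll_length_m(vocab, line_height_mm=LINE_HEIGHT_MM):
    """Length in metres of a roll with one word per line."""
    return vocab * line_height_mm / 1000


for vocab in (VOCAB_42B, VOCAB_840B):
    for per_page in (1, 2):
        pages = pages_needed(vocab, per_page)
        print(f"{vocab:,} words, {per_page} per page: {pages:,} pages "
              f"({share_of_library(pages):.0%} of the library)")

print(f"paper roll for {VOCAB_42B:,} words: {roll_length_m(VOCAB_42B):,.0f} m")
```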
We therefore kept the word space used by the computer virtual and decided to provide some alternative peeks into this language universe.