Many many words

From Algolit

Revision as of 13:30, 25 October 2017 by Cristina (talk | contribs)

To get a sense of the size of the datasets used, we did a page count of a library, the Biblio de Saint-Gilles. A small script read the whole catalogue and counted the pages. The catalogue lists 43673 items, of which 42759 are printed text: Search history = (Simple search: term * in All fields for all document types) And Document type = (printed text) 42759 result(s)

For 28163 of these books the number of pages was indicated and could be counted. Our small script did a nightly reading of the library catalogue, which gave a count of 6409431 pages for 28163 books.
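The counting step can be sketched roughly as follows. This is a minimal illustration and not the script that was actually used: the record format (a free-text physical description such as "227 p.") and the names `pages_in` and `count_pages` are assumptions made for the example.

```python
import re

# Assumed format: a number followed by "p", as in "227 p." or "(312 p.)".
# The real catalogue's field layout is not documented here.
PAGE_PATTERN = re.compile(r"(\d+)\s*p\b", re.IGNORECASE)

def pages_in(description):
    """Return the page count found in a description string, or None."""
    match = PAGE_PATTERN.search(description or "")
    return int(match.group(1)) if match else None

def count_pages(descriptions):
    """Sum the page counts; return (total_pages, books_with_a_count)."""
    known = [n for n in map(pages_in, descriptions) if n is not None]
    return sum(known), len(known)

# Hypothetical sample records, two of which carry no page count.
sample = ["227 p.", "1 vol. (312 p.) : ill.", "", "non paginé"]
total, books = count_pages(sample)
print(total, "pages in", books, "books")
```

Records without a recognisable page count are skipped rather than guessed, which matches the fact that only part of the catalogue could be counted.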

A book page generally contains between 200 and 600 words, with an estimated average of 450 words (Arial font size 12, single line spacing; source: https://wordcounter.net/words-per-page). This gives an estimate of 2884243950, or approx. 2.9 billion, words for these 6409431 pages or 28163 books. On average this gives 102400 words or 227 pages per book. Extrapolated to the whole set of 42759 books in this library, this gives approximately 10 million pages and 4.4 billion words.
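The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures given in the text. The variable names are ours, and the extrapolation assumes that books without a page count are of average length.

```python
PAGES_COUNTED = 6_409_431   # pages counted in the catalogue
BOOKS_COUNTED = 28_163      # books for which a page count was available
BOOKS_TOTAL = 42_759        # printed-text items in the catalogue
WORDS_PER_PAGE = 450        # estimated average words per page

words_counted = PAGES_COUNTED * WORDS_PER_PAGE
pages_per_book = PAGES_COUNTED / BOOKS_COUNTED
words_per_book = words_counted / BOOKS_COUNTED

# Assumption: uncounted books have the same average length as counted ones.
pages_total = pages_per_book * BOOKS_TOTAL
words_total = pages_total * WORDS_PER_PAGE

print(f"words counted:   {words_counted:,}")
print(f"pages per book:  {pages_per_book:.1f}")
print(f"words per book:  {words_per_book:,.0f}")
print(f"library pages:   {pages_total / 1e6:.1f} million")
print(f"library words:   {words_total / 1e9:.1f} billion")
```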

The datasets from which the word embeddings are derived are the Common Crawl datasets. The large set has 840B tokens, or words used in the texts read, which corresponds to approx. 1.9 billion pages. The smaller set has 42B tokens, or approx. 90 million pages. In other words, to learn the word embeddings in the glove.42B-dataset, the computer read approx. 9 times the amount of text in the Biblio de Saint-Gilles. For the larger glove.840B-dataset, the amount of text was approx. 190 times that of the Biblio de Saint-Gilles. Computers read fast but are slow learners.
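The same comparison can be expressed as a short calculation. The sketch below treats one token as one word, as the text does, and takes the 450 words per page and the 4.4 billion words for the library from the estimates above. The function name `compare` is ours.

```python
WORDS_PER_PAGE = 450     # estimated average words per page
LIBRARY_WORDS = 4.4e9    # extrapolated estimate for the whole library

# Token counts of the two Common Crawl sets, as given in the text.
CRAWLS = {"glove.42B": 42e9, "glove.840B": 840e9}

def compare(tokens):
    """Return (equivalent pages, multiple of the library's text)."""
    return tokens / WORDS_PER_PAGE, tokens / LIBRARY_WORDS

for name, tokens in CRAWLS.items():
    pages, ratio = compare(tokens)
    print(f"{name}: {pages / 1e6:,.0f} million pages, "
          f"{ratio:.1f} times the library")
```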

The smaller crawl resulted in a vocabulary of 1.9 million distinct words, each associated with 300 values. The larger crawl resulted in a vocabulary of 2.2 million words. Printing one word with all 300 values on one page would result in 1.9 or 2.2 million pages, or about 20% of the Biblio de Saint-Gilles. Even if we went for small print and put 2 words with their values on one page, it would still take about a million pages, or 10% of the library. Printing all 1.9 million words, with each word on a line of 4 mm height, would result in a paper roll of 7600 m.
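A small sketch makes the printing arithmetic explicit. It uses the vocabulary sizes and the 4 mm line height from the text and the rounded figure of 10 million pages for the library. The helper names are ours.

```python
VOCAB_SMALL = 1_900_000     # distinct words from the 42B crawl
VOCAB_LARGE = 2_200_000     # distinct words from the 840B crawl
LIBRARY_PAGES = 10_000_000  # rounded estimate for the whole library
LINE_HEIGHT_MM = 4          # height of one printed line on the roll

def pages_needed(vocab, words_per_page=1):
    """Pages required to print every word with its values."""
    return -(-vocab // words_per_page)  # ceiling division

def roll_length_m(vocab, line_height_mm=LINE_HEIGHT_MM):
    """Length in metres of a roll with one word per line."""
    return vocab * line_height_mm / 1000

for vocab in (VOCAB_SMALL, VOCAB_LARGE):
    for per_page in (1, 2):
        pages = pages_needed(vocab, per_page)
        share = pages / LIBRARY_PAGES
        print(f"{vocab:,} words, {per_page} per page: "
              f"{pages:,} pages ({share:.0%} of the library)")

print(f"paper roll: {roll_length_m(VOCAB_SMALL):,.0f} m")
```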

We therefore kept the word space used by the computer virtual and provide some alternative peeks into this language universe.