Word2vec basic.py

Figure: Graph generated by the word2vec_basic.py example script, trained on the book "Mankind in the Making" by H.G. Wells.

This is an annotated version of the basic word2vec script. The code is based on the Word2Vec tutorial provided by Tensorflow: https://www.tensorflow.org/tutorials/word2vec.

History

Word2vec is a neural network technique that learns continuous vector representations of words from large amounts of text, introduced by Tomas Mikolov and colleagues at Google in 2013.

word2vec_basic_algolit.py

The structure of the annotated word2vec script is as follows:

  • Step 1: Download data. (optional)
  • Algolit step 1: read data from plain text file
  • Step 2: Create a dictionary and replace rare words with UNK token.
  • Step 3: Function to generate a training batch for the skip-gram model (sketched after this list).
  • Step 4: Build and train a skip-gram model.
  • Step 5: Begin training.
    • Algolit extension: write training log to a text document
  • Step 6: Visualize the embeddings.
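
To make Step 3 concrete, here is a minimal, self-contained sketch of skip-gram batch generation in the spirit of the Tensorflow example. The function name generate_batch matches the original script; the simplified list-based interface is an assumption of this sketch.

import collections
import random

def generate_batch(data, batch_size, num_skips, skip_window):
    # data: the list of word ids produced in Step 2
    # num_skips: how many (input, label) pairs to draw per window
    # skip_window: how many words to consider left and right of the centre
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1  # [ skip_window | centre | skip_window ]
    buffer = collections.deque(data[:span], maxlen=span)
    data_index = span % len(data)
    inputs, labels = [], []
    while len(inputs) < batch_size:
        # sample num_skips context positions, never the centre itself
        positions = random.sample(
            [i for i in range(span) if i != skip_window], num_skips)
        for pos in positions:
            inputs.append(buffer[skip_window])  # centre word is the input
            labels.append(buffer[pos])          # a surrounding word is the label
        buffer.append(data[data_index])         # slide the window one word further
        data_index = (data_index + 1) % len(data)
    return inputs, labels

# For example, with a window of one word on each side:
# inputs, labels = generate_batch(word_ids, batch_size=8, num_skips=2, skip_window=1)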


Source

The script provides an option to download a dataset from one of the following sources:

  • original source in the script: http://mattmahoney.net/dc/text8.zip
  • Mankind in the Making by H. G. Wells: http://www.gutenberg.org/ebooks/7058
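
In the Algolit variant the download step is optional: step 1 can be replaced by reading a local plain text file, such as a Project Gutenberg copy of the book above. A minimal sketch of such a reader (the filename is an assumption):

def read_data(filename):
    # Read a plain text file and return its contents as a list of words.
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().split()

vocabulary = read_data('mankind_in_the_making.txt')  # hypothetical filename
print('Data size:', len(vocabulary))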

Dictionary

A snippet from the dictionary.txt file:

0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'is', 8: 'that', 9: 'it', 10: 'be', 11: 'for', 12: 'as', 13: 'are', 14: 'with', 15: 'not', 16: 'this', 17: 'or', 18: 'will', 19: 'at', 20: 'we', 21: 'but', 22: 'by', 23: 'may', 24: 'his', 25: 'all', 26: 'an', 27: 'these', 28: 'they', 29: 'have', 30: 'he', 31: 'from', 32: 'our', 33: 'has', 34: 'The', 35: 'no', 36: 'more', 37: 'which', 38: 'one', 39: 'there', 40: 'would', 41: 'its', 42: 'so', 43: 'their', 44: 'than', 45: 'children', 46: 'very', 47: 'things', 48: 'any', 49: 'upon', 50: 'i', 51: 'can', 52: 'if', 53: 'do', 54: 'who', 55: 'child', 56: 'new', 57: 'life', 58: 'It', 59: 'should', 60: 'them', 61: 'only', 62: 'world', 63: 'must', 64: 'on', 65: 'such', 66: 'great', 67: 'people', 68: 'man', 69: 'into', 70: 'most', 71: 'out', 72: 'little', 73: 'what', 74: 'was', 75: 'every', 76: 'some', 77: 'much', 78: 'certain', 79: 'And', 80: 'about', 81: 'men', 82: 'english', 83: 'far', 84: 'present', 85: 'first', 86: 'many', 87: 'been', 88: 'thing', 89: 'those', 90: 'home', 91: 'good', 92: 'But', 93: 'quite', 94: 'way', 95: 'might', 96: 'other', 97: 'us', 98: 'general', 99: 'They', 100: 'social',
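
The mapping above is the reversed dictionary produced in Step 2: ids are assigned by descending word frequency, with id 0 reserved for the UNK token that replaces all rare words (note that the tokenisation is case-sensitive, so 'the' and 'The' get separate ids). A sketch of that step, close to the Tensorflow example:

import collections

def build_dataset(words, vocabulary_size):
    # Keep the (vocabulary_size - 1) most common words; every rarer
    # word is replaced by the UNK token, which always gets id 0.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]  # 0 = UNK
    count[0][1] = data.count(0)  # record how many words became UNK
    reversed_dictionary = {i: word for word, i in dictionary.items()}
    return data, count, dictionary, reversed_dictionary

A file like the dictionary.txt snippet above would contain this reversed dictionary.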

Logs
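
The Algolit extension listed under Step 5 writes the training log to a text document. A minimal sketch of such a helper, assuming a logbook.txt filename and free-form messages:

def log(message, filename='logbook.txt'):
    # Append one line of the training log to a plain text document,
    # mirroring what the original script only prints to the console.
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(message + '\n')

# hypothetical usage during training:
# log('Average loss at step %d: %f' % (step, average_loss))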