About Word embeddings

Type: Algoliterary exploration
Technique: word embeddings
Developed by: Algolit

"Meaning is this illusive thing that were trying to capture" (Richard Socher in CS224D Lecture 2 - 31st Mar 2016 (Youtube))

Word embeddings are used to represent words as inputs to machine learning. The words become vectors in a multi-dimensional space, where nearby vectors represent similar meanings. With word embeddings, you can compare words by (roughly) what they mean, not just by exact letter matches.
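
To make the vector picture concrete, here is a minimal sketch (not part of the original page) using numpy and three made-up, hand-written "word vectors"; real embeddings are learned from text and typically have hundreds of dimensions. Nearby vectors get a cosine similarity close to 1.

 import numpy as np
 
 # Hypothetical, hand-made 3-dimensional "word vectors" for illustration only;
 # real embeddings are learned from large text corpora.
 vectors = {
     "king":  np.array([0.8, 0.6, 0.1]),
     "queen": np.array([0.7, 0.7, 0.1]),
     "apple": np.array([0.1, 0.2, 0.9]),
 }
 
 def cosine_similarity(a, b):
     # Vectors pointing in nearly the same direction score close to 1.0.
     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
 print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: similar meaning
 print(cosine_similarity(vectors["king"], vectors["apple"]))  # lower: unrelated words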

A common assumption in this approach is that words which co-occur in each other's neighborhood in a text are related or similar in meaning. While bag-of-words looks at word frequency across the whole text, these approaches count how often words appear in a small interval around each word. Several algorithms have been developed to transform such local co-occurrence counts into word embeddings, like word2vec (a single-layer neural network) and GloVe (a distributional semantic model).
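
As a rough illustration of that difference (a sketch, not taken from the original page), the following counts plain word frequencies alongside windowed co-occurrence counts; algorithms like word2vec and GloVe start from statistics of the second kind.

 from collections import Counter
 
 text = "the cat sat on the mat while the dog sat on the rug".split()
 
 # Bag-of-words: frequency of each word over the whole text.
 bag_of_words = Counter(text)
 
 # Window-based co-occurrence: count which words appear within a small
 # interval (here 2 words to each side) around every occurrence of a word.
 def cooccurrence_counts(tokens, window=2):
     counts = {}
     for i, word in enumerate(tokens):
         neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
         counts.setdefault(word, Counter()).update(neighbours)
     return counts
 
 print(bag_of_words["sat"])               # how often 'sat' occurs overall
 print(cooccurrence_counts(text)["sat"])  # which words occur around 'sat'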

Successfully training word vectors requires starting from hundreds of gigabytes of input text. Fortunately, various machine-learning groups have already done this and provided pre-trained word embeddings that one can download. Two very well-known datasets of pre-trained English word embeddings are word2vec, pre-trained on Google News data, and GloVe, pre-trained on the Common Crawl of web pages.
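
As a sketch of how such pre-trained vectors can be used, assuming the gensim library and its downloadable 'glove-wiki-gigaword-100' model (a GloVe model trained on Wikipedia and Gigaword, not the Common Crawl set mentioned above, but it works the same way):

 import gensim.downloader as api
 
 # Downloads a pre-trained GloVe model on first use (roughly 130 MB).
 model = api.load("glove-wiki-gigaword-100")
 
 print(model.most_similar("poetry", topn=5))      # words with nearby vectors
 print(model.similarity("poetry", "literature"))  # cosine similarity score
 print(model.similarity("poetry", "concrete"))    # a lower score for an unrelated word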

The term 'word embedding' has only recently entered the vocabulary of machine learning, with the expansion of the deep learning community. In computational linguistics, the expression 'distributional semantic model' is sometimes preferred. Other terms include 'distributed representation', 'semantic vector space', or 'word space'.