A One Hot Vector
"''Meaning is this illusive thing that were trying to capture''" (Richard Socher in [https://www.youtube.com/watch?v=xhHOL3TNyJs&index=2&list=PLcGUo322oqu9n4i0X3cRJgKyVy7OkDdoi CS224D Lecture 2 - 31st Mar 2016 (Youtube)])
+
A '''one-hot-vector''' is a word-representation technique that uses ''distributional similarity'' to find patterns in the phrases that are used in company of a word. One-hot-vectors are basically a big matrix of 0's, with as many rows and columns as there are unique words. A text with 500 unique words, will be represented by a matrix of 500x500. With this matrix as its central tool, a script will go through the sentences of a dataset and count how often a word appears next to another word.  
<br>
 
 
 
Word embeddings are used to represent words as inputs to machine learning. The words become vectors in a multi-dimensional space, where nearby vectors represent similar meanings. With word embeddings, you can compare words by (roughly) what they mean, not just exact string matches.
 
 
 
Successfully training word vectors requires starting from hundreds of gigabytes of input text. Fortunately, various machine-learning groups have already done this and provided pre-trained word embeddings that one can download. Two very well-known datasets of pre-trained English word embeddings are word2vec, pre-trained on Google News data, and [http://www.algolit.net/index.php/The_GloVe_Reader GloVe], pre-trained on the [http://www.algolit.net/index.php/Common_Crawl Common Crawl] of web pages.
 
 
 
The term has only recently entered the vocabulary of machine learning, with the expansion of the deep learning community. In computational linguistics the expression 'distributional semantic model' is sometimes preferred. Other terms include 'distributed representation', 'semantic vector space', or 'word space'.
 
 
 
Two popular examples of standalone implementations are the word2vec library (a single layered neural network) and the [http://www.algolit.net/index.php/The_GloVe_Reader GloVe] library (distributional semantic model).
 
 
 

Type: Algoliterary exploration
Technique: word-embeddings
Developed by: Algolit

A one-hot-vector is a word-representation technique that uses distributional similarity to find patterns in the phrases that appear in the company of a word. One-hot-vectors are basically a big matrix of 0's, with as many rows and columns as there are unique words: a text with 500 unique words is represented by a 500x500 matrix. With this matrix as its central tool, a script goes through the sentences of a dataset and counts how often each word appears next to every other word.
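
To make the sizing concrete, here is a minimal Python sketch (our illustration, not one of the Algolit scripts; the filename is hypothetical). The number of unique words sets both dimensions of the matrix:

  text = open('dataset.txt').read()         # hypothetical input file
  vocabulary = sorted(set(text.split()))    # one entry per unique word
  n = len(vocabulary)                       # e.g. 500 unique words ...
  matrix = [[0] * n for _ in range(n)]      # ... give a 500x500 matrix of 0's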

Recipe for a one hot vector

If this is our example sentence ...

"The algoliterary explorers discovered a multidimensional landscape made of words disguised as numbers."

... these are the 14 tokens we work with (13 unique words plus the final period) ...

a
algoliterary
as
discovered
disguised
explorers
landscape
made
multidimensional
numbers
of
the
words
.
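
One way to reproduce this token list in Python (a sketch under our own assumptions: everything is lowercased, and the final period is split off as a token of its own and listed last):

  sentence = ("The algoliterary explorers discovered a multidimensional "
              "landscape made of words disguised as numbers.")
  words = sentence.lower().rstrip('.').split()   # lowercase, drop the final period
  vocabulary = sorted(set(words)) + ['.']        # 13 unique words plus the period: 14 tokens
  print(vocabulary)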

... a single vector in a one-hot-vector looks like this ...

[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 

... and the full matrix, with one fourteen-dimensional vector per token, looks like this ...

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0]  a
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  algoliterary
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  as
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  discovered
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  disguised
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  explorers
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  landscape
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  made
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  multidimensional
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  numbers
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  of
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  the
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  words
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]] .

... with one 0 for each unique token in the vocabulary, and a row for each unique token.
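
Continuing the sketch above, this empty matrix is simply fourteen rows of fourteen 0's, one labelled row per token in the vocabulary:

  n = len(vocabulary)                   # 14
  matrix = [[0] * n for _ in range(n)]  # 14 x 14 matrix of 0's
  for row, token in zip(matrix, vocabulary):
      print(row, token)                 # one labelled row per token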

The next step is to count how often each token appears next to another ...

"The algoliterary explorers discovered a multidimensional landscape made of words disguised as numbers."
[[0 0 0 1 0 0 0 0 1 0 0 0 0 0]  a
 [0 0 0 0 0 1 0 0 0 0 0 1 0 0]  algoliterary
 [0 0 0 0 1 0 0 0 0 1 0 0 0 0]  as
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0]  discovered
 [0 0 1 0 0 0 0 0 0 0 0 0 1 0]  disguised
 [0 1 0 1 0 0 0 0 0 0 0 0 0 0]  explorers
 [0 0 0 0 0 0 0 1 1 0 0 0 0 0]  landscape
 [0 0 0 0 0 0 1 0 0 0 1 0 0 0]  made
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0]  multidimensional
 [0 0 1 0 0 0 0 0 0 0 0 0 0 1]  numbers
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0]  of
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]  the
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0]  words
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]] .
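
A sketch of this counting step, continuing from the vocabulary and matrix above. It assumes a context window of one token to the left and one to the right; the window size is our assumption, and the Algolit scripts may define the neighbourhood differently:

  tokens = sentence.lower().rstrip('.').split() + ['.']     # the sentence as 14 tokens
  index = {token: i for i, token in enumerate(vocabulary)}  # token -> row/column position
  for pos, token in enumerate(tokens):
      for neighbour in (pos - 1, pos + 1):                  # one step left, one step right
          if 0 <= neighbour < len(tokens):
              matrix[index[token]][index[tokens[neighbour]]] += 1
  for row, token in zip(matrix, vocabulary):
      print(row, token)

Because every neighbouring pair is counted in both directions, the resulting matrix is symmetric, as in the rows above.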

Algolit's one-hot-vector scripts

Two one-hot-vector scripts were created during one of the Algolit sessions, each building the same matrix in a different way. To download and run them, use the following links: one-hot-vector_gijs.py (https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/one-hot-vector/one-hot-vector_gijs.py) and one-hot-vector_hans.py (https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/one-hot-vector/one-hot-vector_hans.py).

Note that

"Words are represented once in a vector. So words with multiple meanings, like "bank", are more difficult to represent. There is research to multivectors for one word, so that it does not end up in the middle." (Richard Socher, CS224d, Deep Learning for Natural Language Processing at Stanford University (2016).)

For more notes on this lecture visit http://pad.constantvzw.org/public_pad/neural_networks_3