A One Hot Vector

Type: Algoliterary exploration
Technique: word-embeddings
Developed by: Algolit

A one-hot-vector is a word-representation technique that uses distributional similarity to find patterns in the phrases used in the company of a word. One-hot-vectors are basically a big matrix of zeros, with as many rows and columns as there are unique words. A text with 500 unique words will be represented by a 500x500 matrix. With this matrix as its central tool, a script goes through the sentences of a dataset and counts how often a word appears next to another word.
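
Taken for a single word, a one-hot vector has one position for each word in the vocabulary, with a 1 marking the place of that word and a 0 everywhere else. A minimal Python sketch of this idea, with a made-up three-word vocabulary used purely as an illustration:

# A minimal sketch: the one-hot vector of a single word.
# The vocabulary and the chosen word are made up for illustration only.
vocabulary = ["landscape", "numbers", "words"]

def one_hot(word, vocabulary):
    # A 1 at the position of the word, a 0 for every other word.
    return [1 if entry == word else 0 for entry in vocabulary]

print(one_hot("numbers", vocabulary))   # [0, 1, 0]

In this kind of representation none of the words resemble each other: every word is simply a 1 in its own place.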

Recipe for a one hot vector

If this is our example sentence ...

"The algoliterary explorers discovered a multidimensional landscape made of words disguised as numbers."

... these are the 14 words we work with ...

a
algoliterary
as
discovered
disguised
explorers
landscape
made
multidimensional
numbers
of
the
words
.
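
One way to arrive at this word list is sketched below in Python: lowercase the sentence, split the final full stop off as a token of its own, and keep each unique token once. This snippet is only an illustration, not one of the Algolit scripts linked further down.

# Sketch: derive the 14 unique tokens from the example sentence.
sentence = ("The algoliterary explorers discovered a multidimensional "
            "landscape made of words disguised as numbers.")

# Lowercase, split the full stop off as its own token, cut on spaces.
tokens = sentence.lower().replace(".", " .").split()

# Keep every token once, alphabetically, with the full stop sorted to the end
# so that the order matches the list above.
vocabulary = sorted(set(tokens), key=lambda token: (token == ".", token))

print(len(vocabulary))   # 14
print(vocabulary)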

... a single vector in a one-hot-vector looks like this ...

[0 0 0 0 0 0 0 0 0 0 0 0 0 0] 

... and a full fourteen-dimensional matrix like this ...

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0]  a
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  algoliterary
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  as
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  discovered
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  disguised
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  explorers
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  landscape
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  made
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  multidimensional
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  numbers
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  of
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  the
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]  words
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]] .

... with one 0 for each unique word in a vocabulary, and a row for each unique word.
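
In code, that empty matrix is simply one all-zero row per word in the vocabulary. A small sketch, with the vocabulary written out so that it runs on its own:

# Sketch: the all-zero matrix, one row and one column per unique word.
vocabulary = ["a", "algoliterary", "as", "discovered", "disguised", "explorers",
              "landscape", "made", "multidimensional", "numbers", "of", "the",
              "words", "."]

matrix = [[0] * len(vocabulary) for _ in vocabulary]

# Each row starts out identical to the single vector shown above.
for row, word in zip(matrix, vocabulary):
    print(row, word)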

The next step is to count how often each word appears next to another ...

"The algoliterary explorers discovered a multidimensional landscape made of words disguised as numbers."
[[0 0 0 1 0 0 0 0 1 0 0 0 0 0]  a
 [0 0 0 0 0 1 0 0 0 0 0 1 0 0]  algoliterary
 [0 0 0 0 1 0 0 0 0 1 0 0 0 0]  as
 [1 0 0 0 0 1 0 0 0 0 0 0 0 0]  discovered
 [0 0 1 0 0 0 0 0 0 0 0 0 1 0]  disguised
 [0 1 0 1 0 0 0 0 0 0 0 0 0 0]  explorers
 [0 0 0 0 0 0 0 1 1 0 0 0 0 0]  landscape
 [0 0 0 0 0 0 1 0 0 0 1 0 0 0]  made
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0]  multidimensional
 [0 0 1 0 0 0 0 0 0 0 0 0 0 1]  numbers
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0]  of
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]  the
 [0 0 0 0 1 0 0 0 0 0 1 0 0 0]  words
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]] .
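
The two Algolit scripts linked below do this counting in their own ways; the snippet here is only a minimal, self-contained sketch of the step, assuming a window of one word to the left and one to the right.

# Sketch: count how often each word appears directly next to another word.
# An illustration of the counting step, not one of the Algolit scripts.
sentence = ("The algoliterary explorers discovered a multidimensional "
            "landscape made of words disguised as numbers.")

tokens = sentence.lower().replace(".", " .").split()
vocabulary = sorted(set(tokens), key=lambda token: (token == ".", token))
position = {word: i for i, word in enumerate(vocabulary)}

# Start from the all-zero matrix and add 1 for every pair of neighbouring tokens.
matrix = [[0] * len(vocabulary) for _ in vocabulary]
for left, right in zip(tokens, tokens[1:]):
    matrix[position[left]][position[right]] += 1
    matrix[position[right]][position[left]] += 1

for row, word in zip(matrix, vocabulary):
    print(row, word)

For this example sentence every count is either 0 or 1, which gives the same counts as the matrix above; on a larger dataset the same cells simply keep adding up.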

Algolit's one-hot-vector scripts

Two one-hot-vector scripts were created during one of the Algolit sessions; both create the same matrix, but in different ways. To download and run them, use the following links: one-hot-vector_gijs.py (https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/one-hot-vector/one-hot-vector_gijs.py) & one-hot-vector_hans.py (https://gitlab.constantvzw.org/algolit/algolit/blob/master/algoliterary_encounter/one-hot-vector/one-hot-vector_hans.py)

Note that

"Words are represented once in a vector. So words with multiple meanings, like "bank", are more difficult to represent. There is research to multivectors for one word, so that it does not end up in the middle." (Richard Socher, CS224d, Deep Learning for Natural Language Processing at Stanford University (2016).)

For more notes on this lecture visit http://pad.constantvzw.org/public_pad/neural_networks_3