Word2vec basic.py
From Algolit
Revision as of 11:19, 30 October 2017
Type: Algolit extension
Datasets: Tristes Tropiques
Technique: word embeddings
Developed by: a team of researchers led by Tomas Mikolov at Google, Claude Lévi-Strauss, Algolit
This is an annotated version of the basic word2vec script. The code is based on this Word2Vec tutorial provided by TensorFlow.
History
Word2vec is a group of related models used to generate vectors from words (also called word embeddings). Each model is a two-layer neural network. Word2vec was produced by a team of researchers led by Tomas Mikolov at Google.
word2vec_basic_algolit.py
The structure of the annotated word2vec script is the following:
- Step 1: Download data.
  - Algolit step 1: read data from plain text file
  - Algolit inspection: wordlist.txt
- Step 2: Create a dictionary and replace rare words with UNK token.
  - Algolit inspection: counted.txt
  - Algolit inspection: dictionary.txt
  - Algolit inspection: data.txt
  - Algolit inspection: disregarded.txt
  - Algolit adaption: reversed-input.txt
- Step 3: Function to generate a training batch for the skip-gram model
- Step 4: Build and train a skip-gram model.
  - Algolit inspection: big-random-matrix.txt
  - Algolit adaption: select your own set of test words
- Step 5: Begin training.
  - Algolit inspection: training-words.txt
  - Algolit inspection: training-window-words.txt
  - Algolit adaption: visualisation of the cosine similarity calculation updates
  - Algolit inspection: logfile.txt
- Step 6: Visualize the embeddings.
  - Algolit adaption: select 3 words to be included in the graph
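Step 3 in the outline above is the least self-explanatory one, so here is a minimal sketch of the idea behind a skip-gram training batch: every word is paired with its neighbours inside a small window. The function name `skipgram_pairs` and the parameter `skip_window` are chosen for this illustration. The batch function in the script is assumed to work along similar lines, but it may sample and order its pairs differently.

```python
def skipgram_pairs(data, skip_window=1):
    """Pair each index in data with its neighbours inside the window."""
    pairs = []
    for i, center in enumerate(data):
        # Clamp the window to the boundaries of the list.
        start = max(0, i - skip_window)
        stop = min(len(data), i + skip_window + 1)
        for j in range(start, stop):
            if j != i:
                # (center word, context word), both as index numbers
                pairs.append((center, data[j]))
    return pairs


# Hypothetical index numbers standing in for four consecutive words.
pairs = skipgram_pairs([10, 20, 30, 40])
print(pairs)
```

With a window of one word to the left and one to the right, each word inside the text appears twice as a center word, once for each neighbour.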
Source
The script word2vec_basic.py provides an option to download a dataset from Matt Mahoney's home page. It turns out to be a plain text document, without any punctuation or line breaks.
For the tests that we wanted to do with the script, we decided to work with a piece of academic literature instead: Tristes Tropiques, written by Claude Lévi-Strauss and translated by John Russell (https://archive.org/details/tristestropiques000177mbp).
Before we could use Lévi-Strauss' text as training material, we needed to remove all the punctuation from the file. To do this, we wrote a small Python script, text-punctuation-clean-up.py. The script saves a *stripped* version of the original book under another filename.
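The clean-up script itself is not reproduced on this page. As a rough sketch of what such a step could look like, the function below removes ASCII punctuation and collapses whitespace. The name `strip_punctuation` is an assumption made for this example and is not taken from text-punctuation-clean-up.py.

```python
import string


def strip_punctuation(text):
    """Return text without ASCII punctuation, with whitespace collapsed."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    # split() + join() also removes line breaks and repeated spaces.
    return " ".join(text.translate(table).split())


sample = "By Claude Lévi-Strauss. Translated by John Russell!"
stripped = strip_punctuation(sample)
print(stripped)
```

Note that deleting the hyphen glues the two halves of a hyphenated word together, which would be consistent with tokens such as 'levistrauss' in the word list. To process a whole book, the same function could be applied to the contents of the input file and the result written to a second file.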
The book contains 153,003 words in total, of which 19,869 are unique.
wordlist.txt
The continuous text is split into a list of words, exported as wordlist.txt.
['xt', '1250', 'By', 'Claude', 'levistrauss', 'Translated', 'by', 'john', 'r', 'ussell', 'Illustrated', 'with', '48', 'pages', 'of', 'photographs', 'and', '48', 'line', 'drawings', 'Have', 'sought', 'a', 'human', 'society', 'reduced', 'To', 'its', 'most', 'basic', 'expression', 'His', 'search', 'has', 'taken', 'claude', 'levi', 'Strauss', 'eminent', 'french', 'anthropologist', 'And', 'one', 'of', 'the', 'founders', 'of', 'structural', 'Anthropology', 'to', 'the', 'far', 'corners', 'of', 'the', 'Earth', 'not', 'as', 'a', 'superficial', 'sightseer', 'but', 'As', 'a', 'close', 'student', 'of', 'man', 'and', 'the', 'varied', 'Cultures', 'he', 'has', 'erected', 'around', 'himself', 'While', 'a', 'professor', 'at', 'sao', 'paolo', 'univer', 'Sity', 'in', 'brazil' ... ]
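A minimal sketch of this step, assuming the text is split on whitespace only. The helper name `to_wordlist` is made up for this example. It also shows how the total and unique word counts can be derived, and that capitalised and lowercase forms count as different words, as the 'By' and 'by' entries above suggest.

```python
def to_wordlist(text):
    """Split a continuous text into a list of words on whitespace."""
    return text.split()


text = "By Claude levistrauss Translated by john r ussell"
wordlist = to_wordlist(text)

total = len(wordlist)                                # all words
unique = len(set(wordlist))                          # 'By' and 'by' differ
unique_lower = len({w.lower() for w in wordlist})    # case folded

print(total, unique, unique_lower)
```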
counted.txt
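The list of words is converted into a list with the structure [(word, count)], ordered from the most to the least frequent word, exported as counted.txt. The first entry, UNK, holds the combined count of all words that fall outside the vocabulary.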
[['UNK', 18767], ('the', 10108), ('of', 5790), ('and', 4229), ('to', 3895), ('a', 3407), ('in', 3092), ('that', 1633), ('was', 1380), ('it', 1367), ('as', 1271), ('with', 1206), ('for', 1196), ('which', 1158), ('had', 1129), ('is', 1119), ('on', 1015), ('i', 1014), ('or', 945), ('they', 905), ('their', 886), ('by', 876), ('were', 868), ('one', 800), ('at', 794), ('from', 764), ('The', 762), ('be', 731), ('we', 726), ('he', 678), ('not', 668), ('his', 646), ('an', 596), ('this', 584), ('but', 576), ('have', 558), ('are', 555), ('all', 547), ('them', 509), ('its', 454), ('our', 452), ('would', 449), ('s', 445), ('so', 440), ('been', 396), ('my', 394), ('these', 386), ('who', 375), ('there', 361), ('And', 348), ('two', 346), ('no', 341), ('into', 336), ('up', 336), ('more', 335), ('when', 335), ('Of', 324), ('has', 296), ('if', 291), ('other', 289), ('out', 287), ('me', 282), ('only', 274), ('us', 272), ('could', 262), ('some', 250), ('To', 243), ('time', 232), ('can', 232), ('In', 229), ('made', 223), ('die', 222), ('what', 222), ('those', 221), ('than', 214), ('men', 209), ('where', 208), ('will', 202), ('first', 201), ('him', 198), ('A', 192), ('between', 191), ('each', 189), ('any', 185), ('own', 183), ('another', 182), ('way', 178) ... ]
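The counting can be sketched with `collections.Counter` from the standard library. The function name `count_words` and the tiny vocabulary size are assumptions for illustration. The mix of one list and many tuples mirrors the structure of the excerpt above.

```python
import collections


def count_words(words, vocabulary_size):
    """Return [['UNK', n], (word, count), ...] for the most common words."""
    # Keep the (vocabulary_size - 1) most frequent words; slot 0 is for UNK.
    common = collections.Counter(words).most_common(vocabulary_size - 1)
    # Every occurrence not covered by a kept word is counted as UNK.
    unk_count = len(words) - sum(n for _, n in common)
    return [["UNK", unk_count]] + common


words = "the poles met and the poles were tied to the central pole".split()
counted = count_words(words, vocabulary_size=3)
print(counted)
```

In this toy sentence only 'the' and 'poles' fit in the vocabulary, so the remaining seven words end up in the UNK count.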
dictionary.txt
The reversed dictionary assigns an index number to each of the 5000 (= vocabulary size) most common words, with the index as key and the word as value, exported as dictionary.txt.
{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'that', 8: 'was', 9: 'it', 10: 'as', 11: 'with', 12: 'for', 13: 'which', 14: 'had', 15: 'is', 16: 'on', 17: 'i', 18: 'or', 19: 'they', 20: 'their', 21: 'by', 22: 'were', 23: 'one', 24: 'at', 25: 'from', 26: 'The', 27: 'be', 28: 'we', 29: 'he', 30: 'not', 31: 'his', 32: 'an', 33: 'this', 34: 'but', 35: 'have', 36: 'are', 37: 'all', 38: 'them', 39: 'its', 40: 'our', 41: 'would', 42: 's', 43: 'so', 44: 'been', 45: 'my', 46: 'these', 47: 'who', 48: 'there', 49: 'And', 50: 'two', 51: 'no', 52: 'into', 53: 'up', 54: 'more', 55: 'when', 56: 'Of', 57: 'has', 58: 'if', 59: 'other', 60: 'out', 61: 'me', 62: 'only', 63: 'us', 64: 'could', 65: 'some', 66: 'To', 67: 'time', 68: 'can', 69: 'In', 70: 'made', 71: 'die', 72: 'what', 73: 'those', 74: 'than', 75: 'men', 76: 'where', 77: 'will', 78: 'first', 79: 'him', 80: 'A', 81: 'between', 82: 'each', 83: 'any', 84: 'own', 85: 'another', 86: 'way' ... }
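A short sketch of how both lookup directions can be derived from the counted list. The function name `build_dictionaries` is an assumption. The input is the first few entries of the counted.txt excerpt.

```python
def build_dictionaries(counted):
    """Build word -> index and index -> word mappings from a counted list."""
    # The position in the frequency-ordered list becomes the index number.
    dictionary = {word: index for index, (word, _) in enumerate(counted)}
    # Swap keys and values to look up a word by its index.
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return dictionary, reverse_dictionary


counted = [["UNK", 18767], ("the", 10108), ("of", 5790), ("and", 4229)]
dictionary, reverse_dictionary = build_dictionaries(counted)
print(reverse_dictionary)
```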
data.txt
The object data is created: the original text in which every word is replaced with its index number, exported as data.txt. Words that are not in the dictionary get the index 0 (UNK).
[0, 0, 223, 0, 2465, 0, 21, 0, 1951, 0, 0, 11, 2574, 3339, 2, 3858, 3, 2574, 232, 1882, 427, 1493, 5, 189, 115, 1404, 66, 39, 116, 2493, 2328, 477, 1090, 57, 269, 0, 0, 0, 0, 382, 487, 49, 23, 2, 1, 0, 2, 0, 3917, 4, 1, 149, 1715, 2, 1, 0, 30, 10, 5, 4136, 0, 34, 192, 5, 1487, 1303, 2, 104, 3, 1, 2203, 0, 29, 57, 3905, 418, 144, 872, 5, 3282, 24, 248, 4672, 0, 0, 6, 227, 686, 2465, 1457, 0, 172, 1, 741, 1000, 49, 1, 4837, 0, 0, 2, 227, 66, 1, 0, 2639, 2, 31, 4563, 180, 8, 295, 105, 1, 116, 433, 56, 1, 0, 480, 7, 29, 131, 26, 2493, 0, 408, 29, 8, 0, 2480, 2639, 15, 1, 818, 2, 31, 2098, 105, 46, 480, 295, 589, 0, 0, 0, 2, 1, 3697, 3, 1, 2001, 516, 0, 429, 13, 19, 2578, 20, 2621, 1019, 1, 0, 0, 0, 115, 2, 1, 185, 1, 953, 47, 0, 5, 267, 2, 1468, 223, 1171, 504, 4, 20, 179, 1, 4349, 3, 0, 705, 3903, 147, 0, 2748, 2192, 1516, 190, 12, 166, 0, 16, 106, 0, 0, 2262, 2262, 0, 2480, 2639, 0, 0, 0, 2053, 0, 42, 2480, 2639, 0, 4004, 0, 339, 888, 3225, 0, 77, 27, 0, 62, 246, 0, 2, 3225, 2885, 0, 0, 373, 0, 3, 0, 2, 2173, 0, 0, 0, 36, 1036, 12, 310, 1214, 0, 0, 0, 297, 59, 3225, 3705, 0, 60, 16, 20, 0, 184, 0, 375, 2213, 1236, 3, 50, 627, 0, 2, 1, 196, 0, 1, 0, 36, 1412, 1737, 214, 0, 0, 3, 0, 4, 1, 185, 0, 6, 1, 1108, 19, 154, 36, 23, 56, 1, 2736, 480, 2, 481, 227 ... ]
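The replacement of words by index numbers can be sketched as a dictionary lookup with a default value. The function name `encode` is an assumption. The three dictionary entries are inferred by lining up the first items of the wordlist.txt and data.txt excerpts.

```python
def encode(words, dictionary):
    """Replace each word with its index; unknown words become 0 (UNK)."""
    return [dictionary.get(word, 0) for word in words]


# Small excerpt of the word -> index dictionary (inferred, see above).
dictionary = {"UNK": 0, "By": 223, "levistrauss": 2465}
words = ["xt", "1250", "By", "Claude", "levistrauss"]

data = encode(words, dictionary)
print(data)
```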
disregarded.txt
List of disregarded words, i.e. the words that fall outside the vocabulary size, exported as disregarded.txt.
['xt', '1250', 'Claude', 'Translated', 'john', 'ussell', 'Illustrated', 'claude', 'levi', 'Strauss', 'eminent', 'founders', 'structural', 'Earth', 'sightseer', 'Cultures', 'univer', 'Sity', 'Extensively', 'upland', 'jungles', 'tristes', 'amerindian', 'humain', 'seeking', 'intricate', 'detailed', 'accounts', 'Designs', 'rigid', 'hier', 'Archical', 'win', 'superstitionridden', 'weird', 'Continued', 'flap', 'Iv', 'cv', '981', 'l56t', 'Le', 'straus', '61157', 'Kansas', 'Books', 'issued', 'presentation', 'Please', 'report', 'cards', 'Change', 'promptly', 'Card', 'holders', 'records', 'films', 'pict', 'Checked', 'cards', 'Frontispiece', 'Carajiindians', 'araguaia', 'Caraji', 'geo', 'Graphically', 'culturally', 'Described', 'Date', 'duk', 'Auf2s', '67', 'Wl', 'Translated', 'John', 'russell', 'Criterion', 'hutchinson', 'publishers', 'ltd', 'london', '1961', 'Library', 'congress', 'catalog', '617203', 'Originally', 'tropiaues', 'librairie', 'plon', '1955', 'chapters', 'Xiv', 'xv', 'xvi', 'xxxix', 'Edition', 'omitted', 'Printed', 'britain', '15758', 'laurent', 'Minus', 'ergo', 'ante', 'haec', 'quam', 'tu', 'ceddere', 'cadentque', 'Lucretius', 'rerum', 'natura', '969', '15758', 'Contents', '65', 'iii', '133', '151', '160', '183', '198', 'vii', '286', 'crusoe', '323', '342', 'japim', '363', 'ix', '381', 'Bibliography', '399', '401', 'Illustrations', 'Frontispiece', 'carajaindians', '97', 'thepantanal', 'belle', 'regalia', 'preparations', 'mariddo', 'cigarette', 'Tucked', 'bracelet', 'wakletou', 'cf', 'plate', 'piercing', 'grading', 'threading', 'suckling', 'conjugal', 'felicity', 'affectionate', 'frolics', 'dozing', 'spinner', 'Plug', 'daydreamer', '46', 'smile', '47', 'amidst', 'mund6', 'dome', 'archer', 'medi', 'Terranean', 'cf', 'Plate', 'mothers', 'eyebrows', 'coated', 'Wax', '55', 'lucinda', '57', 'skinning' ... ]
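One way to collect these words is to keep every word that has no entry in the dictionary, in reading order and including repetitions, as the repeated 'Translated' and 'cards' above suggest. The function name `disregarded_words` is an assumption for this sketch.

```python
def disregarded_words(words, dictionary):
    """Return the words that are missing from the dictionary, in text order."""
    return [word for word in words if word not in dictionary]


dictionary = {"UNK": 0, "by": 1}
words = ["Claude", "Translated", "by", "john", "Translated"]

disregarded = disregarded_words(words, dictionary)
print(disregarded)
```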
reversed-input.txt
Reversed version of the initial dataset, where all the disregarded words are replaced with UNK (unknown), exported as reversed-input.txt.
UNK UNK By UNK levistrauss UNK by UNK r UNK UNK with 48 pages of photographs and 48 line drawings Have sought a human society reduced To its most basic expression His search has taken UNK UNK UNK UNK french anthropologist And one of the UNK of UNK Anthropology to the far corners of the UNK not as a superficial UNK but As a close student of man and the varied UNK he has erected around himself While a professor at sao paolo UNK UNK in brazil m levistrauss travelled UNK through the amazon basin And the dense UNK UNK of brazil To the UNK tropiques of his title It was here among the most primitive Of the UNK tribes that he found The basic UNK societies he was UNK Tristes tropiques is the story of his Experience among these tribes here Are UNK UNK UNK of the Caduveo and the elaborate painted UNK behind which they hide their Natural faces the UNK UNK UNK society of the bororo the Nambikwara who UNK a sort of security By giving wives to their chief the Disease and UNK tupi Kawahib whose UNK tribal dances Sometimes last for days UNK on back UNK UNK v v UNK Tristes tropiques UNK UNK UNK vi UNK s Tristes tropiques UNK L UNK city public library UNK will be UNK only On UNK of library card UNK UNK lost UNK and UNK of residence UNK UNK UNK are responsible for All books UNK UNK UNK Or other library materials UNK out on their UNK I UNK Two masked dancers and two girls UNK of the rio UNK the UNK are closely related both UNK UNK and UNK to the bororo UNK in the book they too are one Of the wandering tribes of central brazil ...
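A minimal sketch of how such a reversed text can be produced, assuming a dictionary that maps words to index numbers with UNK at index 0. The small dictionary and the sentence are invented for this example; the variable names are assumptions.

```python
# Hypothetical three-word vocabulary; index 0 stands for unknown words.
dictionary = {"UNK": 0, "the": 1, "poles": 2}
reverse_dictionary = {index: word for word, index in dictionary.items()}

words = "the poles met and the poles were tied".split()

# Step 1: translate words to index numbers, unknown words become 0.
data = [dictionary.get(word, 0) for word in words]

# Step 2: translate the index numbers back to words.
reversed_text = " ".join(reverse_dictionary[index] for index in data)

print(data)
print(reversed_text)
```

Every word that has no index of its own comes back as UNK, which is why the reversed text above is full of UNK tokens.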
big-random-matrix.txt
A big random matrix is created, with a size of 5000x20 (one row of 20 numbers for each of the 5000 words in the vocabulary), exported as big-random-matrix.txt.
[[ 2.85661697e-01 9.69764948e-01 -7.59074926e-01 -6.15304947e-01 6.77072048e-01 -3.78361940e-01 -6.71523094e-01 3.94770384e-01 7.04541206e-02 -8.92262936e-01 5.87280035e-01 4.58304882e-02 2.53162384e-01 1.90168381e-01 -6.61255836e-01 -3.75634432e-01 -5.55147886e-01 4.49278116e-01 3.26536417e-01 8.64576340e-01] [ -6.70668364e-01 -5.53100824e-01 -3.71278524e-01 1.25042677e-01 -1.46459818e-01 -6.10010624e-01 9.19621468e-01 -1.55832767e-01 -7.70623922e-01 -1.44968033e-01 -6.36267662e-01 -1.87215090e-01 7.09211111e-01 -6.57156706e-01 3.26824188e-02 -4.25864220e-01 -5.86277485e-01 8.16827059e-01 -5.57327747e-01 -3.35038900e-01] [ -9.33161497e-01 8.45068693e-01 -8.14761639e-01 -5.67158937e-01 5.23060560e-01 4.90430593e-01 -9.11595106e-01 4.36383963e-01 -9.69607353e-01 -6.64181471e-01 -4.44166183e-01 7.78196335e-01 -5.34924030e-01 6.49461985e-01 5.69838047e-01 2.50927448e-01 -8.87476921e-01 -3.74064207e-01 4.24978733e-02 1.25571489e-01] [ 9.89913464e-01 3.36525917e-01 -1.86083794e-01 -5.25027514e-01 -8.87480021e-01 8.53247643e-02 4.10822868e-01 3.29172134e-01 8.56166363e-01 5.12266636e-01 7.75470734e-01 7.89757490e-01 -9.44452286e-02 -8.79762173e-01 1.57778263e-02 -8.59814644e-01 4.55990076e-01 4.06166315e-01 -8.40348721e-01 -2.75753498e-01] [ 5.79052448e-01 -3.62973213e-01 -8.79675150e-01 -9.98473167e-01 -1.73240185e-01 7.07520723e-01 4.95352268e-01 4.99097586e-01 -5.02996445e-02 -4.01979208e-01 5.94721079e-01 7.37986326e-01 -6.61164761e-01 6.45744085e-01 -4.68054295e-01 -5.54257870e-01 5.12778997e-01 7.89849758e-01 2.42011547e-02 -2.77193785e-01] ... ]
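A matrix of this shape can be sketched in plain Python as below. This is only an illustration: the script builds the matrix with TensorFlow, while this example uses the standard random module as a stand-in. The range of -1.0 to 1.0 is an assumption based on the values shown above, and the function name random_matrix is hypothetical.

```python
import random


def random_matrix(rows, columns, seed=None):
    """Return rows x columns random numbers between -1.0 and 1.0."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(columns)]
            for _ in range(rows)]


# 5000 vocabulary words, 20 numbers per word.
matrix = random_matrix(5000, 20, seed=0)
print(len(matrix), len(matrix[0]))
```

Each row is the starting vector of one word; training gradually moves these random vectors.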
training-words.txt
A training batch of 64 words, each listed twice (once for each of its two window words, 128 entries in total), with a vector size of 128x20, exported as training-words.txt.
[2831 2831 1906 1906 25 25 1 1 221 221 37 37 1 1 1840 1840 655 655 3 3 22 22 971 971 4 4 1 1 481 481 4235 4235 297 297 0 0 7 7 1343 1343 16 16 53 53 172 172 1 1 1080 1080 1831 1831 0 0 2 2 0 0 1804 1804 1 1 590 590 653 653 3 3 16 16 489 489 2 2 7 7 8 8 5 5 0 0 56 56 1313 1313 13 13 14 14 44 44 3432 3432 6 6 1 1 98 98 744 744 23 23 16 16 489 489 56 56 85 85 4 4 224 224 5 5 0 0 1080 1080 1 1 0 0 474 474]
Or in words:
['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from', 'the', 'the', 'ground', 'ground', 'all', 'all', 'the', 'the', 'poles', 'poles', 'met', 'met', 'and', 'and', 'were', 'were', 'tied', 'tied', 'to', 'to', 'the', 'the', 'central', 'central', 'pole', 'pole', 'Or', 'Or', 'UNK', 'UNK', 'that', 'that', 'pushed', 'pushed', 'on', 'on', 'up', 'up', 'through', 'through', 'the', 'the', 'roof', 'roof', 'horizontal', 'horizontal', 'UNK', 'UNK', 'of', 'of', 'UNK', 'UNK', 'completed', 'completed', 'the', 'the', 'main', 'main', 'structure', 'structure', 'and', 'and', 'on', 'on', 'top', 'top', 'of', 'of', 'that', 'that', 'was', 'was', 'a', 'a', 'UNK', 'UNK', 'Of', 'Of', 'palmleaves', 'palmleaves', 'which', 'which', 'had', 'had', 'been', 'been', 'folded', 'folded', 'in', 'in', 'the', 'the', 'same', 'same', 'direction', 'direction', 'one', 'one', 'on', 'on', 'top', 'top', 'Of', 'Of', 'another', 'another', 'to', 'to', 'form', 'form', 'a', 'a', 'UNK', 'UNK', 'roof', 'roof', 'the', 'the', 'UNK', 'UNK', 'hut', 'hut']
training-window-words.txt
The 128 connected window words, one to the left and one to the right of each training word, with a vector size of 128x20, exported as training-window-words.txt.
[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]
Or in words:
['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']
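The pairing of training words and window words can be sketched as follows. This is a simplified illustration, not the batch function of the script: the name generate_pairs is hypothetical, and the window words are always emitted left first, then right, whereas in the lists above the order of the two neighbours varies from word to word.

```python
def generate_pairs(data, start, n_centers):
    """Pair each training word with its left and right neighbour.

    Hypothetical helper: returns (batch, labels), where every
    training word appears twice in batch, once per window word.
    """
    if start < 1 or start + n_centers >= len(data):
        raise ValueError("every training word needs a neighbour on both sides")
    batch, labels = [], []
    for i in range(start, start + n_centers):
        for j in (i - 1, i + 1):  # window of one word to each side
            batch.append(data[i])
            labels.append([data[j]])
    return batch, labels


# Index numbers taken from the lists above:
# 'or', 'thirteen', 'Feet', 'from', 'the', 'ground'
data = [18, 2831, 1906, 25, 1, 221]
batch, labels = generate_pairs(data, start=1, n_centers=3)
print(batch)
print(labels)
```

With 64 training words instead of 3, the same loop gives the 128 entries of training-words.txt and training-window-words.txt.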
cosine similarity calculation updates
Visualisation of the cosine similarity calculation updates.
...
logfile.txt
The training log, which lists the nearest words to each selected word at successive moments of the training, exported as logfile.txt.
Nearest to collective: Beyond, Although, luxury, confirmed, pointless, Born, colour, stick, scattered, somewhere,
Nearest to being: direcdy, appropriate, 8000, muito, disgusting, broad, southeast, Longer, completed, Before,
Nearest to social: photograph, Working, Hung, coasts, teacher, skins, cuts, extent, sheets, worth,
Nearest to collective: manioc, colour, work, grass, simply, adopted, it, particular, groups, concerned,
Nearest to being: jaguar, said, longer, sky, adopted, this, design, From, better, Longer,
Nearest to social: fall, make, photograph, yellow, given, than, took, men, worth, clouds,
Nearest to collective: manioc, colour, work, simply, grass, adopted, Beyond, horizons, particular, position,
Nearest to being: Longer, said, adopted, jaguar, longer, design, Before, sky, From, completed,
Nearest to social: photograph, fall, yellow, make, Hung, skins, given, worth, extent, teacher,
...
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, horizons, tongue, occupied, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, why, goiania, Behind, Children, negative,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, horizons, lower, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, feeling, northern, humanity, derisory,
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, lower, unexpected, horizons,
Nearest to being: Difficulty, maintained, mass, control, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
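The "Nearest to" lines can be understood through the sketch below, which ranks words by the cosine similarity of their vectors. It is a minimal illustration in plain Python: the function names and the two-dimensional vectors are invented for this example and are not values from the trained model, and all vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings, top_k=10):
    """Return the top_k words whose vectors are most similar to word."""
    target = embeddings[word]
    scores = [(cosine_similarity(target, vector), other)
              for other, vector in embeddings.items() if other != word]
    scores.sort(reverse=True)  # highest similarity first
    return [other for _, other in scores[:top_k]]


# Made-up vectors, for illustration only.
embeddings = {
    "collective": [1.0, 0.0],
    "social": [0.9, 0.1],
    "being": [0.0, 1.0],
    "jaguar": [-1.0, 0.0],
}
result = nearest("collective", embeddings, top_k=2)
print("Nearest to collective:", ", ".join(result))
```

Because the vectors change at every training step, the ranking changes as well, which is why the lists in the log differ between the first and the last lines.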