Word2vec basic.py FR

From Algolit

Revision as of 17:30, 31 October 2017 by An
Type: Algoliterary exploration
Dataset: nearbySaussure
Technique: word embedding
Developed by: a team of researchers led by Tomas Mikolov at Google, Claude Lévi-Strauss, Algolit
Graph generated by the word2vec_basic.py example script, trained on the nearbySaussure text.

This is an annotated version of the basic word2vec script. The code is based on a Word2Vec tutorial provided by Tensorflow.

History

Word2vec is a group of related models used to generate vectors from words (also known as word embeddings). It is a two-layer neural network, produced by a team of researchers led by Tomas Mikolov at Google.

word2vec_basic_algolit.py

The annotated word2vec script is structured as follows:

  • Step 1: Download the data.
  • Algolit step 1: Read the data from a plain-text file
    • Algolit inspection: wordlist.txt
  • Step 2: Build a dictionary and replace rare words with an UNK token.
    • Algolit inspection: counted.txt
    • Algolit inspection: dictionary.txt
    • Algolit inspection: data.txt
    • Algolit inspection: disregarded.txt
    • Algolit adaption: reversed-input.txt
  • Step 3: Function to generate a training batch for the skip-gram model.
  • Step 4: Build and train a skip-gram model.
    • Algolit inspection: big-random-matrix.txt
    • Algolit adaption: select your own set of test words
  • Step 5: Begin training.
    • Algolit inspection: training-words.txt
    • Algolit inspection: training-window-words.txt
    • Algolit adaption: visualisation of the cosine similarity calculation updates
    • Algolit inspection: logfile.txt
  • Step 6: Visualise the embeddings.
    • Algolit adaption: select 3 words to include in the graph

Source

The word2vec_basic.py script offers an option to download a dataset from Matt Mahoney's home page. It turns out to be a plain-text document, without punctuation or line breaks. For the tests we wanted to run with the script, we opted instead for a piece of academic literature: Tristes Tropiques, written by Claude Lévi-Strauss and translated by John Russell. (https://archive.org/details/tristestropiques000177mbp).

Before we could use the nearbySaussure text as training material, we had to remove all punctuation from the file. To do so, we wrote a small Python script, text-punctuation-clean-up.py. The script saves a *stripped* version of the original book under a different filename.
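The clean-up step can be sketched as follows. This is not the actual text-punctuation-clean-up.py script, which is not reproduced on this page. It is a minimal illustration that assumes only ASCII punctuation is removed. The function names and the file names in the comment are hypothetical.

```python
import string


def strip_punctuation(text):
    """Return text without ASCII punctuation, with whitespace collapsed."""
    # Assumption: only the characters in string.punctuation are removed.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


def clean_file(source_path, target_path):
    """Save a stripped copy of source_path under another file name."""
    with open(source_path, encoding="utf-8") as source:
        stripped = strip_punctuation(source.read())
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(stripped)


# Hypothetical usage: clean_file("book.txt", "book-stripped.txt")
print(strip_punctuation("Lévi-Strauss, eminent French anthropologist."))
```

Removing the hyphen joins the two halves of a compound word, which matches tokens such as 'levistrauss' and 'superstitionridden' in the exports further down the page.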

The book contains 153,003 words in total, of which 19,869 are unique.

wordlist.txt

From a continuous text to a list of words, exported as wordlist.txt.

['xt', '1250', 'By', 'Claude', 'levistrauss', 'Translated', 'by', 'john', 'r', 'ussell', 'Illustrated', 'with', '48', 'pages', 'of', 'photographs', 'and', '48', 'line', 'drawings', 'Have', 'sought', 'a', 'human', 'society', 'reduced', 'To', 'its', 'most', 'basic', 'expression', 'His', 'search', 'has', 'taken', 'claude', 'levi', 'Strauss', 'eminent', 'french', 'anthropologist', 'And', 'one', 'of', 'the', 'founders', 'of', 'structural', 'Anthropology', 'to', 'the', 'far', 'corners', 'of', 'the', 'Earth', 'not', 'as', 'a', 'superficial', 'sightseer', 'but', 'As', 'a', 'close', 'student', 'of', 'man', 'and', 'the', 'varied', 'Cultures', 'he', 'has', 'erected', 'around', 'himself', 'While', 'a', 'professor', 'at', 'sao', 'paolo', 'univer', 'Sity', 'in', 'brazil' ... ]
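A minimal sketch of this step, assuming the stripped text is simply split on whitespace. The sample sentence is a stand-in for the book, and the variable names are illustrative.

```python
def to_wordlist(text):
    """Split a continuous text into a list of words on whitespace."""
    return text.split()


sample = "to the far corners of the Earth"
words = to_wordlist(sample)

total_words = len(words)        # every token counts
unique_words = len(set(words))  # each distinct token counts once

print(words)
print(total_words, unique_words)
```

The same two counts, computed over the whole book, would give figures like the total and unique word counts quoted above.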

counted.txt

From a list of words to a list with the structure [(word, count)], exported as counted.txt.

[['UNK', 18767], ('the', 10108), ('of', 5790), ('and', 4229), ('to', 3895), ('a', 3407), ('in', 3092), ('that', 1633), ('was', 1380), ('it', 1367), ('as', 1271), ('with', 1206), ('for', 1196), ('which', 1158), ('had', 1129), ('is', 1119), ('on', 1015), ('i', 1014), ('or', 945), ('they', 905), ('their', 886), ('by', 876), ('were', 868), ('one', 800), ('at', 794), ('from', 764), ('The', 762), ('be', 731), ('we', 726), ('he', 678), ('not', 668), ('his', 646), ('an', 596), ('this', 584), ('but', 576), ('have', 558), ('are', 555), ('all', 547), ('them', 509), ('its', 454), ('our', 452), ('would', 449), ('s', 445), ('so', 440), ('been', 396), ('my', 394), ('these', 386), ('who', 375), ('there', 361), ('And', 348), ('two', 346), ('no', 341), ('into', 336), ('up', 336), ('more', 335), ('when', 335), ('Of', 324), ('has', 296), ('if', 291), ('other', 289), ('out', 287), ('me', 282), ('only', 274), ('us', 272), ('could', 262), ('some', 250), ('To', 243), ('time', 232), ('can', 232), ('In', 229), ('made', 223), ('die', 222), ('what', 222), ('those', 221), ('than', 214), ('men', 209), ('where', 208), ('will', 202), ('first', 201), ('him', 198), ('A', 192), ('between', 191), ('each', 189), ('any', 185), ('own', 183), ('another', 182), ('way', 178) ... ]
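The counting step can be sketched with collections.Counter. The function below is an illustration rather than the script's own code. It keeps the most frequent words, reserves the first slot for UNK, and stores in that slot the number of tokens that fall outside the vocabulary.

```python
import collections


def count_words(words, vocabulary_size):
    """Return [['UNK', n], (word, count), ...] for the most common words."""
    count = [["UNK", 0]]
    # One slot is reserved for UNK, so keep vocabulary_size - 1 real words.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    kept = sum(n for _, n in count[1:])
    count[0][1] = len(words) - kept  # tokens that did not make the cut
    return count


sample = "the of the and the of".split()
print(count_words(sample, vocabulary_size=3))
```

The first entry is a list while the others are tuples, because only the UNK count has to be updated after the fact. The export above shows the same mix.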

dictionary.txt

Reversed dictionary: a list of the 5,000 most common words (= the vocabulary size), each paired with an index number, exported as dictionary.txt.

{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'that', 8: 'was', 9: 'it', 10: 'as', 11: 'with', 12: 'for', 13: 'which', 14: 'had', 15: 'is', 16: 'on', 17: 'i', 18: 'or', 19: 'they', 20: 'their', 21: 'by', 22: 'were', 23: 'one', 24: 'at', 25: 'from', 26: 'The', 27: 'be', 28: 'we', 29: 'he', 30: 'not', 31: 'his', 32: 'an', 33: 'this', 34: 'but', 35: 'have', 36: 'are', 37: 'all', 38: 'them', 39: 'its', 40: 'our', 41: 'would', 42: 's', 43: 'so', 44: 'been', 45: 'my', 46: 'these', 47: 'who', 48: 'there', 49: 'And', 50: 'two', 51: 'no', 52: 'into', 53: 'up', 54: 'more', 55: 'when', 56: 'Of', 57: 'has', 58: 'if', 59: 'other', 60: 'out', 61: 'me', 62: 'only', 63: 'us', 64: 'could', 65: 'some', 66: 'To', 67: 'time', 68: 'can', 69: 'In', 70: 'made', 71: 'die', 72: 'what', 73: 'those', 74: 'than', 75: 'men', 76: 'where', 77: 'will', 78: 'first', 79: 'him', 80: 'A', 81: 'between', 82: 'each', 83: 'any', 84: 'own', 85: 'another', 86: 'way' ... }
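As a sketch, both the word-to-index dictionary and its reversed form can be derived from the counted list. The function name and the short sample list are illustrative.

```python
def build_dictionaries(count):
    """Map each word to its rank in count, and each rank back to its word."""
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    reversed_dictionary = {index: word for word, index in dictionary.items()}
    return dictionary, reversed_dictionary


count = [["UNK", 18767], ("the", 10108), ("of", 5790), ("and", 4229)]
dictionary, reversed_dictionary = build_dictionaries(count)

print(dictionary)
print(reversed_dictionary)
```

Because the counted list is sorted by frequency, a lower index means a more frequent word, with index 0 reserved for UNK.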

data.txt

The data object is created: the original text with each word replaced by its index number, exported as data.txt.

[0, 0, 223, 0, 2465, 0, 21, 0, 1951, 0, 0, 11, 2574, 3339, 2, 3858, 3, 2574, 232, 1882, 427, 1493, 5, 189, 115, 1404, 66, 39, 116, 2493, 2328, 477, 1090, 57, 269, 0, 0, 0, 0, 382, 487, 49, 23, 2, 1, 0, 2, 0, 3917, 4, 1, 149, 1715, 2, 1, 0, 30, 10, 5, 4136, 0, 34, 192, 5, 1487, 1303, 2, 104, 3, 1, 2203, 0, 29, 57, 3905, 418, 144, 872, 5, 3282, 24, 248, 4672, 0, 0, 6, 227, 686, 2465, 1457, 0, 172, 1, 741, 1000, 49, 1, 4837, 0, 0, 2, 227, 66, 1, 0, 2639, 2, 31, 4563, 180, 8, 295, 105, 1, 116, 433, 56, 1, 0, 480, 7, 29, 131, 26, 2493, 0, 408, 29, 8, 0, 2480, 2639, 15, 1, 818, 2, 31, 2098, 105, 46, 480, 295, 589, 0, 0, 0, 2, 1, 3697, 3, 1, 2001, 516, 0, 429, 13, 19, 2578, 20, 2621, 1019, 1, 0, 0, 0, 115, 2, 1, 185, 1, 953, 47, 0, 5, 267, 2, 1468, 223, 1171, 504, 4, 20, 179, 1, 4349, 3, 0, 705, 3903, 147, 0, 2748, 2192, 1516, 190, 12, 166, 0, 16, 106, 0, 0, 2262, 2262, 0, 2480, 2639, 0, 0, 0, 2053, 0, 42, 2480, 2639, 0, 4004, 0, 339, 888, 3225, 0, 77, 27, 0, 62, 246, 0, 2, 3225, 2885, 0, 0, 373, 0, 3, 0, 2, 2173, 0, 0, 0, 36, 1036, 12, 310, 1214, 0, 0, 0, 297, 59, 3225, 3705, 0, 60, 16, 20, 0, 184, 0, 375, 2213, 1236, 3, 50, 627, 0, 2, 1, 196, 0, 1, 0, 36, 1412, 1737, 214, 0, 0, 3, 0, 4, 1, 185, 0, 6, 1, 1108, 19, 154, 36, 23, 56, 1, 2736, 480, 2, 481, 227 ... ]
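A minimal sketch of the encoding, assuming every word missing from the dictionary is mapped to index 0 (UNK). The tiny dictionary is a stand-in for the 5,000-word one.

```python
def encode(words, dictionary):
    """Replace each word by its index; unknown words become 0 (UNK)."""
    return [dictionary.get(word, 0) for word in words]


dictionary = {"UNK": 0, "the": 1, "of": 2, "and": 3}
words = ["the", "founders", "of", "structural", "Anthropology"]

print(encode(words, dictionary))
```

The many zeros in the export above are the words that fell outside the vocabulary.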

disregarded.txt

List of disregarded words, which did not fit within the vocabulary size, exported as disregarded.txt.

['xt', '1250', 'Claude', 'Translated', 'john', 'ussell', 'Illustrated', 'claude', 'levi', 'Strauss', 'eminent', 'founders', 'structural', 'Earth', 'sightseer', 'Cultures', 'univer', 'Sity', 'Extensively', 'upland', 'jungles', 'tristes', 'amerindian', 'humain', 'seeking', 'intricate', 'detailed', 'accounts', 'Designs', 'rigid', 'hier', 'Archical', 'win', 'superstitionridden', 'weird', 'Continued', 'flap', 'Iv', 'cv', '981', 'l56t', 'Le', 'straus', '61157', 'Kansas', 'Books', 'issued', 'presentation', 'Please', 'report', 'cards', 'Change', 'promptly', 'Card', 'holders', 'records', 'films', 'pict', 'Checked', 'cards', 'Frontispiece', 'Carajiindians', 'araguaia', 'Caraji', 'geo', 'Graphically', 'culturally', 'Described', 'Date', 'duk', 'Auf2s', '67', 'Wl', 'Translated', 'John', 'russell', 'Criterion', 'hutchinson', 'publishers', 'ltd', 'london', '1961', 'Library', 'congress', 'catalog', '617203', 'Originally', 'tropiaues', 'librairie', 'plon', '1955', 'chapters', 'Xiv', 'xv', 'xvi', 'xxxix', 'Edition', 'omitted', 'Printed', 'britain', '15758', 'laurent', 'Minus', 'ergo', 'ante', 'haec', 'quam', 'tu', 'ceddere', 'cadentque', 'Lucretius', 'rerum', 'natura', '969', '15758', 'Contents', '65', 'iii', '133', '151', '160', '183', '198', 'vii', '286', 'crusoe', '323', '342', 'japim', '363', 'ix', '381', 'Bibliography', '399', '401', 'Illustrations', 'Frontispiece', 'carajaindians', '97', 'thepantanal', 'belle', 'regalia', 'preparations', 'mariddo', 'cigarette', 'Tucked', 'bracelet', 'wakletou', 'cf', 'plate', 'piercing', 'grading', 'threading', 'suckling', 'conjugal', 'felicity', 'affectionate', 'frolics', 'dozing', 'spinner', 'Plug', 'daydreamer', '46', 'smile', '47', 'amidst', 'mund6', 'dome', 'archer', 'medi', 'Terranean', 'cf', 'Plate', 'mothers', 'eyebrows', 'coated', 'Wax', '55', 'lucinda', '57', 'skinning' ... ]
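The disregarded list can be sketched as a filter over the word list. Duplicates are kept in reading order here, since the export above also repeats some words (for example 'cards'). The names are illustrative.

```python
def disregarded_words(words, dictionary):
    """Return the words, in reading order, that are not in the dictionary."""
    return [word for word in words if word not in dictionary]


dictionary = {"UNK": 0, "the": 1, "of": 2, "and": 3}
words = ["the", "founders", "of", "structural", "founders"]

print(disregarded_words(words, dictionary))
```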

reversed-input.txt

Reversed version of the initial dataset, in which all disregarded words are replaced by UNK (unknown), exported as reversed-input.txt.

UNK UNK By UNK levistrauss UNK by UNK r UNK UNK with 48 pages of photographs and 48 line drawings Have sought a human society reduced To its most basic expression His search has taken UNK UNK UNK UNK french anthropologist And one of the UNK of UNK Anthropology to the far corners of the UNK not as a superficial UNK but As a close student of man and the varied UNK he has erected around himself While a professor at sao paolo UNK UNK in brazil m levistrauss travelled UNK through the amazon basin And the dense UNK UNK of brazil To the UNK tropiques of his title It was here among the most primitive Of the UNK tribes that he found The basic UNK societies he was UNK Tristes tropiques is the story of his Experience among these tribes here Are UNK UNK UNK of the Caduveo and the elaborate painted UNK behind which they hide their Natural faces the UNK UNK UNK society of the bororo the Nambikwara who UNK a sort of security By giving wives to their chief the Disease and UNK tupi Kawahib whose UNK tribal dances Sometimes last for days UNK on back UNK UNK v v UNK Tristes tropiques UNK UNK UNK vi UNK s Tristes tropiques UNK L UNK city public library UNK will be UNK only On UNK of library card UNK UNK lost UNK and UNK of residence UNK UNK UNK are responsible for All books UNK UNK UNK Or other library materials UNK out on their UNK I UNK Two masked dancers and two girls UNK of the rio UNK the UNK are closely related both UNK UNK and UNK to the bororo UNK in the book they too are one Of the wandering tribes of central brazil ...
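Reading the index list back through the reversed dictionary gives this text. A minimal sketch, with a small stand-in dictionary:

```python
def decode(data, reversed_dictionary):
    """Turn a list of indices back into a space-separated text."""
    return " ".join(reversed_dictionary[index] for index in data)


reversed_dictionary = {0: "UNK", 1: "the", 2: "of", 3: "and"}
data = [1, 0, 2, 0, 0]

print(decode(data, reversed_dictionary))
```

This is the text as the model sees it: every word outside the vocabulary has collapsed into the same UNK token.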

big-random-matrix.txt

A big random matrix is created, with a size of 5000x20 (one 20-dimensional vector per vocabulary word), exported as big-random-matrix.txt.

[[  2.85661697e-01   9.69764948e-01  -7.59074926e-01  -6.15304947e-01
   6.77072048e-01  -3.78361940e-01  -6.71523094e-01   3.94770384e-01
   7.04541206e-02  -8.92262936e-01   5.87280035e-01   4.58304882e-02
   2.53162384e-01   1.90168381e-01  -6.61255836e-01  -3.75634432e-01
  -5.55147886e-01   4.49278116e-01   3.26536417e-01   8.64576340e-01]
[ -6.70668364e-01  -5.53100824e-01  -3.71278524e-01   1.25042677e-01
  -1.46459818e-01  -6.10010624e-01   9.19621468e-01  -1.55832767e-01
  -7.70623922e-01  -1.44968033e-01  -6.36267662e-01  -1.87215090e-01
   7.09211111e-01  -6.57156706e-01   3.26824188e-02  -4.25864220e-01
  -5.86277485e-01   8.16827059e-01  -5.57327747e-01  -3.35038900e-01]
[ -9.33161497e-01   8.45068693e-01  -8.14761639e-01  -5.67158937e-01
   5.23060560e-01   4.90430593e-01  -9.11595106e-01   4.36383963e-01
  -9.69607353e-01  -6.64181471e-01  -4.44166183e-01   7.78196335e-01
  -5.34924030e-01   6.49461985e-01   5.69838047e-01   2.50927448e-01
  -8.87476921e-01  -3.74064207e-01   4.24978733e-02   1.25571489e-01]
[  9.89913464e-01   3.36525917e-01  -1.86083794e-01  -5.25027514e-01
  -8.87480021e-01   8.53247643e-02   4.10822868e-01   3.29172134e-01
   8.56166363e-01   5.12266636e-01   7.75470734e-01   7.89757490e-01
  -9.44452286e-02  -8.79762173e-01   1.57778263e-02  -8.59814644e-01
   4.55990076e-01   4.06166315e-01  -8.40348721e-01  -2.75753498e-01]
[  5.79052448e-01  -3.62973213e-01  -8.79675150e-01  -9.98473167e-01
  -1.73240185e-01   7.07520723e-01   4.95352268e-01   4.99097586e-01
  -5.02996445e-02  -4.01979208e-01   5.94721079e-01   7.37986326e-01
  -6.61164761e-01   6.45744085e-01  -4.68054295e-01  -5.54257870e-01
   5.12778997e-01   7.89849758e-01   2.42011547e-02  -2.77193785e-01] ... ]
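A sketch of such a matrix using only the standard library. The values shown above all lie between -1 and 1, which is consistent with a uniform draw over that interval. The distribution, the seed and the function name are assumptions. The script itself builds the matrix with TensorFlow.

```python
import random


def random_matrix(rows, cols, seed=None):
    """Return a rows x cols matrix of floats drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)  # a fixed seed makes the sketch reproducible
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


vocabulary_size = 5000  # one row per word in the dictionary
embedding_size = 20     # length of each word vector

embeddings = random_matrix(vocabulary_size, embedding_size, seed=0)
print(len(embeddings), len(embeddings[0]))
```

At this stage the rows carry no meaning. Training gradually moves them so that words used in similar contexts end up with similar vectors.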

training-words.txt

One training batch of 64 words is exported (each word appears twice, once per window word, giving 128 entries), with a vector size of 128x20, exported as training-words.txt.

[2831 2831 1906 1906   25   25    1    1  221  221   37   37    1    1 1840
1840  655  655    3    3   22   22  971  971    4    4    1    1  481  481
4235 4235  297  297    0    0    7    7 1343 1343   16   16   53   53  172
 172    1    1 1080 1080 1831 1831    0    0    2    2    0    0 1804 1804
   1    1  590  590  653  653    3    3   16   16  489  489    2    2    7
   7    8    8    5    5    0    0   56   56 1313 1313   13   13   14   14
  44   44 3432 3432    6    6    1    1   98   98  744  744   23   23   16
  16  489  489   56   56   85   85    4    4  224  224    5    5    0    0
1080 1080    1    1    0    0  474  474]


Or in words:

['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from', 'the', 'the', 'ground', 'ground', 'all', 'all', 'the', 'the', 'poles', 'poles', 'met', 'met', 'and', 'and', 'were', 'were', 'tied', 'tied', 'to', 'to', 'the', 'the', 'central', 'central', 'pole', 'pole', 'Or', 'Or', 'UNK', 'UNK', 'that', 'that', 'pushed', 'pushed', 'on', 'on', 'up', 'up', 'through', 'through', 'the', 'the', 'roof', 'roof', 'horizontal', 'horizontal', 'UNK', 'UNK', 'of', 'of', 'UNK', 'UNK', 'completed', 'completed', 'the', 'the', 'main', 'main', 'structure', 'structure', 'and', 'and', 'on', 'on', 'top', 'top', 'of', 'of', 'that', 'that', 'was', 'was', 'a', 'a', 'UNK', 'UNK', 'Of', 'Of', 'palmleaves', 'palmleaves', 'which', 'which', 'had', 'had', 'been', 'been', 'folded', 'folded', 'in', 'in', 'the', 'the', 'same', 'same', 'direction', 'direction', 'one', 'one', 'on', 'on', 'top', 'top', 'Of', 'Of', 'another', 'another', 'to', 'to', 'form', 'form', 'a', 'a', 'UNK', 'UNK', 'roof', 'roof', 'the', 'the', 'UNK', 'UNK', 'hut', 'hut']

training-window-words.txt

The 128 connected window words are exported, one to the left and one to the right of each training word, with a vector size of 128x20, exported as training-window-words.txt.

[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]


Or in words:

['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']
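The pairing of training words with window words can be sketched as below. This is a simplified, deterministic version and not the script's own batch function, which may pick window words in a different or random order. The indices are taken from the two exports above, and the function name is hypothetical.

```python
def skipgram_pairs(data, skip_window=1):
    """Pair each word with its neighbours within skip_window positions."""
    centers, contexts = [], []
    # Words too close to either end lack a full window and are skipped.
    for i in range(skip_window, len(data) - skip_window):
        for offset in range(1, skip_window + 1):
            # Right neighbour first, then left, as in the export above.
            for j in (i + offset, i - offset):
                centers.append(data[i])
                contexts.append(data[j])
    return centers, contexts


# 'or thirteen Feet from the' as index numbers
data = [18, 2831, 1906, 25, 1]
centers, contexts = skipgram_pairs(data)

print(centers)
print(contexts)
```

With a window of one word on each side, every training word yields two pairs, which is why 64 training words fill a batch of 128 entries.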

Updates to the cosine similarity calculation

Visualisation of the updates to the cosine similarity calculation.

...

logfile.txt

The training log is saved, exported as logfile.txt.


Nearest to collective: Beyond, Although, luxury, confirmed, pointless, Born, colour, stick, scattered, somewhere,
Nearest to being: direcdy, appropriate, 8000, muito, disgusting, broad, southeast, Longer, completed, Before,
Nearest to social: photograph, Working, Hung, coasts, teacher, skins, cuts, extent, sheets, worth,


Nearest to collective: manioc, colour, work, grass, simply, adopted, it, particular, groups, concerned,
Nearest to being: jaguar, said, longer, sky, adopted, this, design, From, better, Longer,
Nearest to social: fall, make, photograph, yellow, given, than, took, men, worth, clouds,


Nearest to collective: manioc, colour, work, simply, grass, adopted, Beyond, horizons, particular, position,
Nearest to being: Longer, said, adopted, jaguar, longer, design, Before, sky, From, completed,
Nearest to social: photograph, fall, yellow, make, Hung, skins, given, worth, extent, teacher,


...


Nearest to collective: Beyond, Although, tubes, heightened, Born, line, horizons, tongue, occupied, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, why, goiania, Behind, Children, negative,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,


Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, horizons, lower, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, feeling, northern, humanity, derisory,


Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, lower, unexpected, horizons,
Nearest to being: Difficulty, maintained, mass, control, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
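The "Nearest to" lines rank words by the cosine similarity between their vectors. A minimal sketch of that ranking follows. The two-dimensional vectors are invented for illustration and are not taken from the trained model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, embeddings, top_k=2):
    """Return the top_k words whose vectors are most similar to word's."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_k]]


# Invented toy vectors, for illustration only.
toy_embeddings = {
    "social": [1.0, 0.0],
    "collective": [0.9, 0.1],
    "jaguar": [0.0, 1.0],
    "sky": [-1.0, 0.0],
}

print("Nearest to social:", ", ".join(nearest("social", toy_embeddings)))
```

As training updates the vectors, the similarities change and so do the neighbours, which is what the successive blocks of the log show.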


(Translator's note: only the annotations of the script have been translated.)