Word2vec basic.py
Type: | Algolit extension |
Data: | Tristes Tropiques |
Technique: | word embeddings |
Developed by: | a team of researchers led by Tomas Mikolov at Google, Claude Lévi-Strauss, Algolit |
This is an annotated version of the basic word2vec script. The code is based on the Word2Vec tutorial provided by TensorFlow.
History
Word2vec is a group of related models used to produce vectors from words (also called word embeddings). It is a two-layer neural network, developed by a team of researchers led by Tomas Mikolov at Google.
word2vec_basic_algolit.py
The structure of the annotated word2vec script is as follows:
- Step 1: Download the data.
- Algolit step 1: Read the data from the plain-text file
- Algolit inspection: wordlist.txt
- Step 2: Build a dictionary and replace rare words with an UNK token.
- Algolit inspection: counted.txt
- Algolit inspection: dictionary.txt
- Algolit inspection: data.txt
- Algolit inspection: disregarded.txt
- Algolit adaption: reversed-input.txt
- Step 3: Function to generate a training batch for the skip-gram model.
- Step 4: Build and train a skip-gram model.
- Algolit inspection: big-random-matrix.txt
- Algolit adaption: select your own set of test words
- Step 5: Begin training.
- Algolit inspection: training-words.txt
- Algolit inspection: training-window-words.txt
- Algolit adaption: visualization of the cosine similarity calculation updates
- Algolit inspection: logfile.txt
- Step 6: Visualize the embeddings.
- Algolit adaption: select 3 words to be included in the graph
Source
The word2vec_basic.py script offers an option to download a dataset from Matt Mahoney's home page. That dataset turns out to be a plain-text document without punctuation or line breaks. For the tests we wanted to run with the script, we opted instead for a piece of academic literature: Tristes Tropiques, written by Claude Lévi-Strauss and translated by John Russell (https://archive.org/details/tristestropiques000177mbp).
Before we could use Lévi-Strauss's text as training material, we had to remove all punctuation from the file. To do this, we wrote a small Python script, text-punctuation-clean-up.py. The script saves a *stripped* version of the original book under another filename.
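The clean-up step could look roughly like the sketch below. This is not the actual text-punctuation-clean-up.py, which is not reproduced on this page; the helper name `strip_punctuation` and the sample sentence are assumptions made for illustration, and only ASCII punctuation is handled.

```python
import string

def strip_punctuation(text):
    """Hypothetical helper: drop ASCII punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())

# Invented sample sentence, not a quote from the cleaned file.
sample = "By Claude Lévi-Strauss. Translated by John Russell!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Removing the hyphen glues the two halves of a compound together, which would be consistent with tokens such as 'levistrauss' in the word list.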
The book contains 153,003 words in total, of which 19,869 are unique.
wordlist.txt
From a continuous text to a list of words, exported as wordlist.txt.
['xt', '1250', 'By', 'Claude', 'levistrauss', 'Translated', 'by', 'john', 'r', 'ussell', 'Illustrated', 'with', '48', 'pages', 'of', 'photographs', 'and', '48', 'line', 'drawings', 'Have', 'sought', 'a', 'human', 'society', 'reduced', 'To', 'its', 'most', 'basic', 'expression', 'His', 'search', 'has', 'taken', 'claude', 'levi', 'Strauss', 'eminent', 'french', 'anthropologist', 'And', 'one', 'of', 'the', 'founders', 'of', 'structural', 'Anthropology', 'to', 'the', 'far', 'corners', 'of', 'the', 'Earth', 'not', 'as', 'a', 'superficial', 'sightseer', 'but', 'As', 'a', 'close', 'student', 'of', 'man', 'and', 'the', 'varied', 'Cultures', 'he', 'has', 'erected', 'around', 'himself', 'While', 'a', 'professor', 'at', 'sao', 'paolo', 'univer', 'Sity', 'in', 'brazil' ... ]
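A minimal sketch of how a continuous text becomes a word list, and how the total and unique word counts can be derived from it. The sample string is a short invented stand-in for the book, and splitting on whitespace is an assumption about how the script tokenizes.

```python
# Stand-in for the punctuation-free book text (assumption: words are
# separated by whitespace only).
text = "the poles met and were tied to the central pole"

words = text.split()          # continuous text -> list of words
total_words = len(words)      # every occurrence counts
unique_words = len(set(words))  # each distinct spelling counts once

print(words)
print(total_words, unique_words)
```

Because the comparison is case-sensitive, 'The' and 'the' count as two different words, which matches the separate entries for both in the exports.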
counted.txt
From a list of words to a list with the structure [(word, count)], exported as counted.txt.
[['UNK', 18767], ('the', 10108), ('of', 5790), ('and', 4229), ('to', 3895), ('a', 3407), ('in', 3092), ('that', 1633), ('was', 1380), ('it', 1367), ('as', 1271), ('with', 1206), ('for', 1196), ('which', 1158), ('had', 1129), ('is', 1119), ('on', 1015), ('i', 1014), ('or', 945), ('they', 905), ('their', 886), ('by', 876), ('were', 868), ('one', 800), ('at', 794), ('from', 764), ('The', 762), ('be', 731), ('we', 726), ('he', 678), ('not', 668), ('his', 646), ('an', 596), ('this', 584), ('but', 576), ('have', 558), ('are', 555), ('all', 547), ('them', 509), ('its', 454), ('our', 452), ('would', 449), ('s', 445), ('so', 440), ('been', 396), ('my', 394), ('these', 386), ('who', 375), ('there', 361), ('And', 348), ('two', 346), ('no', 341), ('into', 336), ('up', 336), ('more', 335), ('when', 335), ('Of', 324), ('has', 296), ('if', 291), ('other', 289), ('out', 287), ('me', 282), ('only', 274), ('us', 272), ('could', 262), ('some', 250), ('To', 243), ('time', 232), ('can', 232), ('In', 229), ('made', 223), ('die', 222), ('what', 222), ('those', 221), ('than', 214), ('men', 209), ('where', 208), ('will', 202), ('first', 201), ('him', 198), ('A', 192), ('between', 191), ('each', 189), ('any', 185), ('own', 183), ('another', 182), ('way', 178) ... ]
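The counting step can be sketched with `collections.Counter`. The first slot is reserved for UNK and filled in later, which would explain why the export starts with a list (`['UNK', 18767]`) followed by tuples. The toy sentence and the tiny `vocabulary_size` are assumptions for illustration; the annotated script uses 5000.

```python
import collections

# Toy corpus (invented); the real input is the full word list.
words = "the roof of the hut of the chief of the hut".split()
vocabulary_size = 4  # illustrative only; the script uses 5000

# Slot 0 is reserved for UNK. Its count is a placeholder (-1) until the
# number of out-of-vocabulary words is known.
count = [["UNK", -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
print(count)
```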
dictionary.txt
Reversed dictionary: a list of the 5000 most common words (= the vocabulary size), each paired with an index number, exported as dictionary.txt.
{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'that', 8: 'was', 9: 'it', 10: 'as', 11: 'with', 12: 'for', 13: 'which', 14: 'had', 15: 'is', 16: 'on', 17: 'i', 18: 'or', 19: 'they', 20: 'their', 21: 'by', 22: 'were', 23: 'one', 24: 'at', 25: 'from', 26: 'The', 27: 'be', 28: 'we', 29: 'he', 30: 'not', 31: 'his', 32: 'an', 33: 'this', 34: 'but', 35: 'have', 36: 'are', 37: 'all', 38: 'them', 39: 'its', 40: 'our', 41: 'would', 42: 's', 43: 'so', 44: 'been', 45: 'my', 46: 'these', 47: 'who', 48: 'there', 49: 'And', 50: 'two', 51: 'no', 52: 'into', 53: 'up', 54: 'more', 55: 'when', 56: 'Of', 57: 'has', 58: 'if', 59: 'other', 60: 'out', 61: 'me', 62: 'only', 63: 'us', 64: 'could', 65: 'some', 66: 'To', 67: 'time', 68: 'can', 69: 'In', 70: 'made', 71: 'die', 72: 'what', 73: 'those', 74: 'than', 75: 'men', 76: 'where', 77: 'will', 78: 'first', 79: 'him', 80: 'A', 81: 'between', 82: 'each', 83: 'any', 84: 'own', 85: 'another', 86: 'way' ... }
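A sketch of how the two lookup tables can be built from the counted list: one from word to index, and its reverse from index to word, which is the form exported above. The `count` list here is a four-entry toy example, not the real data.

```python
# Toy counted list (invented), most frequent first, UNK in slot 0.
count = [("UNK", 2), ("the", 4), ("of", 3), ("hut", 2)]

# word -> index: the rank in the counted list becomes the index number.
dictionary = {word: index for index, (word, _) in enumerate(count)}

# index -> word: the reversed dictionary.
reverse_dictionary = {index: word for word, index in dictionary.items()}

print(dictionary)
print(reverse_dictionary)
```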
data.txt
The data object is created: the original text with every word replaced by its index number, exported as data.txt.
[0, 0, 223, 0, 2465, 0, 21, 0, 1951, 0, 0, 11, 2574, 3339, 2, 3858, 3, 2574, 232, 1882, 427, 1493, 5, 189, 115, 1404, 66, 39, 116, 2493, 2328, 477, 1090, 57, 269, 0, 0, 0, 0, 382, 487, 49, 23, 2, 1, 0, 2, 0, 3917, 4, 1, 149, 1715, 2, 1, 0, 30, 10, 5, 4136, 0, 34, 192, 5, 1487, 1303, 2, 104, 3, 1, 2203, 0, 29, 57, 3905, 418, 144, 872, 5, 3282, 24, 248, 4672, 0, 0, 6, 227, 686, 2465, 1457, 0, 172, 1, 741, 1000, 49, 1, 4837, 0, 0, 2, 227, 66, 1, 0, 2639, 2, 31, 4563, 180, 8, 295, 105, 1, 116, 433, 56, 1, 0, 480, 7, 29, 131, 26, 2493, 0, 408, 29, 8, 0, 2480, 2639, 15, 1, 818, 2, 31, 2098, 105, 46, 480, 295, 589, 0, 0, 0, 2, 1, 3697, 3, 1, 2001, 516, 0, 429, 13, 19, 2578, 20, 2621, 1019, 1, 0, 0, 0, 115, 2, 1, 185, 1, 953, 47, 0, 5, 267, 2, 1468, 223, 1171, 504, 4, 20, 179, 1, 4349, 3, 0, 705, 3903, 147, 0, 2748, 2192, 1516, 190, 12, 166, 0, 16, 106, 0, 0, 2262, 2262, 0, 2480, 2639, 0, 0, 0, 2053, 0, 42, 2480, 2639, 0, 4004, 0, 339, 888, 3225, 0, 77, 27, 0, 62, 246, 0, 2, 3225, 2885, 0, 0, 373, 0, 3, 0, 2, 2173, 0, 0, 0, 36, 1036, 12, 310, 1214, 0, 0, 0, 297, 59, 3225, 3705, 0, 60, 16, 20, 0, 184, 0, 375, 2213, 1236, 3, 50, 627, 0, 2, 1, 196, 0, 1, 0, 36, 1412, 1737, 214, 0, 0, 3, 0, 4, 1, 185, 0, 6, 1, 1108, 19, 154, 36, 23, 56, 1, 2736, 480, 2, 481, 227 ... ]
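Encoding the text as index numbers can be sketched as a dictionary lookup with 0 (UNK) as the fallback. This also yields the number of unknown words, which is the value that ends up next to 'UNK' in the counted list. The words and dictionary below are toy values.

```python
# Toy inputs (invented): a word list and a word -> index dictionary.
words = "the roof of the hut of the chief of the hut".split()
dictionary = {"UNK": 0, "the": 1, "of": 2, "hut": 3}

# Words missing from the dictionary map to index 0, i.e. UNK.
data = [dictionary.get(word, 0) for word in words]
unk_count = data.count(0)

print(data)
print(unk_count)
```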
disregarded.txt
List of disregarded words, which did not fit within the vocabulary size, exported as disregarded.txt.
['xt', '1250', 'Claude', 'Translated', 'john', 'ussell', 'Illustrated', 'claude', 'levi', 'Strauss', 'eminent', 'founders', 'structural', 'Earth', 'sightseer', 'Cultures', 'univer', 'Sity', 'Extensively', 'upland', 'jungles', 'tristes', 'amerindian', 'humain', 'seeking', 'intricate', 'detailed', 'accounts', 'Designs', 'rigid', 'hier', 'Archical', 'win', 'superstitionridden', 'weird', 'Continued', 'flap', 'Iv', 'cv', '981', 'l56t', 'Le', 'straus', '61157', 'Kansas', 'Books', 'issued', 'presentation', 'Please', 'report', 'cards', 'Change', 'promptly', 'Card', 'holders', 'records', 'films', 'pict', 'Checked', 'cards', 'Frontispiece', 'Carajiindians', 'araguaia', 'Caraji', 'geo', 'Graphically', 'culturally', 'Described', 'Date', 'duk', 'Auf2s', '67', 'Wl', 'Translated', 'John', 'russell', 'Criterion', 'hutchinson', 'publishers', 'ltd', 'london', '1961', 'Library', 'congress', 'catalog', '617203', 'Originally', 'tropiaues', 'librairie', 'plon', '1955', 'chapters', 'Xiv', 'xv', 'xvi', 'xxxix', 'Edition', 'omitted', 'Printed', 'britain', '15758', 'laurent', 'Minus', 'ergo', 'ante', 'haec', 'quam', 'tu', 'ceddere', 'cadentque', 'Lucretius', 'rerum', 'natura', '969', '15758', 'Contents', '65', 'iii', '133', '151', '160', '183', '198', 'vii', '286', 'crusoe', '323', '342', 'japim', '363', 'ix', '381', 'Bibliography', '399', '401', 'Illustrations', 'Frontispiece', 'carajaindians', '97', 'thepantanal', 'belle', 'regalia', 'preparations', 'mariddo', 'cigarette', 'Tucked', 'bracelet', 'wakletou', 'cf', 'plate', 'piercing', 'grading', 'threading', 'suckling', 'conjugal', 'felicity', 'affectionate', 'frolics', 'dozing', 'spinner', 'Plug', 'daydreamer', '46', 'smile', '47', 'amidst', 'mund6', 'dome', 'archer', 'medi', 'Terranean', 'cf', 'Plate', 'mothers', 'eyebrows', 'coated', 'Wax', '55', 'lucinda', '57', 'skinning' ... ]
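The disregarded list can be sketched as every word, in reading order, that has no entry in the dictionary. Repeats are kept, as in the export above, where 'cards' and 'Frontispiece' appear more than once. Inputs are toy values.

```python
# Toy inputs (invented).
words = "the roof of the hut of the chief of the roof".split()
dictionary = {"UNK": 0, "the": 1, "of": 2, "hut": 3}

# Keep reading order and duplicates; only membership in the vocabulary matters.
disregarded = [word for word in words if word not in dictionary]
print(disregarded)
```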
reversed-input.txt
Reversed version of the initial dataset, in which every disregarded word is replaced by UNK (unknown), exported as reversed-input.txt.
UNK UNK By UNK levistrauss UNK by UNK r UNK UNK with 48 pages of photographs and 48 line drawings Have sought a human society reduced To its most basic expression His search has taken UNK UNK UNK UNK french anthropologist And one of the UNK of UNK Anthropology to the far corners of the UNK not as a superficial UNK but As a close student of man and the varied UNK he has erected around himself While a professor at sao paolo UNK UNK in brazil m levistrauss travelled UNK through the amazon basin And the dense UNK UNK of brazil To the UNK tropiques of his title It was here among the most primitive Of the UNK tribes that he found The basic UNK societies he was UNK Tristes tropiques is the story of his Experience among these tribes here Are UNK UNK UNK of the Caduveo and the elaborate painted UNK behind which they hide their Natural faces the UNK UNK UNK society of the bororo the Nambikwara who UNK a sort of security By giving wives to their chief the Disease and UNK tupi Kawahib whose UNK tribal dances Sometimes last for days UNK on back UNK UNK v v UNK Tristes tropiques UNK UNK UNK vi UNK s Tristes tropiques UNK L UNK city public library UNK will be UNK only On UNK of library card UNK UNK lost UNK and UNK of residence UNK UNK UNK are responsible for All books UNK UNK UNK Or other library materials UNK out on their UNK I UNK Two masked dancers and two girls UNK of the rio UNK the UNK are closely related both UNK UNK and UNK to the bororo UNK in the book they too are one Of the wandering tribes of central brazil ...
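Turning the index numbers back into text is a lookup in the reversed dictionary, sketched here with toy values. Every index 0 comes back as UNK, which makes visible what the model never gets to see.

```python
# Toy inputs (invented): encoded text and an index -> word dictionary.
data = [1, 0, 2, 1, 3, 2, 1, 0, 2, 1, 3]
reverse_dictionary = {0: "UNK", 1: "the", 2: "of", 3: "hut"}

reversed_input = " ".join(reverse_dictionary[index] for index in data)
print(reversed_input)
```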
big-random-matrix.txt
A big random matrix is created, of size 5000x20 (one 20-value vector for each word in the vocabulary), exported as big-random-matrix.txt.
[[ 2.85661697e-01 9.69764948e-01 -7.59074926e-01 -6.15304947e-01 6.77072048e-01 -3.78361940e-01 -6.71523094e-01 3.94770384e-01 7.04541206e-02 -8.92262936e-01 5.87280035e-01 4.58304882e-02 2.53162384e-01 1.90168381e-01 -6.61255836e-01 -3.75634432e-01 -5.55147886e-01 4.49278116e-01 3.26536417e-01 8.64576340e-01] [ -6.70668364e-01 -5.53100824e-01 -3.71278524e-01 1.25042677e-01 -1.46459818e-01 -6.10010624e-01 9.19621468e-01 -1.55832767e-01 -7.70623922e-01 -1.44968033e-01 -6.36267662e-01 -1.87215090e-01 7.09211111e-01 -6.57156706e-01 3.26824188e-02 -4.25864220e-01 -5.86277485e-01 8.16827059e-01 -5.57327747e-01 -3.35038900e-01] [ -9.33161497e-01 8.45068693e-01 -8.14761639e-01 -5.67158937e-01 5.23060560e-01 4.90430593e-01 -9.11595106e-01 4.36383963e-01 -9.69607353e-01 -6.64181471e-01 -4.44166183e-01 7.78196335e-01 -5.34924030e-01 6.49461985e-01 5.69838047e-01 2.50927448e-01 -8.87476921e-01 -3.74064207e-01 4.24978733e-02 1.25571489e-01] [ 9.89913464e-01 3.36525917e-01 -1.86083794e-01 -5.25027514e-01 -8.87480021e-01 8.53247643e-02 4.10822868e-01 3.29172134e-01 8.56166363e-01 5.12266636e-01 7.75470734e-01 7.89757490e-01 -9.44452286e-02 -8.79762173e-01 1.57778263e-02 -8.59814644e-01 4.55990076e-01 4.06166315e-01 -8.40348721e-01 -2.75753498e-01] [ 5.79052448e-01 -3.62973213e-01 -8.79675150e-01 -9.98473167e-01 -1.73240185e-01 7.07520723e-01 4.95352268e-01 4.99097586e-01 -5.02996445e-02 -4.01979208e-01 5.94721079e-01 7.37986326e-01 -6.61164761e-01 6.45744085e-01 -4.68054295e-01 -5.54257870e-01 5.12778997e-01 7.89849758e-01 2.42011547e-02 -2.77193785e-01] ... ]
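A sketch of the initial matrix using only the standard library. The values in the export all lie between -1 and 1, so a uniform draw over that range is assumed here; the fixed seed is an addition for repeatability, and the real script builds the matrix with TensorFlow rather than plain Python lists.

```python
import random

vocabulary_size = 5000  # one row per word in the vocabulary
embedding_size = 20     # values per word vector

rng = random.Random(0)  # fixed seed: an assumption, for repeatable output

# Each word starts with a random vector; training moves these values.
embeddings = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]

print(len(embeddings), len(embeddings[0]))
```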
training-words.txt
Export a training batch of 64 words, each listed twice (once per window word), with a vector size of 128x20, exported as training-words.txt.
[2831 2831 1906 1906 25 25 1 1 221 221 37 37 1 1 1840 1840 655 655 3 3 22 22 971 971 4 4 1 1 481 481 4235 4235 297 297 0 0 7 7 1343 1343 16 16 53 53 172 172 1 1 1080 1080 1831 1831 0 0 2 2 0 0 1804 1804 1 1 590 590 653 653 3 3 16 16 489 489 2 2 7 7 8 8 5 5 0 0 56 56 1313 1313 13 13 14 14 44 44 3432 3432 6 6 1 1 98 98 744 744 23 23 16 16 489 489 56 56 85 85 4 4 224 224 5 5 0 0 1080 1080 1 1 0 0 474 474]
Or in words:
['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from', 'the', 'the', 'ground', 'ground', 'all', 'all', 'the', 'the', 'poles', 'poles', 'met', 'met', 'and', 'and', 'were', 'were', 'tied', 'tied', 'to', 'to', 'the', 'the', 'central', 'central', 'pole', 'pole', 'Or', 'Or', 'UNK', 'UNK', 'that', 'that', 'pushed', 'pushed', 'on', 'on', 'up', 'up', 'through', 'through', 'the', 'the', 'roof', 'roof', 'horizontal', 'horizontal', 'UNK', 'UNK', 'of', 'of', 'UNK', 'UNK', 'completed', 'completed', 'the', 'the', 'main', 'main', 'structure', 'structure', 'and', 'and', 'on', 'on', 'top', 'top', 'of', 'of', 'that', 'that', 'was', 'was', 'a', 'a', 'UNK', 'UNK', 'Of', 'Of', 'palmleaves', 'palmleaves', 'which', 'which', 'had', 'had', 'been', 'been', 'folded', 'folded', 'in', 'in', 'the', 'the', 'same', 'same', 'direction', 'direction', 'one', 'one', 'on', 'on', 'top', 'top', 'Of', 'Of', 'another', 'another', 'to', 'to', 'form', 'form', 'a', 'a', 'UNK', 'UNK', 'roof', 'roof', 'the', 'the', 'UNK', 'UNK', 'hut', 'hut']
training-window-words.txt
Export the 128 connected window words, one to the left and one to the right of each training word, with a vector size of 128x20, exported as training-window-words.txt.
[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]
Or in words:
['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']
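The pairing of training words with window words can be sketched as follows. The function name `skipgram_pairs` is hypothetical, and the pairs are emitted in a fixed left-then-right order, whereas the exports above show the two neighbours in varying order. The index numbers are the ones that appear in the exports for 'or thirteen Feet from the'.

```python
def skipgram_pairs(data, skip_window=1):
    """Hypothetical helper: pair each word with its neighbours in the window."""
    batch, labels = [], []
    for center in range(skip_window, len(data) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # a word is not its own context
            batch.append(data[center])             # training word, repeated
            labels.append([data[center + offset]])  # one window word per row
    return batch, labels

# 'or thirteen Feet from the' as index numbers taken from the exports.
data = [18, 2831, 1906, 25, 1]
batch, labels = skipgram_pairs(data)
print(batch)
print(labels)
```

With a window of one word on each side, every training word appears twice in the batch, which is why 64 words yield 128 entries.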
Cosine similarity calculation updates
Visualization of the cosine similarity calculation updates.
...
logfile.txt
Save the training log, exported as logfile.txt.
Nearest to collective: Beyond, Although, luxury, confirmed, pointless, Born, colour, stick, scattered, somewhere,
Nearest to being: direcdy, appropriate, 8000, muito, disgusting, broad, southeast, Longer, completed, Before,
Nearest to social: photograph, Working, Hung, coasts, teacher, skins, cuts, extent, sheets, worth,
Nearest to collective: manioc, colour, work, grass, simply, adopted, it, particular, groups, concerned,
Nearest to being: jaguar, said, longer, sky, adopted, this, design, From, better, Longer,
Nearest to social: fall, make, photograph, yellow, given, than, took, men, worth, clouds,
Nearest to collective: manioc, colour, work, simply, grass, adopted, Beyond, horizons, particular, position,
Nearest to being: Longer, said, adopted, jaguar, longer, design, Before, sky, From, completed,
Nearest to social: photograph, fall, yellow, make, Hung, skins, given, worth, extent, teacher,
...
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, horizons, tongue, occupied, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, why, goiania, Behind, Children, negative,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, horizons, lower, unexpected,
Nearest to being: Difficulty, maintained, control, mass, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, feeling, northern, humanity, derisory,
Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, lower, unexpected, horizons,
Nearest to being: Difficulty, maintained, mass, control, Three, goiania, Behind, why, characteristics, Instead,
Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
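The "Nearest to" lines in the log rank words by cosine similarity between their vectors. A minimal sketch with invented 2-value vectors (the real vectors have 20 values and change at every training step); the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, top_k=2):
    """Hypothetical helper: the top_k words most similar to `word`."""
    target = embeddings[word]
    scores = {
        other: cosine_similarity(target, vector)
        for other, vector in embeddings.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy vectors, invented for illustration only.
embeddings = {
    "social": [1.0, 0.0],
    "collective": [0.9, 0.1],
    "being": [0.0, 1.0],
    "jaguar": [-1.0, 0.0],
}
print("Nearest to social:", ", ".join(nearest("social", embeddings)))
```

As the vectors are updated during training, the ranking shifts, which is what the successive blocks of the log show.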
(Translator's note: only the annotations of the script have been translated.)