Computing Word2Vec with gensim in Jupyter Notebook





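2. "A Method for Measuring Research Interest Similarity Based on Semantic Networks" (《基于语义网络的研究兴趣相似性度量方法》). In this paper the authors collected 2,791 journal articles indexed in the Chinese Social Sciences Citation Index (CSSCI), covering 2,104 authors and 4,725 keywords. To simplify computing the similarity of author interest matrices, the paper selects the same number of keywords for each core author and trains a word2vec model on them. When choosing keywords to represent an author's research interests, it also removes generic keywords that contribute little to analyzing interest similarity or field hot spots, such as 电子政务 and 电子政府 (both meaning "e-government"). The word2vec model represents each author keyword as a word vector, that is, a low-dimensional real-valued distribution at the semantic level. The semantic relatedness between keywords is then computed to build a keyword semantic network, and the JS (Jensen-Shannon) distance is used to measure the similarity of the resulting author research interest matrices.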

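So what is the Word2Vec model, and can we actually experiment with it in Python? This notebook explores both questions. The Python code here is mainly meant to show the Word2Vec computation process and to try out the gensim library functions; a separate Jupyter Notebook will run experiments on real data. Logging is turned on so that the computation can be observed.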




word2vec is a method for producing word embeddings (词向量, "word vectors"): it converts the words of a natural language into dense vectors that a computer can work with.

word2vec has two main modes: CBOW (Continuous Bag of Words) and Skip-Gram.


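Take the sentence "Hangzhou is a nice city" for both modes. We need to build a mapping between a context and a target word, which is really the relationship between input and label.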


1. CBOW maps the surrounding context words (here one word on each side) to the target word: [Hangzhou, a] -> is, [is, nice] -> a, [a, city] -> nice


2. Skip-Gram maps the target word to each of its context words, giving the pairs (is, Hangzhou), (is, a), (a, is), (a, nice), (nice, a), (nice, city)
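The two mappings listed above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `training_pairs` is made up for illustration, the window is fixed at one word on each side, and only words with a full window are used, as in the example. Real implementations typically also handle words at the sentence edges, which is not shown here.

```python
def training_pairs(tokens, window=1):
    """Build CBOW and Skip-Gram examples for words with a full context window."""
    cbow, skip_gram = [], []
    # Only positions that have `window` neighbours on both sides
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        target = tokens[i]
        # CBOW: the whole context predicts the target word
        cbow.append((context, target))
        # Skip-Gram: the target word predicts each context word separately
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram


tokens = "Hangzhou is a nice city".split()
cbow, skip_gram = training_pairs(tokens)
print(cbow)
print(skip_gram)
```

CBOW yields one example per target word, while Skip-Gram yields one example per (target, context word) pair, so the same sentence gives twice as many Skip-Gram examples at this window size.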







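Gensim is a free Python library designed to extract semantic topics from documents automatically, as efficiently (for the computer) and as painlessly (for the human) as possible.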

Gensim is designed to process raw, unstructured digital text ("plain text").

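The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Analysis (LSI/LSA, see LsiModel) and Latent Dirichlet Allocation (LDA, see LdaModel), are unsupervised. This means no human input is required: you only need a corpus of plain-text documents.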





pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gensim  # inside mainland China the Tsinghua mirror is faster


Using test data and the tutorial on the Gensim website, we run a word2vec model experiment in Python in Jupyter Notebook.

4. Embed matplotlib figures in the Jupyter Notebook

Once the following line has been run, figures drawn with matplotlib are displayed directly in the Jupyter Notebook.

%matplotlib inline


Print the log messages from the experiment directly in the Jupyter Notebook, which makes the word2vec computation easy to follow.

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
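To see what this format string produces, here is a small sketch that formats a single log record by hand. The record (logger name `demo`, file `demo.py`) is hypothetical and built only to show the layout; the message text is borrowed from the training log in this notebook.

```python
import logging

# Same format string as passed to logging.basicConfig in the notebook
formatter = logging.Formatter('%(asctime)s : %(levelname)s : %(message)s')

# Hypothetical record, created by hand only to show the layout of one log line
record = logging.LogRecord(
    name='demo', level=logging.INFO, pathname='demo.py', lineno=1,
    msg='collecting all words and their counts', args=None, exc_info=None,
)
line = formatter.format(record)
print(line)
```

Each line therefore has three fields separated by " : ": the timestamp, the level (INFO or WARNING in this notebook) and the message.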



from gensim.test.utils import datapath

from gensim import utils

import gensim.models


2021-09-08 16:09:58,638 : INFO : adding document #0 to Dictionary(0 unique tokens: [])

2021-09-08 16:09:58,643 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)

2021-09-08 16:09:58,644 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-09-08T16:09:58.643202', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}



class MyCorpus:

    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):

        corpus_path = datapath('lee_background.cor')

        for line in open(corpus_path):

            # assume there's one document per line, tokens separated by whitespace

            yield utils.simple_preprocess(line)


sentences = MyCorpus()
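One reason to wrap the corpus in a class with `__iter__` instead of passing a plain generator is that training reads the corpus several times: the log in this notebook shows a vocabulary pass followed by five epochs. The sketch below illustrates the difference with an in-memory list of lines; the class name `LineCorpus` and the lowercase/whitespace tokenization are simplifications made up for illustration, standing in for `MyCorpus` and `simple_preprocess`.

```python
class LineCorpus:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # Lowercase and split on whitespace; a stand-in for simple_preprocess
            yield line.lower().split()


lines = ["Hangzhou is a nice city", "Word2Vec learns word vectors"]

corpus = LineCorpus(lines)
first_pass = list(corpus)
second_pass = list(corpus)      # works again because __iter__ restarts

generator = (line.lower().split() for line in lines)
gen_first = list(generator)
gen_second = list(generator)    # empty: a generator can be consumed only once

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```

The class-based corpus yields both sentences on every pass, whereas the generator is empty on the second pass.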



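min_count: defaults to 5, meaning that only words occurring at least 5 times in the corpus are kept.

vector_size: the number of dimensions (N) of the N-dimensional space into which Gensim Word2Vec maps words. The default is 100.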
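The effect of min_count can be illustrated without gensim. The following is a minimal sketch of the filtering rule only, not gensim's actual implementation; the toy corpus and the helper name `retained_vocabulary` are made up for illustration.

```python
from collections import Counter

# Toy corpus (made up for illustration): a list of tokenized sentences
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]


def retained_vocabulary(sentences, min_count):
    """Return the words whose corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


vocab = retained_vocabulary(toy_sentences, min_count=2)
print(sorted(vocab))
```

With min_count=2 only "the" and "sat" remain. The training log in this notebook reports the same kind of reduction on the real corpus: with the default of 5, 1750 of 6981 word types are retained.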

model = gensim.models.Word2Vec(sentences=sentences)


2021-09-08 16:11:26,610 : INFO : collecting all words and their counts

2021-09-08 16:11:26,612 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2021-09-08 16:11:26,737 : INFO : collected 6981 word types from a corpus of 58152 raw words and 300 sentences

2021-09-08 16:11:26,737 : INFO : Creating a fresh vocabulary

2021-09-08 16:11:26,767 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 1750 unique words (25.068041827818366%% of original 6981, drops 5231)', 'datetime': '2021-09-08T16:11:26.767168', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,769 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 49335 word corpus (84.83801073049938%% of original 58152, drops 8817)', 'datetime': '2021-09-08T16:11:26.769168', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,785 : INFO : deleting the raw counts dictionary of 6981 items

2021-09-08 16:11:26,787 : INFO : sample=0.001 downsamples 51 most-common words

2021-09-08 16:11:26,788 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 35935.33721568072 word corpus (72.8%% of prior 49335)', 'datetime': '2021-09-08T16:11:26.788156', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:11:26,827 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes

2021-09-08 16:11:26,828 : INFO : resetting layer weights

2021-09-08 16:11:26,851 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2021-09-08T16:11:26.851118', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'build_vocab'}

2021-09-08 16:11:26,852 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-09-08T16:11:26.852121', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:11:26,977 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:26,981 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:26,983 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:26,984 : INFO : EPOCH - 1 : training on 58152 raw words (35896 effective words) took 0.1s, 278642 effective words/s

2021-09-08 16:11:27,084 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,086 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,094 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,096 : INFO : EPOCH - 2 : training on 58152 raw words (35990 effective words) took 0.1s, 329902 effective words/s

2021-09-08 16:11:27,195 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,202 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,205 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,207 : INFO : EPOCH - 3 : training on 58152 raw words (35921 effective words) took 0.1s, 339330 effective words/s

2021-09-08 16:11:27,312 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,313 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,318 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,319 : INFO : EPOCH - 4 : training on 58152 raw words (36054 effective words) took 0.1s, 329277 effective words/s

2021-09-08 16:11:27,420 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:11:27,421 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:11:27,430 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:11:27,432 : INFO : EPOCH - 5 : training on 58152 raw words (35870 effective words) took 0.1s, 337083 effective words/s

2021-09-08 16:11:27,433 : INFO : Word2Vec lifecycle event {'msg': 'training on 290760 raw words (179731 effective words) took 0.6s, 309329 effective words/s', 'datetime': '2021-09-08T16:11:27.433787', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:11:27,433 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=1750, vector_size=100, alpha=0.025)', 'datetime': '2021-09-08T16:11:27.433787', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


Retrieve the vector for the word "king" from the trained model and print it.

vec_king = model.wv['king']

print(vec_king)



[-6.92510232e-03  4.14995067e-02  1.59823261e-02  1.46031603e-02

  5.08679077e-03 -6.39730245e-02  4.07829806e-02  7.24346042e-02

 -1.78169943e-02 -8.31861142e-03 -5.16973110e-03 -6.55126572e-02

  5.94190601e-03  1.41455485e-02  7.18757603e-03 -7.89120328e-03

 -1.98559370e-03 -9.05822031e-03 -6.99098874e-03 -5.86852096e-02

  3.40853035e-02  2.03361809e-02  8.10676254e-03 -8.25211697e-04

 -1.60578378e-02  1.37015795e-02 -2.02752780e-02 -3.86285335e-02

 -2.95265168e-02  6.02840958e-03  2.31080819e-02 -2.63607632e-02

  3.32033597e-02 -3.91531475e-02  5.09816781e-03  3.13201882e-02

  1.86994895e-02 -9.04701278e-03 -6.88288175e-03 -3.53833996e-02

 -7.21966149e-03 -1.11886691e-02 -2.19836980e-02  7.16285687e-03

  2.69754194e-02 -2.45087147e-02 -2.37088483e-02 -3.23272659e-03

  1.52070969e-02  2.66285371e-02  1.97965223e-02 -2.19922252e-02

 -2.94247735e-02  4.79266094e-03 -1.58476236e-03  1.47990910e-02

  7.19935913e-03  1.92051902e-02 -2.50961687e-02  1.73416305e-02

 -3.54305084e-04  6.87906425e-03  8.16148333e-03 -1.34457452e-02

 -4.21114005e-02  4.24223766e-02 -2.52795685e-03  2.80112531e-02

 -2.31643170e-02  3.85041460e-02 -1.70745552e-02  5.71923843e-03

  5.32592610e-02 -1.80432275e-02  3.00998446e-02  2.71892790e-02

 -8.73102620e-03 -2.51155049e-02 -3.51627842e-02 -4.78182407e-03

 -2.93990150e-02  3.29859066e-03 -3.37860733e-02  4.85829152e-02

  6.55127733e-05 -1.12926299e-02  5.20297652e-03  4.24825326e-02

  3.99986207e-02  1.23887025e-02  3.60009037e-02  3.50367576e-02

  3.11715566e-02  7.25049153e-03  8.06821436e-02  2.94752847e-02

  2.93809213e-02 -1.59397889e-02  1.09631000e-02 -1.44344661e-02]


for index, word in enumerate(model.wv.index_to_key):

    # Stop after the first 10 words of the vocabulary
    if index == 10:

        break

    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")


word #0/1750 is the

word #1/1750 is to

word #2/1750 is of

word #3/1750 is in

word #4/1750 is and

word #5/1750 is he

word #6/1750 is is

word #7/1750 is for

word #8/1750 is on

word #9/1750 is said



import tempfile

model_path = ''

with tempfile.NamedTemporaryFile(prefix='gensim-model-', delete=False) as tmp:

    temporary_filepath = tmp.name

    # Save the trained model to the temporary file
    model.save(temporary_filepath)

    model_path = temporary_filepath


2021-09-08 16:11:43,410 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'C:\\Users\\work\\AppData\\Local\\Temp\\gensim-model-kpg71m3y', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-09-08T16:11:43.410988', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}

2021-09-08 16:11:43,412 : INFO : not storing attribute cum_table

2021-09-08 16:11:43,415 : INFO : saved C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y



new_model = gensim.models.Word2Vec.load(model_path)


2021-09-08 16:11:48,920 : INFO : loading Word2Vec object from C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y

2021-09-08 16:11:48,984 : INFO : loading wv recursively from C:\Users\work\AppData\Local\Temp\gensim-model-kpg71m3y.wv.* with mmap=None

2021-09-08 16:11:48,985 : INFO : setting ignored attribute cum_table to None

2021-09-08 16:11:49,022 : INFO : Word2Vec lifecycle event {'fname': 'C:\\Users\\work\\AppData\\Local\\Temp\\gensim-model-kpg71m3y', 'datetime': '2021-09-08T16:11:49.022784', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'loaded'}



more_sentences = [

    ['Advanced', 'users', 'can', 'load', 'a', 'model',

     'and', 'continue', 'training', 'it', 'with', 'more', 'sentences'],

]

new_model.build_vocab(more_sentences, update=True)

new_model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)


2021-09-08 16:12:02,226 : INFO : collecting all words and their counts

2021-09-08 16:12:02,227 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2021-09-08 16:12:02,227 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences

2021-09-08 16:12:02,228 : INFO : Updating model with new vocabulary

2021-09-08 16:12:02,236 : INFO : Word2Vec lifecycle event {'msg': 'added 0 new unique words (0.0%% of original 13) and increased the count of 0 pre-existing words (0.0%% of original 13)', 'datetime': '2021-09-08T16:12:02.236508', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:12:02,237 : INFO : deleting the raw counts dictionary of 13 items

2021-09-08 16:12:02,238 : INFO : sample=0.001 downsamples 0 most-common words

2021-09-08 16:12:02,238 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 0 word corpus (0.0%% of prior 0)', 'datetime': '2021-09-08T16:12:02.238507', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}

2021-09-08 16:12:02,255 : INFO : estimated required memory for 1750 words and 100 dimensions: 2275000 bytes

2021-09-08 16:12:02,256 : INFO : updating layer weights

2021-09-08 16:12:02,257 : INFO : Word2Vec lifecycle event {'update': True, 'trim_rule': 'None', 'datetime': '2021-09-08T16:12:02.257496', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'build_vocab'}

2021-09-08 16:12:02,258 : WARNING : Effective 'alpha' higher than previous training cycles

2021-09-08 16:12:02,258 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2021-09-08T16:12:02.258496', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}

2021-09-08 16:12:02,263 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,265 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,266 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,267 : INFO : EPOCH - 1 : training on 13 raw words (5 effective words) took 0.0s, 1180 effective words/s

2021-09-08 16:12:02,268 : WARNING : EPOCH - 1 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,272 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,272 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,273 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,274 : INFO : EPOCH - 2 : training on 13 raw words (6 effective words) took 0.0s, 2247 effective words/s

2021-09-08 16:12:02,274 : WARNING : EPOCH - 2 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,280 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,281 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,282 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,284 : INFO : EPOCH - 3 : training on 13 raw words (5 effective words) took 0.0s, 1003 effective words/s

2021-09-08 16:12:02,285 : WARNING : EPOCH - 3 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,288 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,289 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,290 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,290 : INFO : EPOCH - 4 : training on 13 raw words (5 effective words) took 0.0s, 2002 effective words/s

2021-09-08 16:12:02,291 : WARNING : EPOCH - 4 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,297 : INFO : worker thread finished; awaiting finish of 2 more threads

2021-09-08 16:12:02,298 : INFO : worker thread finished; awaiting finish of 1 more threads

2021-09-08 16:12:02,299 : INFO : worker thread finished; awaiting finish of 0 more threads

2021-09-08 16:12:02,301 : INFO : EPOCH - 5 : training on 13 raw words (6 effective words) took 0.0s, 1440 effective words/s

2021-09-08 16:12:02,302 : WARNING : EPOCH - 5 : supplied example count (1) did not equal expected count (300)

2021-09-08 16:12:02,303 : INFO : Word2Vec lifecycle event {'msg': 'training on 65 raw words (27 effective words) took 0.0s, 612 effective words/s', 'datetime': '2021-09-08T16:12:02.303471', 'gensim': '4.0.1', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}


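The above is a word2vec model experiment based on the example from the gensim website. The gensim website also provides an example that uses a pre-trained model to print the similarity between car and each of minivan, bicycle, airplane, cereal and communism. That model file is about 2 GB and has to be downloaded from outside China, so we do not run this experiment in a mainland China environment.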
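The similarity that gensim reports for two words is the cosine similarity of their vectors. As a rough sketch of that measure, the function below computes it in plain Python. The 3-dimensional vectors are made up for illustration and are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, not taken from any trained model
car = [1.0, 2.0, 0.0]
minivan = [2.0, 4.0, 0.0]    # points in the same direction as car
cereal = [0.0, 0.0, 3.0]     # orthogonal to car

print(round(cosine_similarity(car, minivan), 2))
print(round(cosine_similarity(car, cereal), 2))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.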



import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

pairs = [

    ('car', 'minivan'),   # a minivan is a kind of car

    ('car', 'bicycle'),   # still a wheeled vehicle

    ('car', 'airplane'),  # ok, no wheels, but still a vehicle

    ('car', 'cereal'),    # ... and so on

    ('car', 'communism'),

]

for w1, w2 in pairs:

    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))











