gensim doc2vec in practice

There is not much material on gensim's doc2vec, so I ran some exploratory experiments based on the official API. This post covers training a model with gensim's doc2vec, inferring vectors for new documents, and computing similarities. Some parts are still immature and will be improved later.

Import modules

# -*- coding: utf-8 -*-
import sys
reload(sys)                        # Python 2 only
sys.setdefaultencoding('utf8')
import gensim, logging
import os
import numpy as np                 # needed for the random-document check below
import jieba

# logging information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Read the file

# get input file, text format
f = open('trainingdata.txt','r')
input = f.readlines()
count = len(input)
print count

Preprocess the file: word segmentation, etc.

# read file, segment each line with jieba, and write the space-separated words
output = open('output.seq', 'w')
alldocs = []  # kept for the sanity check later; can be removed
count = 0     # kept for the sanity check later; can be removed
for line in input:
    line = line.strip('\n')
    seg_list = list(jieba.cut(line))  # materialize: jieba.cut returns a one-shot generator
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list, [count]))  # tags must be a list
    count += 1
output.close()

Model selection

gensim's Doc2Vec offers two architectures, DM (distributed memory) and DBOW (distributed bag of words). The gensim documentation recommends training over the dataset multiple times, decaying the learning rate and/or shuffling the input order on each pass, to get the best results.

# PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=5, hs=0, min_count=2, workers=cores)
# PV-DBOW
Doc2Vec(sentences, dm=0, size=100, hs=0, min_count=2, workers=cores)
# PV-DM w/ averaging
Doc2Vec(sentences, dm=1, dm_mean=1, size=100, window=2, hs=0, min_count=2, workers=cores)
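The multi-pass, decaying-learning-rate advice above is usually implemented as an explicit loop: pin `alpha` for each pass, shuffle the corpus, train, then lower `alpha`. A sketch of the schedule logic, with the gensim calls shown as comments; the number of passes and alpha values here are illustrative assumptions, not from the original post:

```python
import random

passes = 10
alpha, min_alpha = 0.025, 0.001
alpha_delta = (alpha - min_alpha) / (passes - 1)  # linear decay per pass

schedule = []
for epoch in range(passes):
    schedule.append(alpha)
    # random.shuffle(alldocs)                 # shuffle input order each pass
    # model.alpha = model.min_alpha = alpha   # pin the learning rate for this pass
    # model.train(alldocs)                    # one pass over the data
    alpha -= alpha_delta

print(schedule[0])             # 0.025
print(round(schedule[-1], 4))  # 0.001
```

This is the pattern used in the official doc2vec-IMDB notebook linked in the references.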

Train and save the model

# train and save the model
sentences = gensim.models.doc2vec.TaggedLineDocument('output.seq')
model = gensim.models.Doc2Vec(sentences, size=100, window=3)  # the constructor already trains one pass
model.train(sentences)  # optional extra pass over the corpus
model.save('all_model.txt')
# reload later with: model = gensim.models.Doc2Vec.load('all_model.txt')

Save document vectors

# save vectors as text, one per line (writing raw float32 bytes would be unreadable)
out = open('all_vector.txt', 'w')
for num in range(count):
    docvec = model.docvecs[num]
    out.write(' '.join('%f' % x for x in docvec) + '\n')
out.close()

Sanity check: document similarity within the training set

# test: calculate similarities
# note: docids are counted from 0
# find the documents most similar to the first document in the training set
sims = model.docvecs.most_similar(0)
print sims
# get similarity between doc 1 and doc 2 in the training data
sims = model.docvecs.similarity(1, 2)
print sims

Inferring vectors and comparing similarities

The code below checks that the model is sound: pick a random document from the training set, re-infer its vector with the model, and rank the training documents by similarity. If the model is well trained, the top hit should be the document we picked.

# check
#############################################################################
# A good check is to re-infer a vector for a document already in the model. #
# if the model is well-trained,                                             #
# the nearest doc should (usually) be the same document.                    #
#############################################################################

print 'examining'
doc_id = np.random.randint(model.docvecs.count)  # pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
inferred_docvec = model.infer_vector(alldocs[doc_id].words)
print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

Problems encountered

👇 The two errors below are still unresolved. According to the official guide this code should run, but I hit errors that I have not been able to fix.
The first failing snippet, about training the model

alldocs=[]
count=0
for line in input:
    #print line
    line=line.strip('\n')
    seg_list = jieba.cut(line)
    #output.write(line)
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list,count))
    count+=1

model = Doc2Vec(alldocs,size=100, window=2, min_count=5, workers=4)
model.train(alldocs)

Error message

Traceback (most recent call last):
  File "d2vTestv5.py", line 59, in 
    model = Doc2Vec(alldocs[0],size=100, window=2, min_count=5, workers=4)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 596, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 508, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 639, in scan_vocab
    document_length = len(document.words)
AttributeError: 'generator' object has no attribute 'words'
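Note that the traceback shows the actual call was `Doc2Vec(alldocs[0], ...)`: a single TaggedDocument rather than the list. Iterating over one TaggedDocument (a namedtuple) yields its fields, so the first "document" Doc2Vec sees is the words generator from `jieba.cut`, which has no `.words` attribute. A minimal reproduction of this likely cause, using a stand-in namedtuple instead of gensim:

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (a namedtuple of words and tags);
# the words field is a generator, just like the jieba.cut result above.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')
doc = TaggedDocument(words=iter(['今天', '奖励']), tags=[0])

# Iterating over ONE document yields its fields, not documents:
fields = [item for item in doc]
print(hasattr(fields[0], 'words'))  # False: this is the words generator itself

# Iterating over a LIST of documents yields documents, as Doc2Vec expects:
docs_seen = [d for d in [doc]]
print(hasattr(docs_seen[0], 'words'))  # True
```

So passing `alldocs` (the whole list) rather than `alldocs[0]`, and materializing `seg_list` with `list(jieba.cut(line))` so the generator is not consumed by the earlier `join`, should avoid this error.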

The second failing snippet, about inference

doc_words1=['验证','失败','验证码','未','收到']
doc_words2=['今天','奖励','有','哪些','呢']
# get infered vector
invec1 = model.infer_vector(doc_words1, alpha=0.1, min_alpha=0.0001, steps=5)
invec2 = model.infer_vector(doc_words2, alpha=0.1, min_alpha=0.0001, steps=5)
print invec1
print invec2

# get similarity
# the output docid is supposed to be 0
sims = model.docvecs.most_similar([invec1])
print sims

# according to official guide, the following codes are supposed to be fine, but it fails to run
sims= model.docvecs.similarity(invec1,invec2)
print model.similarity(['今天','有','啥','奖励'],['今天','奖励','有','哪些','呢'])

The last two lines fail with this error message:

Traceback (most recent call last):
  File "d2vTestv5.py", line 110, in 
    sims= model.docvecs.similarity(invec1,invec2)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 484, in similarity
    return dot(matutils.unitvec(self[d1]), matutils.unitvec(self[d2]))
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return vstack([self[i] for i in index])
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return vstack([self[i] for i in index])
TypeError: 'numpy.float32' object is not iterable
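The TypeError arises because `docvecs.similarity` indexes by doctag (an int or string label), not by raw vector: given a numpy array, it tries to look up each float as a doctag. To compare two inferred vectors directly, compute the cosine similarity yourself. A minimal pure-Python sketch of the same formula gensim applies internally (`invec1`/`invec2` replaced by toy vectors):

```python
import math

def cosine(v1, v2):
    # the computation gensim does internally: dot(unitvec(v1), unitvec(v2))
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# toy vectors standing in for invec1 / invec2
print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Likewise, `model.similarity(...)` expects two single words, not word lists; for comparing two lists of words, `model.n_similarity(words1, words2)` is the appropriate call in gensim of this era.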

More code

Review

Here we tried several approaches as a comparative study.

  • pure log model
  • pure baike (encyclopedia) model
  • baike model + retraining on logs
  • log vocabulary + baike model + retraining on logs (using the reset_weights method)
  1. Overall, building the vocabulary from the logs, training on baike data, then retraining on the logs works best, though adding baike data does not improve results dramatically.
  2. For the pure log model, window=5 and window=4 give similar results, both much better than window=2.
  3. The log model does not capture near-synonyms well: the top two words are very accurate, but the ones after them are not very representative, mainly because the vocabulary contains a lot of noise. After training on baike data, the word weights shift toward baike-style words. One might wonder why so many baike words appear even with a log-derived vocabulary: the logs themselves contain news/encyclopedia text that includes those words -- who pastes things like that into a log...
  4. An effective corpus and clean text data are the foundation of model analysis. An effective corpus and clean text data are the foundation of model analysis. An effective corpus and clean text data are the foundation of model analysis. Important things deserve saying three times!

e.g. the words most similar to "奖励" (reward)

# pure log model
奖    0.866039454937
奖金    0.838458776474
礼    0.698936760426
截止    0.662528753281
%    0.639326810837
周期    0.61717569828
1.8    0.609462141991
抽奖    0.581079006195
责    0.580395340919
消息    0.57931292057

# log vocabulary, trained on baike data
嘉奖    0.607903599739
奖赏    0.607445776463
报酬    0.59623169899
声望    0.580911517143
阴谋    0.557106971741
表扬    0.54744797945
奖品    0.543839931488
惩罚    0.540722668171
弱点    0.535359799862
俸禄    0.532780826092

# log vocabulary, trained on baike data, retrained on logs
奖    0.86665225029
奖金    0.828586399555
补贴    0.731625974178
补助    0.640836119652
回事    0.638447761536
补偿    0.63090801239
账    0.630112946033
帐    0.605027675629
区别    0.58495759964
原因    0.584367990494

Reference links
https://radimrehurek.com/gensim/models/doc2vec.html
https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://blog.csdn.net/raycchou/article/details/50971599
