There is not much material available on gensim's doc2vec, so I did some exploratory experiments based on the official API. This post covers training a model with gensim's doc2vec, inferring vectors for new documents, and computing similarities. Some parts are still rough and will be improved later.
Importing modules
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import gensim, logging
import os
import jieba

# logging information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Reading the input file
# get input file, text format
f = open('trainingdata.txt', 'r')
input = f.readlines()
count = len(input)
print count
Preprocessing the file: word segmentation, etc.
# read file and separate words
output = open('output.seq', 'w')
alldocs = []   # for the sake of check, can be removed
count = 0      # for the sake of check, can be removed
for line in input:
    line = line.strip('\n')
    seg_list = list(jieba.cut(line))   # materialize the generator so it can be reused below
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list, [count]))   # tags must be a list; for the sake of check, can be removed
    count += 1   # for the sake of check, can be removed
output.close()
Choosing a model
gensim's Doc2Vec provides two models, PV-DM and PV-DBOW. The gensim documentation suggests training over the dataset multiple times, adjusting the learning rate or shuffling the input order on each pass, to get the best results.
# PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=2, hs=0, min_count=2, workers=cores)

# PV-DBOW
Doc2Vec(sentences, dm=0, size=100, hs=0, min_count=2, workers=cores)

# PV-DM w/ average
Doc2Vec(sentences, dm=1, dm_mean=1, size=100, window=2, hs=0, min_count=2, workers=cores)
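The multi-pass training with a decreasing learning rate and per-pass shuffling mentioned above can be done along the lines of the doc2vec-IMDB notebook. A rough sketch, assuming the alldocs list built earlier, an already constructed model, and the older gensim API where train() can be called without total_examples/epochs:

from random import shuffle

alpha, min_alpha, passes = 0.025, 0.001, 20
alpha_delta = (alpha - min_alpha) / passes

for epoch in range(passes):
    shuffle(alldocs)                                # shuffle the training documents each pass
    model.alpha, model.min_alpha = alpha, alpha     # fix the learning rate for this pass
    model.train(alldocs)
    alpha -= alpha_delta                            # decay the learning rate between passes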
Training and saving the model
# train and save the model
sentences = gensim.models.doc2vec.TaggedLineDocument('output.seq')
model = gensim.models.Doc2Vec(sentences, size=100, window=3)
model.train(sentences)
model.save('all_model.txt')
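Despite the .txt extension, save() writes gensim's own pickle-based format rather than plain text; the model can be loaded back later with Doc2Vec.load:

# load the previously saved model for further training or queries
model = gensim.models.Doc2Vec.load('all_model.txt')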
Saving the document vectors
# save vectors
out = open("all_vector.txt", "w")
for num in range(0, count):
    docvec = model.docvecs[num]
    out.write(' '.join(str(x) for x in docvec) + '\n')   # one space-separated vector per line
    #print num
    #print docvec
out.close()
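Alternatively, the whole matrix of document vectors can be written in one call with numpy; a small sketch, assuming the same count of trained documents:

import numpy as np

# stack all document vectors into a (count x vector_size) matrix and save it as text
np.savetxt('all_vector.txt', np.vstack([model.docvecs[n] for n in range(count)]))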
Sanity check: document similarities within the training set
# test, calculate the similarity
# note: doc ids are counted from 0
# find the documents most similar to the first document in the training set
sims = model.docvecs.most_similar(0)
print sims

# get similarity between doc 1 and doc 2 in the training data
sims = model.docvecs.similarity(1, 2)
print sims
Inferring vectors and comparing similarities
The code below is a sanity check for the model: pick a random document from the training set, re-infer its vector with the model, and compute its similarity to the training documents. If the model is good, the most similar document should be the one we picked.
# check
#############################################################################
# A good check is to re-infer a vector for a document already in the model. #
# If the model is well-trained,                                             #
# the nearest doc should (usually) be the same document.                    #
#############################################################################
import numpy as np

print 'examining'
doc_id = np.random.randint(model.docvecs.count)   # pick a random doc; re-run for more examples
print('for doc %d...' % doc_id)
inferred_docvec = model.infer_vector(alldocs[doc_id].words)
print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))
Problems encountered
The two errors below are still under investigation. According to the official guide the code should run, but I hit errors that I have not been able to resolve.
First failing snippet: training the model
alldocs = []
count = 0
for line in input:
    #print line
    line = line.strip('\n')
    seg_list = jieba.cut(line)
    #output.write(line)
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list, count))
    count += 1

model = Doc2Vec(alldocs, size=100, window=2, min_count=5, workers=4)
model.train(alldocs)
The error message
Traceback (most recent call last):
  File "d2vTestv5.py", line 59, in <module>
    model = Doc2Vec(alldocs[0], size=100, window=2, min_count=5, workers=4)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 596, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 508, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 639, in scan_vocab
    document_length = len(document.words)
AttributeError: 'generator' object has no attribute 'words'
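My current guess at the cause (not yet verified): jieba.cut returns a generator that is already exhausted by the ' '.join call before it reaches TaggedDocument, the tags argument should be a list, and the traceback shows a single TaggedDocument (alldocs[0]) being passed where Doc2Vec expects an iterable of them. A sketch of a possible fix along those lines:

# possible fix: materialize the words, wrap the tag in a list,
# and pass the whole list of TaggedDocuments
alldocs = []
for count, line in enumerate(input):
    words = list(jieba.cut(line.strip('\n')))   # a concrete list, not a generator
    alldocs.append(gensim.models.doc2vec.TaggedDocument(words, [count]))
model = gensim.models.Doc2Vec(alldocs, size=100, window=2, min_count=5, workers=4)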
Second failing snippet: inference
doc_words1 = ['验证', '失败', '验证码', '未', '收到']
doc_words2 = ['今天', '奖励', '有', '哪些', '呢']

# get inferred vectors
invec1 = model.infer_vector(doc_words1, alpha=0.1, min_alpha=0.0001, steps=5)
invec2 = model.infer_vector(doc_words2, alpha=0.1, min_alpha=0.0001, steps=5)
print invec1
print invec2

# get similarity
# the output docid is supposed to be 0
sims = model.docvecs.most_similar([invec1])
print sims

# according to the official guide, the following lines are supposed to work, but they fail
sims = model.docvecs.similarity(invec1, invec2)
print model.similarity(['今天', '有', '啥', '奖励'], ['今天', '奖励', '有', '哪些', '呢'])
The last two lines raise an error:
Traceback (most recent call last):
  File "d2vTestv5.py", line 110, in <module>
    sims = model.docvecs.similarity(invec1, invec2)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 484, in similarity
    return dot(matutils.unitvec(self[d1]), matutils.unitvec(self[d2]))
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return vstack([self[i] for i in index])
  File "/usr/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return vstack([self[i] for i in index])
TypeError: 'numpy.float32' object is not iterable
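As far as I can tell, docvecs.similarity expects document tags (the integer ids or string tags used during training) rather than raw vectors, which is why passing inferred vectors fails; likewise model.similarity compares two individual words, not two word lists. A sketch of a workaround, computing the cosine similarity of the inferred vectors directly and using n_similarity for the word lists:

import numpy as np
from gensim import matutils

# cosine similarity between the two inferred document vectors
print np.dot(matutils.unitvec(invec1), matutils.unitvec(invec2))

# similarity between two bags of words (every word must be in the vocabulary)
print model.n_similarity(['今天', '有', '啥', '奖励'], ['今天', '奖励', '有', '哪些', '呢'])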
Retrospective
We tried quite a few setups for comparison:
- pure log model (trained only on log data)
- pure encyclopedia (baike) model
- baike model + retraining on log data
- log vocabulary + baike model + retraining on log data (using the reset_weights method)
- Overall, the log-vocabulary / baike-trained / log-retrained combination works somewhat better, although adding the encyclopedia data does not improve results dramatically.
- For the pure log model, window=5 and window=4 give similar results, both much better than window=2.
- The log model does not capture related words very well: the first two neighbours are very accurate, but the following ones are not very representative, mainly because the vocabulary contains a lot of noise. After training on encyclopedia data, the word weights are adjusted and lean toward encyclopedia words. One might ask why a model with a log-derived vocabulary trained on encyclopedia data still surfaces so many encyclopedia words: the logs themselves contain news/encyclopedia text and therefore include those words (who would be bored enough to paste that in...).
- An effective corpus and clean text data are the foundation of model analysis. An effective corpus and clean text data are the foundation of model analysis. An effective corpus and clean text data are the foundation of model analysis. Important things deserve to be said three times!
e.g. the words most similar to "奖励" (reward)
# pure log model
奖      0.866039454937
奖金    0.838458776474
礼      0.698936760426
截止    0.662528753281
%       0.639326810837
周期    0.61717569828
1.8     0.609462141991
抽奖    0.581079006195
责      0.580395340919
消息    0.57931292057

# log vocabulary, trained on encyclopedia (baike) data
嘉奖    0.607903599739
奖赏    0.607445776463
报酬    0.59623169899
声望    0.580911517143
阴谋    0.557106971741
表扬    0.54744797945
奖品    0.543839931488
惩罚    0.540722668171
弱点    0.535359799862
俸禄    0.532780826092

# log vocabulary, trained on encyclopedia (baike) data, retrained on logs
奖      0.86665225029
奖金    0.828586399555
补贴    0.731625974178
补助    0.640836119652
回事    0.638447761536
补偿    0.63090801239
账      0.630112946033
帐      0.605027675629
区别    0.58495759964
原因    0.584367990494
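For reference, word-similarity lists like these can be produced with most_similar on the trained model; a minimal one-liner (Python 2, so the query word is given as a unicode string):

# the ten words closest to "奖励" in the current model
print model.most_similar(u'奖励', topn=10)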
References
https://radimrehurek.com/gensim/models/doc2vec.html
https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://blog.csdn.net/raycchou/article/details/50971599