Implementing LDA with Gensim. Compared with JGibbLDA, Gensim is slightly more cumbersome to use, but it feels clearer and easier to understand, and is therefore more flexible.
Introduction to LDA
LDA is a typical bag-of-words model: a document is a collection of words, with no ordering or sequential relationship among them. A document can contain multiple topics, and every word in the document is generated by one of those topics.
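As a quick illustration of the bag-of-words idea, here is a minimal sketch (the toy tokens are made up for illustration): each document is reduced to (token id, count) pairs, and word order is discarded.

```python
from gensim import corpora

# two toy documents; word order is about to be thrown away
texts = [["apple", "banana", "apple"], ["banana", "cherry"]]
dictionary = corpora.Dictionary(texts)
# bag-of-words: (token id, count) pairs, no ordering information
print(dictionary.doc2bow(texts[0]))  # e.g. [(0, 2), (1, 1)]: "apple" x2, "banana" x1
```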
The concepts you need to understand are:
- One function: the gamma function
- Two distributions: the Beta distribution and the Dirichlet distribution (see the numeric check after this list)
- One model: LDA (document-topic, topic-word)
- One sampler: Gibbs sampling
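To see how the first two items fit together, here is a small numeric check (the values are chosen arbitrarily): the Beta distribution's normalizing constant is built from gamma functions, and the Dirichlet distribution generalizes the Beta to more than two dimensions.

```python
from scipy.special import beta, gamma

# the Beta function B(a, b) is defined via the gamma function:
# B(a, b) = gamma(a) * gamma(b) / gamma(a + b)
a, b = 2.0, 3.0
print(beta(a, b))                          # 0.0833...
print(gamma(a) * gamma(b) / gamma(a + b))  # same value
```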
Core formula: $p(w|d) = \sum_t p(w|t) \cdot p(t|d)$
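A tiny numeric illustration of this formula (the matrices below are made-up toy values): the word distribution of a document is a mixture of its topics' word distributions.

```python
import numpy as np

# p(w|t): one row per topic, one column per word (toy values)
p_w_given_t = np.array([[0.5, 0.3, 0.1, 0.1],   # topic 0
                        [0.1, 0.1, 0.4, 0.4]])  # topic 1
# p(t|d): topic mixture of a single document (toy values)
p_t_given_d = np.array([0.7, 0.3])
# p(w|d) = sum_t p(w|t) * p(t|d), i.e. a weighted average over topics
p_w_given_d = p_t_given_d.dot(p_w_given_t)
print(p_w_given_d)        # [0.38 0.24 0.19 0.19]
print(p_w_given_d.sum())  # 1.0, still a valid distribution
```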
The generative process for a document
- Sample the topic distribution $\theta_i$ of document $i$ from a Dirichlet distribution
- Sample the topic $z_{i,j}$ of the $j$-th word in document $i$ from the multinomial distribution $\theta_i$
- Sample the word distribution $\phi_{z_{i,j}}$ of that topic from a Dirichlet distribution
- Sample the final word $w_{i,j}$ from the multinomial word distribution $\phi_{z_{i,j}}$ (the whole process is sketched in code below)
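The process above can be simulated directly. Here is a minimal numpy sketch, where the topic count, vocabulary size, document length, and the symmetric priors alpha and beta are all made-up toy values, not part of the original text.

```python
import numpy as np

n_topics, vocab_size, doc_len = 3, 10, 8   # toy sizes
alpha, beta = 0.1, 0.01                    # symmetric Dirichlet priors (assumed)

# topic-word distributions: one phi per topic, each ~ Dirichlet(beta)
phi = np.random.dirichlet([beta] * vocab_size, size=n_topics)
# document-topic distribution: theta_i ~ Dirichlet(alpha)
theta = np.random.dirichlet([alpha] * n_topics)

for j in range(doc_len):
    z_ij = np.random.choice(n_topics, p=theta)        # topic of the j-th word
    w_ij = np.random.choice(vocab_size, p=phi[z_ij])  # the word itself
    print("position {}: topic {}, word id {}".format(j, z_ij, w_ij))
```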
How to choose the number of topics
- Minimize the similarity between topics
- Perplexity (a gensim sketch follows this list)
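For the perplexity criterion, gensim exposes `LdaModel.log_perplexity`. A minimal sketch, assuming `corpus` and `dictionary` have been built as in the implementation section below, and with an arbitrary list of candidate topic counts:

```python
from gensim import models

# log_perplexity returns a per-word likelihood bound; perplexity = 2 ** (-bound)
# ideally evaluate on held-out documents rather than the training corpus
for k in [20, 50, 100]:
    lda_k = models.LdaModel(corpus, id2word=dictionary, num_topics=k)
    print("num_topics={}: per-word bound={}".format(k, lda_k.log_perplexity(corpus)))
```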
Python gensim implementation
```python
# install the related python packages
#   pip install numpy
#   pip install scipy
#   pip install gensim
#   pip install jieba

from gensim import corpora, models, similarities
import logging
import jieba

# configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load data from file
f = open('newfile.txt', 'r')
documents = f.readlines()

# tokenize
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]

# build the id->word mapping (the dictionary)
dictionary = corpora.Dictionary(texts)
# keep tokens that appear in at least 40 documents and in no more than 10% of all documents
dictionary.filter_extremes(no_below=40, no_above=0.1)
# save the dictionary
dictionary.save('dict_v1.dict')

# build the corpus as bag-of-words vectors
corpus = [dictionary.doc2bow(text) for text in texts]

# initialize a TF-IDF model
tfidf = models.TfidfModel(corpus)
# use the model to transform vectors, applying the transformation to the whole corpus
corpus_tfidf = tfidf[corpus]

# extract 100 LDA topics, using 500 iterations
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=100, iterations=500)
# save the model to a file
lda.save('mylda_v1.pkl')

# print the topic composition, with scores, for the first document;
# only a few topics are represented, the others have a near-zero score
for index, score in sorted(lda[corpus_tfidf[0]], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda.print_topic(index, 10)))

# print the most contributing words for 100 topics
lda.print_topics(100)

# load the model and dictionary back
model = models.LdaModel.load('mylda_v1.pkl')
dictionary = corpora.Dictionary.load('dict_v1.dict')

# predict unseen data
query = "未收到奖励"  # "reward not received"
query_bow = dictionary.doc2bow(list(jieba.cut(query, cut_all=False)))
for index, score in sorted(model[query_bow], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))

# to predict many lines of data from a file, do the following
f = open('newfile.txt', 'r')
documents = f.readlines()
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]
corpus = [dictionary.doc2bow(text) for text in texts]
# only print the topic with the highest score for each document
for c in corpus:
    index, score = sorted(model[c], key=lambda tup: -1 * tup[1])[0]
    print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))
```
Tips
If you encounter encoding problems, you can try the following code.
```python
# add this at the very beginning of your python file
# -*- coding: utf-8 -*-

# also, do the following (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# the following call may lead to encoding problems when there are Chinese characters
model.show_topics(-1, 5)
# use this instead
model.print_topics(-1, 5)
```
You can see step-by-step output in the following references.
References:
- https://radimrehurek.com/gensim/tut2.html (official guide, English)
- http://blog.csdn.net/questionfish/article/details/46725475 (official guide, Chinese translation)
- https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation