LDA and Its Gensim Implementation

Implementing LDA with Gensim is a bit more involved than with JGibbLDA, but it feels clearer and easier to understand, and therefore more flexible.

Introduction to LDA

LDA is a typical bag-of-words model: a document is treated as a collection of words, with no ordering or sequential relationship between them. A document can contain multiple topics, and every word in the document is generated by one of those topics.

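As a quick illustration of the bag-of-words assumption, here is a minimal sketch (the toy sentences are made up): two documents containing the same words in a different order map to exactly the same representation.

from gensim import corpora

# two toy "documents" with the same words in different order
docs = [["the", "cat", "chased", "the", "dog"],
        ["the", "dog", "chased", "the", "cat"]]
dictionary = corpora.Dictionary(docs)

# both print the same (token_id, count) pairs: word order is discarded
print(dictionary.doc2bow(docs[0]))
print(dictionary.doc2bow(docs[1]))
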
The concepts you need to understand:

  • One function: the gamma function
  • Two distributions: the beta distribution and the Dirichlet distribution
  • One model: LDA (document-topic, topic-word)
  • One sampling method: Gibbs sampling

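For reference, the Dirichlet density, which is also where the gamma function enters (standard definition; the beta distribution is its two-dimensional special case):

$\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
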
Core formula:

$p(w \mid d) = \sum_{t} p(w \mid t) \, p(t \mid d)$

That is, the probability of word $w$ appearing in document $d$ is obtained by marginalizing the topic-word and document-topic distributions over all topics $t$.

The generative process of a document

  • Sample document $i$'s topic distribution $\theta_i$ from a Dirichlet distribution
  • Sample the topic $z_{i,j}$ of the $j$-th word in document $i$ from the multinomial distribution $\theta_i$
  • Sample the word distribution $\varphi_{z_{i,j}}$ for topic $z_{i,j}$ from a Dirichlet distribution
  • Sample the final word $w_{i,j}$ from the multinomial word distribution $\varphi_{z_{i,j}}$

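A toy numpy sketch of these four steps (all sizes and hyperparameters below are made up for illustration):

import numpy as np

np.random.seed(0)
num_topics, vocab_size, doc_len = 3, 10, 8
alpha = np.full(num_topics, 0.1)   # document-topic Dirichlet hyperparameter
eta = np.full(vocab_size, 0.01)    # topic-word Dirichlet hyperparameter

# sample each topic's word distribution phi_t from Dir(eta)
phi = np.random.dirichlet(eta, size=num_topics)

# sample this document's topic distribution theta_i from Dir(alpha)
theta = np.random.dirichlet(alpha)

doc = []
for j in range(doc_len):
    z_ij = np.random.choice(num_topics, p=theta)      # topic of the j-th word
    w_ij = np.random.choice(vocab_size, p=phi[z_ij])  # word drawn from topic z_ij
    doc.append(w_ij)
print(doc)  # the generated document, as a list of word ids
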
How to choose the number of topics

  • Minimize the similarity between topics
  • Perplexity (see the sketch after this list)

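A rough way to compare candidate topic counts is gensim's log_perplexity (a sketch only: it reuses the corpus and dictionary built in the script below, and for simplicity evaluates on the training corpus, though a held-out set is preferable):

from gensim import models

for k in (20, 50, 100):
    lda_k = models.LdaModel(corpus, id2word=dictionary, num_topics=k)
    # per-word likelihood bound; higher (less negative) indicates a better fit
    print(k, lda_k.log_perplexity(corpus))
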
Python gensim implementation

# install the required python packages
$ pip install numpy
$ pip install scipy
$ pip install gensim
$ pip install jieba

from gensim import corpora, models, similarities
import logging
import jieba

# configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load data from file (one document per line)
f = open('newfile.txt', 'r', encoding='utf-8')
documents = f.readlines()

# tokenize each document with jieba
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]

# build the id->word mapping (the dictionary)
dictionary = corpora.Dictionary(texts)

# keep tokens that appear in at least 10 documents and in no more than 40% of all documents
dictionary.filter_extremes(no_below=10, no_above=0.4)

# save dictionary
dictionary.save('dict_v1.dict')

# load corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# initialize a model
tfidf = models.TfidfModel(corpus)

# use the model to transform vectors, applying the transformation to the whole corpus
# (note: LdaModel is more commonly trained on raw bag-of-words counts; feeding it
# tfidf weights runs, but departs from LDA's generative assumptions)
corpus_tfidf = tfidf[corpus]

# extract 100 LDA topics, using 1 pass, updating once every chunk of 10,000 documents, and 500 iterations
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=100,
                      passes=1, update_every=1, chunksize=10000, iterations=500)

# save model to files
lda.save('mylda_v1.pkl')

# print the topic composition and scores for the first document; you will see that
# only a few topics get a meaningful score, the rest are near zero
for index, score in sorted(lda[corpus_tfidf[0]], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda.print_topic(index, 10)))

# print the most contributing words for 100 randomly selected topics
lda.print_topics(100)

# load model and dictionary
model = models.LdaModel.load('mylda_v1.pkl')
dictionary = corpora.Dictionary.load('dict_v1.dict')

# predict topics for unseen data
query = "未收到奖励"
query_bow = dictionary.doc2bow(list(jieba.cut(query, cut_all=False)))
for index, score in sorted(model[query_bow], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))

# to predict many lines of data from a file, do the following
f = open('newfile.txt', 'r', encoding='utf-8')
documents = f.readlines()
texts = [[word for word in jieba.cut(document, cut_all=False)] for document in documents]
corpus = [dictionary.doc2bow(text) for text in texts]

# print only the topic with the highest score for each document
for c in corpus:
    doc_topics = sorted(model[c], key=lambda tup: -1*tup[1])
    if doc_topics:
        index, score = doc_topics[0]
        print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))

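Alternatively, gensim's LdaModel also exposes get_document_topics, which filters out low-probability topics directly (a sketch; the 0.05 threshold is an arbitrary choice):

# let gensim drop low-probability topics, then keep the best one
for c in corpus:
    doc_topics = model.get_document_topics(c, minimum_probability=0.05)
    if doc_topics:
        index, score = max(doc_topics, key=lambda tup: tup[1])
        print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 20)))
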
Tips

If you run into encoding problems, you can try the following workarounds. Note that they target Python 2; in Python 3, strings are unicode by default and none of this is needed.

# add this at the top of your Python 2 source file
# -*- coding: utf-8 -*-

# additionally (Python 2 only; reload and sys.setdefaultencoding are gone in Python 3)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# show_topics may raise encoding errors when the topics contain Chinese characters
model.show_topics(-1, 5)

# use print_topics instead
model.print_topics(-1, 5)

The references below walk through the same steps with example output.

References:

https://radimrehurek.com/gensim/tut2.html (official guide, English)

http://blog.csdn.net/questionfish/article/details/46725475 (official guide, Chinese translation)

https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation (LDA experiments on the Wikipedia corpus)
