Scalable Inference for Logistic-Normal Topic Models

Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng and Bo Zhang

State Key Lab of Intelligent Technology & Systems; Tsinghua National Laboratory for Information Science and Technology (TNList); Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

{chenjf10,wangzi10}@mails.tsinghua.edu.cn; {dcszj,dcszb}@mail.tsinghua.edu.cn; xunzheng@cs.cmu.edu

Abstract

Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise of our approach.
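To make the data-augmentation idea concrete, below is a minimal, hypothetical sketch of the per-document augmented Gibbs step for the logistic-normal mixing weights. It assumes a diagonal Gaussian prior N(mu, sigma2) on each component of eta for simplicity (the paper's partially collapsed sampler works with the full covariance of the logistic-normal prior and is parallelized), and it relies on the third-party pypolyagamma package for Polya-Gamma draws; all names are illustrative and are not taken from the released code.

import numpy as np
from pypolyagamma import PyPolyaGamma  # third-party Polya-Gamma sampler (assumed available)

pg = PyPolyaGamma(seed=0)

def sample_eta(eta, counts, mu, sigma2):
    """One augmented Gibbs sweep over the K components of a document's eta.

    eta    : (K,) current unnormalized log topic weights for the document
    counts : (K,) topic-assignment counts C_k for the document
    mu, sigma2 : per-component prior mean and variance (diagonal prior assumed)
    """
    K = len(eta)
    N = float(counts.sum())
    for k in range(K):
        # Condition on the other components: rho_k = eta_k - log sum_{j != k} exp(eta_j)
        zeta = np.logaddexp.reduce(np.delete(eta, k))
        rho = eta[k] - zeta
        # Data augmentation: lambda_k | rest ~ PG(N, rho_k)
        lam = pg.pgdraw(N, rho)
        # Given lambda_k, the conditional of eta_k is Gaussian
        kappa = counts[k] - N / 2.0
        prec = lam + 1.0 / sigma2
        mean = (kappa + lam * zeta + mu / sigma2) / prec
        eta[k] = np.random.normal(mean, 1.0 / np.sqrt(prec))
    return eta

Alternating this step with the usual Gibbs updates of the per-token topic assignments and of the prior parameters would give a toy single-machine version of such a sampler; the point of the sketch is only to show how the augmentation turns the non-conjugate logistic-normal likelihood into a Gaussian conditional.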

Full text

Download (1.85 MB)

Code

Download (0.36 MB)

GitHub

Approximate BibTeX Entry

@inproceedings{chen_nips_2013,
    Year = {2013},
    Booktitle = {NIPS 2013},
    Author = {Jianfei Chen and Jun Zhu and Zi Wang and Xun Zheng and Bo Zhang},
    Title = {Scalable Inference for Logistic-Normal Topic Models}
}

Demonstration

Click on a node to view the topics it contains.

This is a visualization of the correlation structure of 1,000 topics learned by the correlated topic model (CTM) using our scalable sampler on the NYTimes corpus with 285,000 documents. We build a two-layer hierarchy by clustering the learned topics, using their learned correlation strength as the similarity measure. To convey their semantic meanings, we show the 20 most frequent words of each topic in the box at the corner; for each topic cluster, we also show the most frequent words of a hyper-topic built by aggregating all of its member topics. On the top layer, the size of each node is proportional to the number of topics contained in the hyper-topic. Many topics clearly exhibit strong correlations, and the structure helps users understand and browse the large collection of topics. With 40 machines, our parallel Gibbs sampler finishes training in 2 hours, demonstrating that real-world corpora can be processed at considerable speed.
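A two-layer hierarchy of this kind can be built, at least in spirit, by clustering topics with the learned correlation strengths as similarities. Below is a hypothetical sketch using SciPy's agglomerative clustering; corr stands for the K x K topic correlation matrix derived from the learned covariance, and the cluster count of 50 is an arbitrary illustration, not the setting used for the demo.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_topics(corr, n_clusters=50):
    """Group K topics into clusters, using correlation strength as the similarity.

    corr : (K, K) topic-topic correlation matrix with entries in [-1, 1]
    """
    dist = 1.0 - corr                                   # turn similarity into a distance
    condensed = dist[np.triu_indices_from(dist, k=1)]   # upper-triangle distances
    Z = linkage(condensed, method='average')            # agglomerative clustering
    return fcluster(Z, t=n_clusters, criterion='maxclust')

# labels = cluster_topics(corr)
# sizes = np.bincount(labels)[1:]   # top-layer node sizes: number of topics per hyper-topic

The commented usage lines show how the top-layer node sizes in the visualization could be derived: each cluster becomes a hyper-topic, and its node size is proportional to the number of topics it contains.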