TY - GEN
T1 - Knowledge discovery from citation networks
AU - Guo, Zhen
AU - Zhang, Zhongfei
AU - Zhu, Shenghuo
AU - Chi, Yun
AU - Gong, Yihong
PY - 2009
Y1 - 2009
N2 - Knowledge discovery from scientific articles has received increasing attentions recently since huge repositories are made available by the development of the Internet and digital databases. In a corpus of scientific articles such as a digital library, documents are connected by citations and one document plays two different roles in the corpus: document itself and a citation of other documents. In the existing topic models, little effort is made to differentiate these two roles. We believe that the topic distributions of these two roles are different and related in a certain way. In this paper we propose a Bernoulli Process Topic (BPT) model which models the corpus at two levels: document level and citation level. In the BPT model, each document has two different representations in the latent topic space associated with its roles. Moreover, the multilevel hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach. In addition to conducting the experimental evaluations on the document modeling task, we also apply the BPT model to a well known scientific corpus to discover the latent topics. The comparisons against state-of-the-art methods demonstrate a very promising performance.
AB - Knowledge discovery from scientific articles has received increasing attentions recently since huge repositories are made available by the development of the Internet and digital databases. In a corpus of scientific articles such as a digital library, documents are connected by citations and one document plays two different roles in the corpus: document itself and a citation of other documents. In the existing topic models, little effort is made to differentiate these two roles. We believe that the topic distributions of these two roles are different and related in a certain way. In this paper we propose a Bernoulli Process Topic (BPT) model which models the corpus at two levels: document level and citation level. In the BPT model, each document has two different representations in the latent topic space associated with its roles. Moreover, the multilevel hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach. In addition to conducting the experimental evaluations on the document modeling task, we also apply the BPT model to a well known scientific corpus to discover the latent topics. The comparisons against state-of-the-art methods demonstrate a very promising performance.
KW - Latent models
KW - Text mining
KW - Unsupervised learning
UR - https://www.scopus.com/pages/publications/77951173676
U2 - 10.1109/ICDM.2009.137
DO - 10.1109/ICDM.2009.137
M3 - Conference contribution
SN - 9780769538952
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 800
EP - 805
BT - ICDM 2009 - The 9th IEEE International Conference on Data Mining
T2 - 9th IEEE International Conference on Data Mining, ICDM 2009
Y2 - 6 December 2009 through 9 December 2009
ER -