CMStatistics 2022
Title: Unsupervised topic identification in large short text corpora using mixture models
Authors: Jocelyn Mazarura - University of Pretoria (South Africa) [presenting]
Alta De Waal - University of Pretoria (South Africa)
Pieter De Villiers - University of Pretoria (South Africa)
Abstract: Topic modelling is a subfield of natural language processing whose objective is to discover latent topics in large unlabelled corpora. Short texts, such as tweets and reviews, have become increasingly relevant due to the growing popularity of social media and online shopping. Traditional topic models assume that a document is generated from multiple topics. Whilst this assumption may be acceptable for long texts, such as e-books and news articles, many studies have shown that the one-topic-per-document assumption imposed by mixture models, such as the Dirichlet-multinomial mixture (DMM) model, fits short texts better. Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative for modelling count data. It has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. The main contributions are a new Gamma-Poisson mixture (GPM) model and a collapsed Gibbs sampler for it, which enables the number of topics contained in the corpus to be learned automatically. The results show that the GPM performs better than the DMM at selecting the number of topics in labelled corpora. Furthermore, the GPM produces better topic coherence scores, making it a viable option for the challenging task of topic modelling of short text.
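
The abstract does not give the model equations, so the following is only a minimal sketch of what a collapsed Gibbs sampler for a Gamma-Poisson mixture could look like, not the authors' implementation. It assumes each document is a vector of word counts, each count is Poisson with a topic-specific rate, and each rate has a Gamma(alpha, beta) prior (shape, rate), so the rates integrate out to a negative binomial predictive. The names `log_predictive` and `gibbs_gpm` and the hyperparameters `alpha`, `beta` and `gamma` are hypothetical, and the symmetric Dirichlet prior on topic weights is likewise an assumption.

```python
import math
import random


def log_predictive(doc, word_sums, n_docs, alpha, beta):
    """Log predictive of `doc` for one cluster, Poisson rates integrated out.

    Factors that are the same for every cluster (the 1/x! terms) are dropped.
    """
    total = 0.0
    for v, x in enumerate(doc):
        shape = alpha + word_sums[v]  # posterior Gamma shape for word v
        total += math.lgamma(shape + x) - math.lgamma(shape)
        total += shape * math.log(beta + n_docs)
        total -= (shape + x) * math.log(beta + n_docs + 1)
    return total


def gibbs_gpm(docs, n_topics, alpha=0.1, beta=1.0, gamma=0.1, n_iters=50, seed=0):
    """Assign one topic per document by collapsed Gibbs sampling (sketch)."""
    rng = random.Random(seed)
    vocab = len(docs[0])
    z = [rng.randrange(n_topics) for _ in docs]        # random initial topics
    n_docs = [0] * n_topics                            # documents per topic
    word_sums = [[0] * vocab for _ in range(n_topics)]  # word counts per topic
    for doc, k in zip(docs, z):
        n_docs[k] += 1
        for v, x in enumerate(doc):
            word_sums[k][v] += x

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            # Remove document i from its current topic.
            k_old = z[i]
            n_docs[k_old] -= 1
            for v, x in enumerate(doc):
                word_sums[k_old][v] -= x

            # Unnormalised log probability of each topic for document i.
            log_w = [
                math.log(n_docs[k] + gamma)
                + log_predictive(doc, word_sums[k], n_docs[k], alpha, beta)
                for k in range(n_topics)
            ]
            top = max(log_w)
            weights = [math.exp(w - top) for w in log_w]

            # Sample a new topic and add the document back.
            k_new = rng.choices(range(n_topics), weights=weights)[0]
            z[i] = k_new
            n_docs[k_new] += 1
            for v, x in enumerate(doc):
                word_sums[k_new][v] += x
    return z


# Toy usage: four documents over a four-word vocabulary, up to four topics.
toy_docs = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
assignments = gibbs_gpm(toy_docs, n_topics=4, n_iters=20, seed=1)
print("non-empty topics:", len(set(assignments)))
```

In a sketch like this, the number of topics is read off as the number of non-empty clusters after sampling, starting from a deliberately generous `n_topics`. Whether the GPM presented here selects the number of topics in exactly this way is not stated in the abstract.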