View Submission - COMPSTAT2022
Title: Semiparametric latent topic modeling on consumer-generated corpora
Authors: Dominic Dayta - University of the Philippines (Philippines) [presenting]
Erniel Barrios - University of the Philippines (Philippines)
Abstract: Legacy procedures for topic modeling have generally suffered from overfitting and from weakness in reconstructing sparse topic structures. SemiparTM, a two-step approach to topic modeling that combines nonnegative matrix factorization with semiparametric regression, is proposed. SemiparTM enables the reconstruction of sparse topic structures in the corpus and provides a generative model for predicting topics in new documents entering the corpus. Assuming that auxiliary information related to the topics is available, this approach performs better in discovering underlying topic structures when the corpus is small and limited in vocabulary. On an actual consumer feedback corpus, SemiparTM also provides interpretable and useful topic definitions comparable with those produced by the legacy methods.
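The two-step idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' estimation procedure, so everything here is an assumption made for illustration: step 1 uses Lee-Seung multiplicative updates for the nonnegative matrix factorization, and step 2 uses a partially linear model (one covariate entering linearly, one through a truncated-linear spline) fitted by ordinary least squares. The toy document-term matrix `X`, the covariates `x` and `z`, the knot location, and the helper names `nmf`, `design`, and `predict_topics` are all hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate X by W @ H with W, H >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def design(x, z, knots):
    """Intercept, linear term in x, and a truncated-linear spline in z."""
    cols = [np.ones_like(z), x, z] + [np.maximum(z - t, 0.0) for t in knots]
    return np.column_stack(cols)

# Toy document-term counts: 6 documents, 6 vocabulary terms (made up).
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [4, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 1, 4, 3],
], dtype=float)

# Hypothetical auxiliary information attached to each document.
x = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # e.g. a binary flag
z = np.array([1.0, 1.5, 2.0, 4.0, 4.5, 5.0])   # e.g. a rating score

# Step 1: factorize the corpus; rows of W are document-topic weights,
# rows of H are topic-term weights.
W, H = nmf(X, k=2)
theta = W / W.sum(axis=1, keepdims=True)       # topic proportions per document

# Step 2: regress topic proportions on the auxiliary information.
knots = [3.0]                                   # assumed knot location
B = design(x, z, knots)
coef, *_ = np.linalg.lstsq(B, theta, rcond=None)

def predict_topics(x_new, z_new):
    """Predict topic proportions for new documents from covariates alone."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    z_new = np.atleast_1d(np.asarray(z_new, dtype=float))
    pred = np.clip(design(x_new, z_new, knots) @ coef, 0.0, None)
    return pred / pred.sum(axis=1, keepdims=True)

print(predict_topics([0, 1], [1.2, 4.8]))
```

In this sketch the regression step is what lets topics be predicted for documents that were not part of the factorization, which mirrors the abstract's claim of a generative model for new documents; a real application would need a principled choice of the number of topics, the smoother, and the covariates.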