CMStatistics 2022
B2035
Title: A text similarity-based algorithm for seed word generation in improving document classification
Authors: Morteza Namvar - The University of Queensland (Australia) [presenting]
Celeste Li - The University of Queensland (Australia)
Abstract: Topic modelling techniques typically use document-level co-occurrence information to group semantically related words into a single cluster or topic. Since the objective of these models is to maximize the probability of the observed data, the identified topics tend to explain only the most obvious aspects of a corpus and do not necessarily represent a construct. Interactive topic modelling techniques can be used as an alternative to unsupervised ones, as they address these issues by building topics around initial seed words. Because the performance of these interactive techniques depends heavily on the initial seed words, we propose a way to use text features to generate seed words for interactive topic models. Specifically, we propose a method for seed word vector (SWV) generation. We provide initial SWVs for interactive topic modelling through qualitative content analysis. Then, over several iterations, our algorithm updates the SWVs from the corpus by considering document similarity. Combined with interactive topic modelling, our SWV generation method yields a probability vector for each document in the corpus, indicating its relevance to the study constructs. To test the validity and practical applicability of the proposed method, we investigate the post-adoption use of contact tracing mobile applications during the COVID-19 pandemic. The results show a significant improvement in topic modelling using the generated SWVs.
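The abstract does not specify how the SWVs are updated, so the following is only a minimal sketch of one way an iterative, similarity-driven seed word update could look. Everything in it is an assumption made for illustration rather than the authors' algorithm: the bag-of-words representation, cosine similarity as the document similarity measure, the frequency-based ranking of candidate words, and the hypothetical names `cosine`, `expand_seed_words`, `top_docs`, and `new_per_iter`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_seed_words(docs, seeds, n_iter=3, top_docs=2, new_per_iter=2):
    """Grow a seed word list from the documents most similar to it.

    Assumed procedure (not taken from the abstract): in each iteration,
    treat the current seed list as a query, pick the `top_docs` most
    similar documents, and add their `new_per_iter` most frequent words
    that are not yet seeds.
    """
    bags = [Counter(d.lower().split()) for d in docs]
    seeds = list(seeds)
    for _ in range(n_iter):
        query = Counter(seeds)
        # Rank documents by similarity to the current seed list.
        scores = [cosine(query, bag) for bag in bags]
        ranked = sorted(range(len(bags)), key=lambda i: scores[i], reverse=True)
        nearest = [i for i in ranked[:top_docs] if scores[i] > 0]
        # Pool word counts from the nearest documents.
        pooled = Counter()
        for i in nearest:
            pooled.update(bags[i])
        # Most frequent unseen words first; ties broken alphabetically.
        candidates = sorted(
            ((w, c) for w, c in pooled.items() if w not in seeds),
            key=lambda wc: (-wc[1], wc[0]),
        )
        added = [w for w, _ in candidates[:new_per_iter]]
        if not added:
            break  # nothing new to add, so stop early
        seeds.extend(added)
    return seeds
```

In this sketch the initial `seeds` list stands in for the SWVs obtained through qualitative content analysis, with one list per study construct. A real pipeline would likely also remove stop words and use a weighted representation such as TF-IDF, neither of which is shown here.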