Title: Measuring and comparing statistically the importance of terms in documents based on topic modelling
Authors: Louisa Kontoghiorghes - King's College London (United Kingdom) [presenting]
Ana Colubi - King's College London (United Kingdom)
Abstract: Topic modelling is a well-known text mining technique for identifying the themes covered in a set of documents. Quantifying the importance of a topic, known as topic prevalence, is essential in this area. However, tracing topics across sets of documents, or over time, suffers from a lack of identifiability, since topics estimated from different sets of documents need not correspond to one another. The proposal is to build a new prevalence metric that focuses on keywords instead of topics. The new metric is still based on topic modelling, and it involves the topics related to the terms under consideration. The keywords can be predetermined or automatically extracted from previous documents or topic models. The suggested approach overcomes the identifiability problem and makes it possible to test statistically for changes in keyword and topic prevalence. Thus, as a step forward, statistical hypothesis tests for this setting are developed. Given the complexity of the parametric distributions involved, a distribution-free bootstrap approach is suggested. The methodology is applied to analyze changes in the main themes of the CMStatistics conference.
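Illustration: the abstract does not specify the prevalence metric or the test statistic, so the following Python sketch is only one possible reading of the idea, not the authors' method. It assumes that a fitted topic model supplies a document-topic matrix `theta` and a topic-term matrix `phi` (both names are hypothetical), takes the prevalence of a keyword in a document to be the topic-weighted probability of that term, and compares the mean prevalence in two sets of documents with a simple two-sample bootstrap. The data are synthetic.

```python
import numpy as np


def keyword_prevalence(theta, phi, word_idx):
    """Per-document prevalence of one keyword (an assumed definition).

    theta: (n_docs, n_topics) document-topic proportions, rows sum to 1.
    phi:   (n_topics, n_terms) topic-term probabilities, rows sum to 1.
    Returns sum_k theta[d, k] * phi[k, word_idx] for every document d.
    """
    return theta @ phi[:, word_idx]


def bootstrap_diff_test(x, y, n_boot=2000, seed=0):
    """Two-sided bootstrap test of H0: mean(x) == mean(y).

    Both samples are recentred on the pooled mean so that H0 holds in the
    bootstrap world; no parametric distribution is assumed.
    """
    rng = np.random.default_rng(seed)
    observed = y.mean() - x.mean()
    pooled_mean = np.concatenate([x, y]).mean()
    x0 = x - x.mean() + pooled_mean
    y0 = y - y.mean() + pooled_mean
    stats = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x0, size=x0.size, replace=True)
        yb = rng.choice(y0, size=y0.size, replace=True)
        stats[b] = yb.mean() - xb.mean()
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(np.abs(stats) >= abs(observed))) / (n_boot + 1)
    return observed, p_value


# Synthetic stand-in for a fitted topic model: 3 topics, 20 terms.
rng = np.random.default_rng(42)
phi = rng.dirichlet(np.ones(20), size=3)
theta_old = rng.dirichlet([5.0, 1.0, 1.0], size=60)  # earlier documents
theta_new = rng.dirichlet([1.0, 1.0, 5.0], size=60)  # later documents

prev_old = keyword_prevalence(theta_old, phi, word_idx=0)
prev_new = keyword_prevalence(theta_new, phi, word_idx=0)
diff, p = bootstrap_diff_test(prev_old, prev_new)
print(f"difference in mean prevalence: {diff:.4f}, bootstrap p-value: {p:.4f}")
```

Because the metric is attached to a term rather than to a topic label, the two sets of documents could in principle be modelled separately without having to match topics between the fits; the sketch uses a single `phi` only to keep the example short.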