Title: Variable selection in text regression: The case of short texts
Authors: Marzia Freo - University of Bologna (Italy) [presenting]
Alessandra Luati - University of Bologna (Italy)
Abstract: Communication through websites is often fast and characterised by short texts made of few words, such as titles, image captions, questions and answers, and tweets or posts on social media. The class of supervised learning methods for the analysis of short texts is explored. The aim is to assess the effectiveness of text data in the social sciences when they are used as explanatory variables in parametric regression models. We compare the results obtained by several variants of the lasso and by screening-based and randomization-based methods, such as sure independence screening and stability selection. A widely applied unsupervised learning method for topic modelling, Latent Dirichlet Allocation, is also considered. The perspective is primarily empirical, and our starting point is the analysis of two real datasets through bootstrap replications of each dataset. The first case study aims at explaining price variations based on the information contained in the descriptions of items on sale on e-commerce platforms, using the internet as a source of data. The second case study is concerned with the informative content of open questions inserted in a questionnaire on overall satisfaction ratings. The two case studies are different in nature and thus representative of different kinds of short texts: in the first application, a short, descriptive and objective text is considered, whereas in the second case study the short text is subjective and emotional.
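
As an illustration of the screening idea mentioned in the abstract, the following minimal sketch ranks the words of a bag-of-words representation by their absolute marginal correlation with a numeric response and keeps the top d terms, in the spirit of sure independence screening. It is not the authors' implementation: the function names (`bag_of_words`, `abs_correlation`, `screen_terms`), the whitespace tokenisation, and the toy item descriptions and prices are all hypothetical and chosen only to make the example self-contained.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, document-term count matrix) using whitespace tokens."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for w, count in Counter(t.lower().split()).items():
            row[index[w]] = count
        rows.append(row)
    return vocab, rows


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def screen_terms(texts, y, d):
    """Keep the d terms with the largest absolute marginal correlation with y."""
    vocab, X = bag_of_words(texts)
    scores = [
        (abs_correlation([row[j] for row in X], y), w)
        for j, w in enumerate(vocab)
    ]
    # Sort by decreasing score; ties are broken alphabetically.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [w for _, w in scores[:d]]


# Hypothetical toy data: short item descriptions and made-up prices.
texts = [
    "leather sofa new",
    "sofa used",
    "leather chair new",
    "chair used",
    "leather sofa used",
    "chair new",
]
prices = [900, 300, 500, 100, 700, 200]

print(screen_terms(texts, prices, 3))
```

In practice the retained terms would then enter a parametric regression, possibly with a lasso penalty, and the screening step could be repeated over bootstrap replications to check how stable the selected set is.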