CMStatistics 2022
B1602
Title: Robust and consistent estimation of word embeddings for Italian tweets
Authors: Elena Catanese - Istat (Italy)
Mauro Bruno - Istat (Italy) [presenting]
Abstract: Word embeddings (WEs) are a popular class of word-representation models used in many natural language processing tasks. They provide a low-dimensional, dense vector representation of each word. WEs are usually generated from a large corpus of structured text, e.g., Wikipedia. Tweets, by contrast, are short and noisy, and have lexical and semantic features that differ from those of other types of text. To our knowledge, no extensive robustness analysis of WEs on Twitter data has been carried out. Several WE methods exist, such as Word2Vec, fastText and GloVe, and each requires tuning several hyperparameters. Istat currently uses a Word2Vec approach to perform thematic analysis on Twitter data. We measure the stability of each WE method by varying a pre-defined set of hyperparameters. Solving word analogies is another popular benchmark for WEs, based on the assumption that linear relations between word pairs are indicative of the quality of the embedding. We therefore compare the performance of the above-mentioned methods on the prediction of specific analogies related to the economy. Insight into the robustness of WE models and into the consistency of thematic focuses in Italian Twitter corpora will be provided, with the aim of producing complementary information for Official Statistics purposes.
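The two evaluation ideas mentioned in the abstract, analogy solving and stability under different hyperparameter settings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the toy three-dimensional vectors, the word list, the use of the common vector-offset (3CosAdd) rule for analogies, and the choice of mean Jaccard overlap of nearest-neighbour sets as a stability score. The abstract does not state which measures the authors actually use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(emb, query, k, exclude=()):
    """Return the k words in emb most similar to the query vector."""
    scored = [(cosine(query, vec), w) for w, vec in emb.items() if w not in exclude]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset rule b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    return nearest(emb, target, 1, exclude={a, b, c})[0]

def neighbour_overlap(emb1, emb2, k):
    """Mean Jaccard overlap of top-k neighbour sets over the shared vocabulary."""
    shared = sorted(set(emb1) & set(emb2))
    sub1 = {w: emb1[w] for w in shared}
    sub2 = {w: emb2[w] for w in shared}
    total = 0.0
    for w in shared:
        n1 = set(nearest(sub1, sub1[w], k, exclude={w}))
        n2 = set(nearest(sub2, sub2[w], k, exclude={w}))
        total += len(n1 & n2) / len(n1 | n2)
    return total / len(shared)

# Hypothetical toy embeddings standing in for two training runs that
# differ only in their hyperparameters (values invented for illustration).
run_a = {
    "italia":  [1.0, 0.0, 0.2],
    "euro":    [1.0, 1.0, 0.2],
    "usa":     [0.0, 0.1, 1.0],
    "dollaro": [0.0, 1.1, 1.0],
    "lavoro":  [-1.0, 0.2, -0.5],
    "banca":   [0.5, 1.0, 0.6],
}
run_b = {
    "italia":  [0.9, 0.1, 0.3],
    "euro":    [0.8, 1.0, 0.1],
    "usa":     [0.1, 0.0, 1.0],
    "dollaro": [0.2, 1.0, 0.9],
    "lavoro":  [-0.9, 0.3, -0.4],
    "banca":   [0.6, 0.9, 0.2],
}

print(solve_analogy(run_a, "italia", "euro", "usa"))
print(neighbour_overlap(run_a, run_b, k=2))
```

With real models, the dictionaries would be replaced by the vocabularies and vectors of embeddings trained on the tweet corpus under different settings; a score near 1 would indicate that a word's neighbourhood is largely unchanged between runs.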