CMStatistics 2022
Title: Using web site text to identify different types of companies
Authors: Piet Daas - Eindhoven University of Technology (Netherlands) [presenting]
Abstract: Different types of companies are identified based on differences in the texts on their websites. This approach has been used to identify innovative and platform economy companies in the Netherlands and drone companies in several European countries. Usually, an initial test is performed to determine whether, and by how much, the website texts for the topic studied actually differ. For this, at least 2000 company website texts, comprising 50% positive and 50% negative cases, are routinely used. Survey data or expert findings are used to determine the actual type of each company. Next, the website texts are preprocessed, and various classification algorithms from the Python scikit-learn library are applied to determine which of them best discriminates between the positive and negative cases, e.g. platform vs non-platform. In addition, the effect of adding features based on word embeddings is routinely tested. We found that logistic regression with word embeddings worked best to detect innovation (accuracy 88%), a linear SVM worked best for platform economy websites (accuracy 82%), and logistic regression worked best to detect drone companies (accuracy 82-86%). The results are usually used to select a small subset of companies of the type studied, which are then investigated further.
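The comparison step described in the abstract might look roughly like the sketch below. It is only an illustration under stated assumptions, not the author's actual pipeline: the twelve toy texts, the labels, the TF-IDF features, the English stop-word list, the 3-fold cross-validation, and the helper name `compare_classifiers` are all hypothetical choices made here. A real run would use the roughly 2000 labelled website texts mentioned above, and the word-embedding features are left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for scraped website texts.
# Positive = "platform" companies, negative = "non-platform" companies.
positive = [
    "book a ride with independent drivers through our app",
    "our marketplace connects freelancers with clients",
    "hosts list their homes and guests book them online",
    "order food from local restaurants delivered by couriers",
    "sellers and buyers meet on our online marketplace",
    "find a tutor on our platform and pay through the app",
]
negative = [
    "we manufacture steel components for the automotive industry",
    "family bakery selling fresh bread since 1950",
    "our law firm advises clients on corporate tax",
    "we install solar panels on residential roofs",
    "dental practice offering check-ups and orthodontics",
    "wholesale supplier of garden tools and machinery",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)  # balanced 50/50 sample


def compare_classifiers(texts, labels, n_splits=3):
    """Return the mean cross-validated accuracy per candidate classifier."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = {}
    for name, classifier in candidates.items():
        # Preprocessing (lowercasing, stop-word removal, TF-IDF weighting)
        # sits inside the pipeline so it is refit on each training fold.
        pipeline = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words="english"),
            classifier,
        )
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=cv, scoring="accuracy"
        )
        scores[name] = float(fold_scores.mean())
    return scores


scores = compare_classifiers(texts, labels)
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Keeping the vectorizer inside the pipeline means no vocabulary from the held-out fold leaks into training. With a corpus this small the accuracies carry no meaning; only the structure of the comparison is relevant.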