Title: Website and text classification using natural language processing
Authors: Martin Wood - Office for National Statistics (United Kingdom) [presenting]
Abstract: Company websites provide copious information on the economic activity of businesses, which is useful when measuring the modern economy. This information is in the form of natural language, making automated harvesting difficult. We review our experience with different statistical models used to represent the documents (web pages) we wish to classify, and with the methods and algorithms we can then apply to these representations. It is necessary to find a representation that extracts information at the same level of abstraction as the concept we wish to classify or analyse within the document. For example, in determining whether a website is used for sales, Bag-of-Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) models allow classifiers to identify diagnostic keywords. Conversely, if we wish to identify more general business activity (e.g., mining), more sophisticated representations that characterise word associations, built using Latent Dirichlet Allocation (LDA) or word2vec, are required. These representations can be extended with part-of-speech tagging, n-grams, and metadata about where the text occurs within the website structure, and by combining representations into larger feature sets. In all cases, sophisticated classifiers such as Random Forests and Support Vector Machines, which are robust in the face of high-dimensional data, perform best in supervised classification of the data.
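
As a rough illustration of how a TF-IDF representation surfaces diagnostic keywords, the following minimal sketch computes TF-IDF weights in plain Python. The toy `pages` corpus, the whitespace tokenizer, and the particular weighting (relative term frequency multiplied by log(N / document frequency)) are assumptions made for this example; the abstract does not specify the exact variant or tooling used.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string stands in for the text of one web page.
pages = [
    "add to basket checkout secure payment delivery",
    "our history team and values",
    "buy now free delivery add to basket",
    "contact our team for mining services",
]


def tokenize(text):
    """Lower-case the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights


weights = tf_idf(pages)

# List each page's terms from most to least distinctive.
for page, page_weights in zip(pages, weights):
    ranked = sorted(page_weights, key=page_weights.get, reverse=True)
    print(page, "->", ranked[:3])
```

In this toy corpus, a term that occurs on only one page (such as "checkout") receives a higher weight than a term shared between pages (such as "basket"), which is the property a downstream classifier can exploit. A real pipeline would add proper tokenisation, stop-word handling, and a supervised classifier trained on the resulting feature vectors.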