CMStatistics 2017
B1656
Title: Exploring and improving document-to-vector embeddings
Authors: Erik Mathiesen - Octavia.ai (United Kingdom) [presenting]
Chan Ford - Queen Mary University (United Kingdom)
Abstract: With the invention of word2vec, GloVe, doc2vec and similar techniques, vector space embeddings and deep learning have become the de facto standard for modern NLP applications. Despite this, there are very few detailed studies of the performance and efficiency of the many embedding techniques available, especially when applied to real-world tasks. Instead, the method used for text-to-vector embedding is often chosen by rule of thumb or at random. While setting out to determine the preferred method for our particular application, we discovered several new techniques as well as a variety of interesting facts about the existing standard methods. Based on this work, we will present a detailed review and evaluation of a number of existing document embedding techniques, as well as several completely new techniques that emerged from our study. We will show how our novel methods perform in a range of scenarios compared to the commonly used techniques, and draw conclusions about when each should be utilised.