CMStatistics 2020
View Submission - CFE
Title: Multilingual entity resolution with semi-supervised machine learning techniques: An application to KYC
Authors: Ilgiz Mustafin - Innopolis University (Russia) [presenting]
George Emelyanov - Schwarzthal Tech (Russia)
Marius Frunza - Schwarzthal Tech (United Kingdom)
Abstract: The aim is to explore several entity resolution strategies that can be used when datasets are recorded in different languages. When trying to link records from different datasets, we encounter two main issues. On the one hand, matching record names across languages is a significant challenge because transliteration is not a well-defined bijective function: one name can have several valid renderings, and different names can collapse to the same rendering. On the other hand, matching records solely on names generates a high number of false positives. For the first issue, we apply a strategy based on neural networks trained on a predefined dataset, supplemented with a set of expert-based rules. We address the second issue with a Bayesian approach that takes into account the a priori distribution of name frequencies in our datasets. Our approach is tested on business registry datasets extracted from different sources, such as government registers and leaked documents. The most notable difference between these sources is the way names are rendered in the languages of the documents. The existing techniques are extended, and a method is presented for finding matches in business networks effectively. The results of this research can be used for coupling national business registers containing local data about companies and owners recorded in multiple languages. Thus, the suggested approach can improve the frameworks currently used in the Know Your Client (KYC) process.
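The abstract does not specify the Bayesian model, so the following is only a minimal illustrative sketch of how name frequencies could temper name-only matches, not the authors' method. It uses a simple frequency-based Bayes update in the spirit of classical probabilistic record linkage. The function names, the toy registry, and the parameters `prior_match` and `agree_given_match` are all hypothetical assumptions introduced for illustration.

```python
from collections import Counter


def name_frequencies(names):
    """Relative frequency of each name in a (toy) registry."""
    counts = Counter(names)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def match_posterior(name_freq, prior_match=0.01, agree_given_match=0.95):
    """Posterior probability that two records sharing a name are the same entity.

    Assumptions (hypothetical, for illustration only):
      prior_match       -- P(same entity) before looking at the names
      agree_given_match -- P(names agree | same entity)
      name_freq         -- P(names agree | different entities), approximated
                           by the name's relative frequency in the registry
    """
    numerator = agree_given_match * prior_match
    denominator = numerator + name_freq * (1.0 - prior_match)
    return numerator / denominator


# Toy registry with made-up names: one very common, one rare.
registry = ["ivanov"] * 50 + ["smith"] * 45 + ["zubrowski"] * 5
freqs = name_frequencies(registry)

common = match_posterior(freqs["ivanov"])    # name shared by many records
rare = match_posterior(freqs["zubrowski"])   # name shared by few records
print(f"common name posterior: {common:.3f}")
print(f"rare name posterior:   {rare:.3f}")
```

Under these assumptions, agreement on a rare name yields a higher posterior than agreement on a common one, which is the intuition behind using name frequencies to reduce false positives.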