Title: Modelling unbalanced catastrophic health expenditure data
Authors: Songul Cinaroglu - Hacettepe University (Turkey) [presenting]
Abstract: Traditional parametric statistical learning methods, such as logistic regression (LR), perform poorly when predicting class-imbalanced data. Random Forest (RF) is an algorithmic statistical method that can deal with unbalanced data. We compare the performance of LR and RF classifiers in predicting which households face catastrophic out-of-pocket (OOP) health expenditure, using a balanced oversampling procedure. The data are nationally representative household budget survey data from the Turkish Statistical Institute for 2012, with valid surveys for 9987 households. The WHO's methodology was used to calculate catastrophic OOP health expenditure. The degree of imbalance is high: the percentage of households faced with catastrophic OOP health expenditure is 0.14\%. The LR and RF models are compared using eight common risk factors. Balanced oversampling was applied and 31 artificial datasets were generated, ranging from 5\% to 98\% of the original data size. Accuracy, sensitivity, specificity, precision and F-measure were used to evaluate the classifiers, and ROC curves were used to compare the performance of the classification models. Models fitted to the balanced oversampled data give more accurate predictions, and RF is superior at identifying households faced with catastrophic OOP health expenditure.
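As an illustration of the two ingredients named in the abstract, the sketch below shows a simple random oversampling step that balances a binary outcome, together with the evaluation measures computed from a confusion matrix. It is a minimal sketch and not the authors' procedure: the function names `oversample_balanced` and `classification_metrics` are hypothetical, the oversampling is plain duplication of minority-class rows with replacement, and the 31 artificial datasets of varying size are not reproduced.

```python
import random


def oversample_balanced(rows, labels, seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size.

    Assumption: labels are 0/1 and both classes are present.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw minority indices with replacement to close the gap to the majority.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F-measure for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 when a measure is undefined (empty denominator).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    # Toy data: 2 positive and 8 negative households (illustrative only).
    rows = [[i] for i in range(10)]
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    _, balanced_labels = oversample_balanced(rows, labels)
    print(sum(balanced_labels), len(balanced_labels))
    # A classifier that always predicts the majority class.
    print(classification_metrics(labels, [0] * 10))
```

On the toy data, a classifier that always predicts the majority class is right for 8 of 10 households yet identifies no positive case, which is why sensitivity, precision and F-measure are reported alongside accuracy when the classes are unbalanced.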