Title: Sample size and predictive performance of machine learning methods with survival data.
Authors: Federico Ambrogi - University of Milan (Italy)
Rosalba Miceli - Fondazione IRCCS Istituto Nazionale dei Tumori (Italy)
Gabriele Infante - University of Milan (Italy) [presenting]
Abstract: Prediction models are increasingly developed and used in diagnostic and prognostic studies, where machine learning (ML) methods are becoming more popular than traditional regression techniques. For survival outcomes, the Cox proportional hazards model is widely used and has been shown to achieve good predictive performance when there are a few strong covariates. Its performance may be improved by including non-linearities, covariate interactions and time-varying effects, but these additions must be weighed carefully against the risk of overfitting during the model-building phase. ML techniques, on the other hand, can learn such complexities from the data, at the cost of hyper-parameter tuning and reduced interpretability. One aspect of special interest is the sample size needed to develop a survival prediction model. While guidance exists for traditional statistical models, it does not carry over to ML techniques. A time-to-event simulation framework is developed to evaluate the performance of Cox regression compared with, among others, tuned Random Survival Forest, Gradient Boosting and Neural Networks at varying sample sizes. The simulations are based on replications of subjects from publicly available databases, with event times simulated according to a Cox model with non-linear effects of continuous variables and time-varying effects. The SEER registry data are used for comparison with real-world data.
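As an illustration of the kind of data-generating step described above, the following is a minimal sketch of drawing event times from a Cox-type model with one non-linear covariate effect and one time-varying effect, using inverse-transform sampling of the cumulative hazard. It is not the authors' framework: the covariates here are synthetic rather than replicated from a database, and the constant baseline hazard, the quadratic term, the single change point, the administrative censoring time and all numeric values are assumptions made purely for illustration.

```python
import numpy as np

# All numeric values below are hypothetical, chosen only for illustration.
BASE_RATE = 0.1      # constant baseline hazard h0(t)
CHANGE_POINT = 2.0   # time at which the effect of z changes
BETA_X = 0.5         # coefficient on the non-linear term x**2
GAMMA_EARLY = 1.0    # log hazard ratio of z before CHANGE_POINT
GAMMA_LATE = 0.0     # log hazard ratio of z after CHANGE_POINT


def cumulative_hazard(t, x, z):
    """H(t | x, z) with a piecewise-constant time-varying effect of z."""
    base = BASE_RATE * np.exp(BETA_X * x**2)
    early = base * np.exp(GAMMA_EARLY * z) * np.minimum(t, CHANGE_POINT)
    late = base * np.exp(GAMMA_LATE * z) * np.maximum(t - CHANGE_POINT, 0.0)
    return early + late


def invert_cumulative_hazard(e, x, z):
    """Solve H(t | x, z) = e for t (inverse-transform sampling)."""
    base = BASE_RATE * np.exp(BETA_X * x**2)
    rate_early = base * np.exp(GAMMA_EARLY * z)
    rate_late = base * np.exp(GAMMA_LATE * z)
    h_at_change = rate_early * CHANGE_POINT  # hazard accumulated by the change point
    t_early = e / rate_early
    t_late = CHANGE_POINT + (e - h_at_change) / rate_late
    return np.where(e <= h_at_change, t_early, t_late)


def simulate(n, censor_time=10.0, seed=0):
    """Draw n subjects with synthetic covariates and censored event times."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)            # continuous covariate (non-linear effect)
    z = rng.integers(0, 2, size=n)    # binary covariate (time-varying effect)
    e = rng.exponential(size=n)       # unit-exponential draws, i.e. H(T) ~ Exp(1)
    event_time = invert_cumulative_hazard(e, x, z)
    time = np.minimum(event_time, censor_time)          # administrative censoring
    event = (event_time <= censor_time).astype(int)     # 1 = event, 0 = censored
    return x, z, time, event
```

Calling `simulate(n)` for a grid of values of `n` would give data sets on which competing models could be fitted and compared; a Cox model that omits the quadratic term and the change in the effect of `z` is misspecified for these data by construction.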