Developing and validating a survival prediction model for NSCLC patients through distributed learning across three countries

Arthur Jochems, Timo M. Deist, Issam El Naqa, Marc Kessler, Chuck Mayo, Jackson Reeves, Shruti Jolly, Martha Matuszak, Randall Ten Haken, Johan van Soest, Cary Oberije, Corinne Faivre-Finn, Gareth Price, Dirk de Ruysscher, Philippe Lambin, Andre Dekker



Tools for survival prediction for non-small cell lung cancer (NSCLC) patients treated with (chemo)radiotherapy are of limited quality. In this work, we develop a predictive model of survival at two years based on a large volume of historical patient data, as a proof of concept, using a distributed learning approach.

Patients and methods

Clinical data from 698 lung cancer patients treated with curative intent with chemoradiation (CRT) or radiotherapy (RT) alone were collected and stored at two different cancer institutes: 559 patients at Maastro clinic (Netherlands) and 139 at the University of Manchester (UK). The model was further validated on 196 patients from the University of Michigan (USA).

A Bayesian network model was adapted for distributed learning (watch the animation). Two-year post-treatment survival was chosen as the endpoint. The Institute 1 cohort data is publicly available and the developed models can be found at


Variables included in the final model were T and N stage, age, performance status, and total tumor dose. The model has an AUC of 0.66 on the external validation set and an AUC of 0.62 in 5-fold cross-validation. A model based on T and N stage achieved an AUC of 0.47 on the validation set, significantly worse than our model (P<0.001). High-risk and low-risk groups can be identified using the model presented in this study; these groups have significantly different overall survival (P<0.01).
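To make the AUC figures above easier to interpret: the AUC can be read as the probability that a randomly chosen patient who survived two years receives a higher predicted survival probability than a randomly chosen patient who did not. The sketch below computes it by direct pairwise comparison. The function name and the toy labels and scores are hypothetical illustrations, not the study data or the authors' code.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5).

    labels: 1 = alive at two years, 0 = deceased (assumed coding).
    scores: predicted probability of two-year survival.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2]
toy_auc = auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

An AUC of 0.5 corresponds to random ranking, which is why the 0.47 reported for the stage-based model indicates essentially no discriminative ability on the validation set.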


Distributed learning from federated databases allows learning of predictive models on data originating from multiple institutions while avoiding many of the data sharing barriers. We believe that distributed learning is the future of sharing data in health care.
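As a rough illustration of how a model can be learned without patient-level data leaving an institution, the sketch below estimates one conditional probability table of a discrete Bayesian network from per-site counts. It assumes a fixed network structure and Laplace smoothing. It is a simplified sketch, not the authors' implementation, and all names (`local_counts`, `pooled_cpt`, `t_stage`, `alive_2y`) and toy records are hypothetical.

```python
from collections import Counter


def local_counts(records, child, parents):
    """Run at each site: count (parent values, child value) combinations.

    Only these aggregate counts are shared, never individual patient rows.
    """
    counts = Counter()
    for r in records:
        key = (tuple(r[p] for p in parents), r[child])
        counts[key] += 1
    return counts


def pooled_cpt(site_counts, child_values, alpha=1.0):
    """Run centrally: sum the site counts and normalise to P(child | parents).

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    total = Counter()
    for c in site_counts:
        total.update(c)  # Counter.update adds counts key by key
    parent_configs = {pc for pc, _ in total}
    cpt = {}
    for pc in parent_configs:
        denom = sum(total[(pc, v)] + alpha for v in child_values)
        cpt[pc] = {v: (total[(pc, v)] + alpha) / denom for v in child_values}
    return cpt


# Hypothetical toy records held at two separate sites.
site_a = [
    {"t_stage": "T1", "alive_2y": 1},
    {"t_stage": "T1", "alive_2y": 0},
    {"t_stage": "T4", "alive_2y": 0},
]
site_b = [
    {"t_stage": "T1", "alive_2y": 1},
    {"t_stage": "T4", "alive_2y": 0},
]

counts_a = local_counts(site_a, "alive_2y", ["t_stage"])
counts_b = local_counts(site_b, "alive_2y", ["t_stage"])
cpt = pooled_cpt([counts_a, counts_b], child_values=[0, 1])
print(cpt[("T1",)])
```

Because the counts are additive, the pooled table is identical to the one that would be obtained by first merging all records in one place, which is the property that makes this kind of parameter learning straightforward to distribute.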


File: Jochems-2017-MaastroDataUnbinned.csv (62.76 KB)