The remainder of this paper is organized as follows. Section 2 introduces related work on the application of machine learning methods to cancer survivability prediction and on ensemble learning methods. Section 3 presents the details of the proposed two-stage model. Section 4 describes data preparation, experimental design, and experimental results, together with a discussion of the results. Section 5 concludes the paper and suggests future research.
2. Related work
A large body of research has studied the prediction of cancer survivability using data-driven models. This section reviews research on cancer survival prediction with machine learning methods, followed by a brief introduction to ensemble learning models for improving prediction performance.
2.1. Machine learning in cancer survival prediction
Survival analysis is an important task in cancer prognosis, and it can be carried out with methods and techniques based on the historical clinical data of patients. Machine learning methods are now widely used to construct prediction models in cancer survivability research in order to achieve more efficient and accurate medical decision-making.
Several studies have been dedicated to improving the accuracy of cancer survivability prediction. In 1994, Burke compared a few statistical models with the TNM staging system, which has been used to predict the outcomes of cancer patients since the early 1960s. The accuracy of several statistical models was much higher than that of the TNM staging system in predicting five-year breast cancer survivability. With the increasing use of machine learning techniques, subsequent studies mainly focused on comparing machine learning methods with statistical methods, to demonstrate that machine learning methods can predict cancer survivability more effectively. Ali et al. summarized various data mining and machine learning methods that could be used to predict breast cancer survivability. Their results showed that the accuracy of these methods was generally higher than that of traditional statistics-based systems. Delen applied three popular machine learning methods (decision tree, artificial neural network, and support vector machine) and one of the most commonly used statistical methods, logistic regression, to predict prostate cancer survivability. The experimental results showed that support vector machines exhibited the highest prediction accuracy, followed by decision trees and artificial neural networks, while logistic regression performed the worst.
Hybrid models have also been proposed to improve the prediction accuracy of cancer survivability. Khan et al. analyzed the feasibility of predicting cancer patients' survivability using a fuzzy-logic-based classifier, and combined fuzzy set theory with the decision tree to construct a weighted fuzzy decision tree (wFDT). Their results on the SEER breast cancer datasets showed that the prediction performance of wFDT was better than that of the decision tree. Wang et al. combined the synthetic minority oversampling technique (SMOTE) with a particle swarm optimization (PSO) algorithm and one of three classification algorithms (logistic regression, k-nearest neighbor (KNN), and decision tree) to form a new classification method for breast cancer survival prediction. In this method, SMOTE rebalanced the original imbalanced data and PSO performed feature selection. Under 10-fold cross-validation, the combination of SMOTE, PSO, and the C5 decision tree showed the best classification performance in the experiment. This research showed that the hybrid algorithm could effectively improve the accuracy of breast cancer survivability prediction.
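To make the rebalancing step concrete, the following is a minimal sketch of the core idea behind SMOTE: a synthetic minority sample is created by interpolating between an existing minority sample and one of its nearest minority-class neighbours. It is not the implementation used by Wang et al.; the function name `smote`, the parameters `k` and `seed`, and the toy data are assumptions made for illustration, and the PSO feature selection and the classifier are omitted.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (illustrative sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to `base` first.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: two numeric features for the rare class.
minority_samples = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority_samples, n_synthetic=6)
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region already occupied by the minority data rather than simply duplicating records.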
In recent years, ensemble learning methods, which train a number of base learners and then combine their outputs with a certain strategy, have attracted attention because of their good performance. One popular ensemble learning method is random forest. Edeki et al. applied several methods to the breast cancer dataset from SEER, including logistic regression, decision tree, artificial neural network (multilayer perceptron), support vector machine, AdaBoost, bagging, and random forest. Their results indicated that the performance of each algorithm depends on the characteristics of the dataset, such as its size, quality, and data representation. Random forest was shown to predict cancer survivability more accurately. Boughorbel et al. compared several classification techniques for breast cancer prognosis and found that random forest achieved the best AUC performance compared with boosted trees, a partial least squares model, a generalized linear model (GLM), a GLMNet, a support vector machine, a neural network, and KNN. Zolbanin et al. predicted overall survivability in comorbid cancers with prediction models generated by logistic regression, artificial neural network, decision tree, and random forest, and showed that random forest achieved the highest accuracy rate. AdaBoost is another popular ensemble method. Thongkam et al. applied three versions of AdaBoost (real AdaBoost, gentle AdaBoost, and modest AdaBoost) to predict breast cancer survivability, and modest AdaBoost had the highest accuracy both before and after pre-processing.
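The general ensemble recipe described above (train several base learners, then combine their outputs) can be sketched with bagging over decision stumps and a majority vote. This is an illustration only and does not reproduce any of the cited studies. The function names, the default of 15 learners, and the toy features (tumor size and number of positive lymph nodes) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split decision stump that minimises training errors.
    Returns (feature, threshold, left_label, right_label)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = (
                Counter(right).most_common(1)[0][0] if right else left_label
            )
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_bagging(X, y, n_learners=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners


def predict_bagging(learners, row):
    """Combine the base learners' outputs by majority vote."""
    votes = Counter(predict_stump(s, row) for s in learners)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (tumor size in mm, positive lymph nodes);
# label 1 = survived, 0 = did not survive.
X = [(10, 0), (12, 1), (15, 0), (18, 2), (20, 1),
     (30, 4), (35, 6), (40, 5), (45, 8), (50, 9)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

model = fit_bagging(X, y)
low_risk = predict_bagging(model, (8, 0))
high_risk = predict_bagging(model, (55, 10))
```

Random forest follows the same pattern with full decision trees and random feature subsets at each split, whereas AdaBoost trains its learners sequentially and weights their votes instead of counting them equally.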