PREDICTING LUNG CANCER USING EXPLAINABLE ARTIFICIAL INTELLIGENCE AND BORUTA-SHAP METHODS
Year 2024, 792 - 803, 03.09.2024
Erkan Akkur, Ahmet Cankat Öztürk
Abstract
Machine learning algorithms, which have become a popular approach to disease prediction in recent years, can also be used to predict lung cancer, a disease that can be fatal. This study proposes a machine learning-based model for predicting lung cancer. Five decision tree-based algorithms were used as classifiers. The experiments were conducted on a publicly available dataset of risk factors. The Boruta-SHAP approach was employed to identify the most salient features in the dataset, and experiments were run separately with all features and with the reduced feature set. Feature selection improved the prediction performance of the classifiers. Among all classifiers, the XGBoost algorithm achieved the best result, with an accuracy of 97.22% and an AUROC of 0.972. The proposed model achieves a good classification rate compared with similar studies in the literature. The SHAP (SHapley Additive exPlanations) approach was used to investigate how the risk factors in the dataset affect the model output. According to this analysis, allergy was the most significant risk factor for the disease.
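The idea behind the SHAP attributions mentioned above can be illustrated with a minimal sketch that computes exact Shapley values by enumerating every feature coalition. Everything in it is an assumption made for illustration: the toy scoring function `toy_model`, its coefficients, the feature names other than allergy, and the all-zero baseline are hypothetical and are not taken from the study. In practice, SHAP libraries estimate these values against a background dataset and use faster algorithms for tree models instead of brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when it joins coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical binary risk factors; only "allergy" is named in the abstract.
FEATURES = ["allergy", "factor_b", "factor_c"]
x = [1, 1, 0]          # hypothetical patient being explained
baseline = [0, 0, 0]   # hypothetical reference input (assumption)


def toy_model(v):
    # Illustrative score: linear terms plus one interaction term.
    # The coefficients are made up and do not come from the paper.
    return 0.4 * v[0] + 0.3 * v[1] + 0.2 * v[2] + 0.1 * v[0] * v[1]


def value_fn(coalition):
    # Features inside the coalition keep the patient's value;
    # features outside it are replaced by the baseline value.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(mixed)


phi = shapley_values(value_fn, len(x))
for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:.3f}")
```

The attributions sum to the difference between the model output for the patient and the output for the baseline, which is the additive property that lets SHAP rank features by their contribution to a prediction. Here the interaction term is split equally between the two features involved.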
Ethical Statement
This study used a publicly accessible dataset. Therefore, ethics committee approval is not required.