Abstract

Addressing class imbalance in classification problems is particularly challenging in medical datasets, where misclassifying minority-class samples can have serious repercussions. This study mitigates class imbalance in medical datasets through a hybrid approach that combines data-level, cost-sensitive, and ensemble methods. By assessing the performance (AUC-ROC, Sensitivity, F1-Score, and G-Mean) of 20 data-level and four cost-sensitive models on seventeen medical datasets (twelve small and five large), a hybridized model, SMOTE-RF-CS-LR, was devised. This model integrates the Synthetic Minority Oversampling Technique (SMOTE), the ensemble classifier Random Forest (RF), and Cost-Sensitive Logistic Regression (CS-LR). Tested across diverse imbalance ratios, the hybridized model achieved outstanding performance values on the majority of the datasets. Examination of its training duration and time complexity further confirmed its efficiency: it trains in under a second on each small dataset. The proposed hybridized model is therefore not only time-efficient but also robust in handling class imbalance, yielding outstanding classification results on medical datasets.

Keywords

Class Imbalance, Medical Data, Voting Ensemble, Machine Learning

