Abstract

Addressing class imbalance in classification problems is particularly challenging in medical datasets, where misclassifying minority-class samples can have serious repercussions. This study mitigates class imbalance in medical datasets through a hybrid approach that combines data-level, cost-sensitive, and ensemble methods. By assessing the performance (AUC-ROC, Sensitivity, F1-Score, and G-Mean) of 20 data-level and four cost-sensitive models on seventeen medical datasets (twelve small and five large), a hybridized model, SMOTE-RF-CS-LR, was devised. This model integrates the Synthetic Minority Oversampling Technique (SMOTE), the ensemble classifier Random Forest (RF), and Cost-Sensitive Logistic Regression (CS-LR). Tested across diverse imbalance ratios, the hybridized model achieved outstanding performance values on the majority of the datasets. Examination of its training duration and time complexity further confirmed its efficiency: it trains in under a second on each small dataset. The proposed hybridized model is therefore not only time-efficient but also robust in handling class imbalance, yielding outstanding classification results on medical datasets.

Keywords

Class Imbalance, Medical Data, Voting Ensemble, Machine Learning

