Abstract
Malware is a growing global threat that affects many industries. With new malware variants appearing every day, traditional signature-based methods cannot keep pace and fail to detect them. There is therefore a pressing need for more effective detection systems that protect people from losses caused by malware attacks. This paper detects malware using machine learning algorithms and compares their results. A suitable dataset was collected and cleaned by encoding its non-numeric data. Correlation-based feature selection was then applied to keep the essential features and drop independent features that were highly correlated with each other. The dataset was divided into training and testing sets. Random Forest, Naïve Bayes, Logistic Regression, and Support Vector Machine models were trained on the training set and then evaluated on the testing set. Accuracy, true positive rate, false positive rate, F1-score, precision, the ROC curve, and the confusion matrix were used to analyze and compare the performance of each algorithm. Random Forest achieved the highest accuracy at 99%, followed by Support Vector Machine at 94%. In contrast, Naïve Bayes and Logistic Regression performed poorly, achieving only 63% and 52% accuracy, respectively. The results of this research can strengthen cybersecurity systems through improved malware detection and help protect businesses and individuals from the exposure of sensitive information, such as identity documents, financial statements, and confidential proprietary information. They may also help avoid the operational interruptions, reputational harm, data loss, and large financial damage that can result from ransomware, corporate espionage, and service interruptions that span continents.
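Two steps named in the abstract, dropping features that are highly correlated with each other and deriving metrics from a confusion matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pearson`, `drop_correlated`, and `binary_metrics`, the 0.9 correlation threshold, and the toy data are all assumptions made for illustration, and only the Python standard library is used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as 0.
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature <= threshold.

    `columns` maps feature name -> list of values. The default threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics (1 = malware, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
        "confusion": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
    }


# Toy data (hypothetical): feature "b" is an exact multiple of "a".
features = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]}
kept = drop_correlated(features)
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0])
```

In this toy run, `kept` retains `a` and `c` because `b` duplicates the information in `a`. The greedy keep-first rule is only one of several reasonable ways to break ties between correlated features; the paper does not state which rule was used.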
Keywords
Malware detection, Correlation-based feature selection, Logistic Regression, Post-processing, Random Forest, Support Vector Machine

