Abstract

Malware is a severe and growing global threat that affects many industries. New malware variants appear every day, and traditional signature-based methods cannot keep pace and fail to detect them. More effective detection systems are therefore needed to protect people from the losses caused by malware attacks. This paper detects malware using machine learning algorithms and compares their results. A suitable dataset was collected and cleaned by encoding its non-numeric data. Correlation-based feature selection was then applied to retain the essential features and drop independent variables that were highly correlated with one another. The dataset was divided into training and testing sets. Four algorithms, Random Forest, Naïve Bayes, Logistic Regression, and Support Vector Machine, were trained on the training set and then evaluated on the testing set. Accuracy, true positive rate, false positive rate, F1-score, precision, the ROC curve, and the confusion matrix were used to analyze and compare the performance of each algorithm. Random Forest achieved the highest accuracy at 99%, followed by Support Vector Machine at 94%. In contrast, Naïve Bayes and Logistic Regression performed poorly, achieving only 63% and 52% accuracy, respectively. These results can help strengthen cybersecurity systems through improved malware detection. Better detection helps protect businesses and individuals from the exposure of sensitive information, such as identity documents, financial statements, and confidential proprietary information. It may also help avoid the operational interruptions, reputational harm, data loss, and massive financial damage that can result from ransomware, corporate espionage, and intercontinental service disruptions.
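Two steps of the pipeline described above, the correlation-based filtering of features and the computation of the evaluation metrics, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the 0.9 correlation threshold, the feature names, the toy data, and the function names `pearson`, `drop_correlated`, and `binary_metrics` are all assumptions made for illustration, and the filter shown is a simple pairwise variant that keeps the first feature of each highly correlated pair.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sd_x * sd_y)


def drop_correlated(columns, threshold=0.9):
    """Return feature names, keeping a feature only if its absolute
    correlation with every already-kept feature is <= threshold.

    `columns` maps a feature name to its list of values.
    The 0.9 default is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics (1 = malware, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical toy features: "b" duplicates the information in "a".
    features = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]}
    print(drop_correlated(features))

    # Hypothetical labels and predictions for eight samples.
    print(binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0]))
```

In practice the same steps would usually be carried out with a data-analysis library, and the trained classifiers would supply the predictions passed to `binary_metrics`; the sketch only makes the arithmetic behind the reported metrics explicit.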

Keywords

Malware detection, Correlation-based feature selection, Logistic Regression, Post-processing, Random Forest, Support Vector Machine
