Improving Machine-Learning Malware Detection Through IQR-Based Feature Reduction
DOI: 10.33395/sinkron.v10i1.15634
Keywords: Malware, Machine Learning, Interquartile Range, XGBoost, Malware Detection
Abstract
Malware detection remains a significant challenge in cybersecurity because threats are complex and continually evolving. This study evaluates the effectiveness of machine learning algorithms, specifically XGBoost and LightGBM, in detecting malware. The approach comprises data cleaning, normalization, and feature selection, with the Interquartile Range (IQR) technique used to select relevant features. The initial dataset contained 21,752 files, evenly split between malicious and benign. After data cleaning, 19,256 samples remained, and applying IQR reduced their large feature set. XGBoost outperformed the other algorithms, achieving 99.20% accuracy with IQR compared with 98.99% without it. The IQR technique enhances data quality by filtering features on the basis of significant differences between malware and benign files, which improves model performance. Reducing the feature set also helps prevent overfitting and strengthens the model's ability to generalize. The study concludes that machine learning, particularly with algorithms such as XGBoost and LightGBM, can effectively improve malware detection, and that IQR-based feature selection reduces false positives and increases detection efficiency. Future work will explore additional feature selection methods to further improve malware detection accuracy.
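The abstract does not spell out the exact IQR rule that was applied, so the following is only a minimal sketch of one plausible reading: a feature is kept when the interquartile intervals [Q1, Q3] of the benign and malware classes do not overlap. The function names (`iqr_bounds`, `select_features_by_iqr`), the non-overlap criterion, and the toy data are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np


def iqr_bounds(values):
    """Return the first and third quartiles (Q1, Q3) of a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return q1, q3


def select_features_by_iqr(X, y):
    """Return indices of columns whose class-wise [Q1, Q3] intervals do not overlap.

    X is an (n_samples, n_features) array; y holds 0 for benign and 1 for malware.
    The non-overlap rule is a hypothetical criterion, not the study's own.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = []
    for j in range(X.shape[1]):
        b_q1, b_q3 = iqr_bounds(X[y == 0, j])
        m_q1, m_q3 = iqr_bounds(X[y == 1, j])
        # Keep the feature only if one interval lies entirely above the other.
        if b_q3 < m_q1 or m_q3 < b_q1:
            keep.append(j)
    return keep


# Toy data (made up): column 0 separates the classes, column 1 does not.
n = 100
y = np.repeat([0, 1], n)
separating = np.concatenate([np.arange(n), np.arange(n) + 1000])
uninformative = np.concatenate([np.arange(n), np.arange(n)])
X = np.column_stack([separating, uninformative])

selected = select_features_by_iqr(X, y)
print(selected)
```

The reduced matrix `X[:, selected]` could then be passed to a classifier such as XGBoost or LightGBM. Other IQR-based rules, for example thresholding each feature's overall IQR, are equally consistent with the abstract.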
License
Copyright (c) 2026 Nurcahyo Fajar Setyanto, Rina Pramitasari, Jeki Kuswanto

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

