Integrating SMOTE with XGBoost for Robust Classification on Imbalanced Datasets: A Dual-Domain Evaluation
DOI: 10.33395/sinkron.v9i3.15029
Keywords: SMOTE, machine learning, imbalanced data, classification, XGBoost
Abstract
Class imbalance is one of the main challenges in classification problems, as it reduces a model's ability to correctly identify minority-class instances and undermines the overall reliability of its predictions. To address this problem, this study proposes an integrated approach that combines SMOTE (Synthetic Minority Oversampling Technique) with XGBoost to improve classification performance on imbalanced data, and evaluates the impact of oversampling on prediction accuracy and on the model's sensitivity to class distribution. The evaluation uses two public datasets from different domains and of different sizes, Spambase and Diabetes, to assess the effectiveness and generalizability of the approach. The experimental results show that the integrated model consistently outperforms traditional comparison algorithms, achieving an F1 score of 0.94 and a ROC-AUC of 0.98 on the Spambase dataset and a ROC-AUC of 0.83 on the Diabetes dataset, with a good balance between precision and recall. Ten-fold cross-validation was applied to obtain performance estimates that are not biased by a single random train-test split. The study also highlights the importance of selecting appropriate evaluation metrics for imbalanced data, since accuracy alone often gives a misleading picture of performance. Its main contribution is a benchmark of the effectiveness of SMOTE-XGBoost integration across two different datasets under rigorous cross-validation. These findings reinforce the combination of data-level preprocessing and ensemble learning as a competitive and adaptive solution to class imbalance in data-driven classification systems.
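A minimal sketch of the kind of pipeline the abstract describes, not the authors' actual code: SMOTE oversampling combined with an XGBoost classifier, evaluated with stratified 10-fold cross-validation on F1 and ROC-AUC. The synthetic imbalanced dataset and all hyperparameter values below are assumptions standing in for the Spambase and Diabetes data, whose loading and tuning details are not given in the abstract.

```python
# Sketch: SMOTE + XGBoost under stratified 10-fold cross-validation.
# The synthetic data below is a placeholder for Spambase / Diabetes.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.pipeline import Pipeline          # applies SMOTE inside each fold
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Placeholder imbalanced data (~10% minority class); replace with the real datasets.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# The imbalanced-learn Pipeline fits SMOTE only on the training portion of each
# fold, so synthetic samples never leak into the validation folds.
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("xgb", XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          eval_metric="logloss", random_state=42)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc"])

print(f"F1:      {scores['test_f1'].mean():.3f} ± {scores['test_f1'].std():.3f}")
print(f"ROC-AUC: {scores['test_roc_auc'].mean():.3f} ± {scores['test_roc_auc'].std():.3f}")
```

Keeping SMOTE inside the cross-validation pipeline, rather than oversampling the whole dataset first, is what makes the fold-level F1 and ROC-AUC estimates free of resampling leakage.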
License
Copyright (c) 2025 Novriadi Antonius Siagian, Sardo P Sipayung, Alex Rikki, Nasib Marbun

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.