Integrating SMOTE with XGBoost for Robust Classification on Imbalanced Datasets: A Dual-Domain Evaluation

Authors

  • Novriadi Antonius Siagian, Universitas Katolik Santo Thomas, Indonesia
  • Sardo P. Sipayung, Universitas Katolik Santo Thomas, Indonesia
  • Alex Rikki, Universitas Katolik Santo Thomas, Indonesia
  • Nasib Marbun, Universitas Negeri Manado, Indonesia

DOI:

10.33395/sinkron.v9i3.15029

Keywords:

SMOTE, machine learning, imbalanced data, classification, XGBoost

Abstract

Class imbalance is a central challenge in classification, as it reduces a model's ability to identify minority-class instances and undermines the overall reliability of predictions. To address this problem, this study proposes an integrated approach combining SMOTE and XGBoost to improve classification performance on imbalanced data, and evaluates the impact of oversampling on predictive accuracy and the model's sensitivity to class distribution. The evaluation uses two public datasets from different domains and of different sizes, Spambase and Diabetes, to assess the effectiveness and generalizability of the approach. The experimental results show that the integrated model consistently outperforms traditional baseline algorithms, achieving an F1-score of 0.94 and a ROC-AUC of 0.98 on the Spambase dataset and a ROC-AUC of 0.83 on the Diabetes dataset, with a good balance between precision and recall. Ten-fold cross-validation was applied to obtain objective performance estimates free from the bias of a single random train-test split. The study also highlights the importance of selecting appropriate evaluation metrics for imbalanced data, since accuracy alone often gives a misleading picture of performance. Its main contribution is a benchmark comparing the effectiveness of SMOTE-XGBoost integration across two distinct datasets under rigorous cross-validation. These findings reinforce the combination of data-level preprocessing and ensemble learning as a competitive and adaptive solution to class-imbalance challenges in data-driven classification systems.
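The core idea behind the SMOTE step described above, generating synthetic minority samples by interpolating between a minority point and one of its k nearest minority-class neighbours, can be sketched in plain Python. This is an illustrative sketch only, not the study's implementation; in practice one would use an established library such as imbalanced-learn, and the function name `smote`, the choice `k=3`, and the toy data points below are assumptions of this sketch:

```python
import random
import math

def smote(minority, n_synthetic, k=3, seed=42):
    """Sketch of SMOTE: create n_synthetic points, each interpolated
    between a random minority sample and one of its k nearest
    minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # new point lies on the segment between x and its neighbour
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

# Toy 2-D minority class; generate 4 synthetic samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=4)
print(len(new_points))
```

Because each synthetic point is a convex combination of two existing minority samples, it always falls inside the minority class's convex hull, which is why SMOTE densifies the minority region rather than introducing arbitrary noise. The rebalanced training set would then be passed to an XGBoost classifier, as the study describes.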


References

Abdul Bujang, S. D., Selamat, A., Krejcar, O., Mohamed, F., Cheng, L. K., Chiu, P. C., & Fujita, H. (2023). Imbalanced Classification Methods for Student Grade Prediction: A Systematic Literature Review. In IEEE Access (Vol. 11, pp. 1970–1989). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ACCESS.2022.3225404

Adeoti, B. E., B. O. K., & O. M. I. (2021). A Comprehensive Analysis of Handling Imbalanced Dataset. International Journal of Advanced Trends in Computer Science and Engineering, 10(2), 454–463. https://doi.org/10.30534/ijatcse/2021/031022021

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-17-August-2016, 785–794. https://doi.org/10.1145/2939672.2939785

Dablain, D., Krawczyk, B., & Chawla, N. V. (2023). DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE Transactions on Neural Networks and Learning Systems, 34(9), 6390–6404. https://doi.org/10.1109/TNNLS.2021.3136503

Devan, P., & Khare, N. (2020). An efficient XGBoost–DNN-based classification model for network intrusion detection system. Neural Computing and Applications, 32(16), 12499–12514. https://doi.org/10.1007/s00521-020-04708-x

Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Information Sciences, 501, 118–135. https://doi.org/10.1016/j.ins.2019.06.007

Elreedy, D., & Atiya, A. F. (2019). A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Information Sciences, 505, 32–64. https://doi.org/10.1016/j.ins.2019.07.070

Halim, A. M., Dwifebri, M., & Nhita, F. (2023). Handling Imbalanced Data Sets Using SMOTE and ADASYN to Improve Classification Performance of Ecoli Data Sets. Building of Informatics, Technology and Science (BITS), 5(1). https://doi.org/10.47065/bits.v5i1.3647

Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123. https://doi.org/10.1007/s10994-009-5119-5

Hasanah, U., Soleh, A. M., & Sadik, K. (2024). Effect of Random Under sampling, Oversampling, and SMOTE on the Performance of Cardiovascular Disease Prediction Models. Jurnal Matematika, Statistika Dan Komputasi, 21(1), 88–102. https://doi.org/10.20956/j.v21i1.35552

Husain, G., Nasef, D., Jose, R., Mayer, J., Bekbolatova, M., Devine, T., & Toma, M. (2025). SMOTE vs. SMOTEENN: A Study on the Performance of Resampling Algorithms for Addressing Class Imbalance in Regression Models. Algorithms, 18(1). https://doi.org/10.3390/a18010037

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.

Jiang, J., Zhang, C., Ke, L., Hayes, N., Zhu, Y., Qiu, H., Zhang, B., Zhou, T., & Wei, G. W. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. In Chemical Science (Vol. 16, Issue 18, pp. 7637–7658). Royal Society of Chemistry. https://doi.org/10.1039/d5sc00270b

Kim, B., & Kim, J. (2020). Adjusting decision boundary for class imbalanced learning. IEEE Access, 8, 81674–81685. https://doi.org/10.1109/ACCESS.2020.2991231

Nemade, B., Bharadi, V., Alegavi, S. S., & Marakarkandy, B. (2023). A Comprehensive Review: SMOTE-Based Oversampling Methods for Imbalanced Classification Techniques, Evaluation, and Result Comparisons. International Journal of Intelligent Systems and Applications in Engineering, (9s). www.ijisae.org

Nobre, J., & Neves, R. F. (2019). Combining Principal Component Analysis, Discrete Wavelet Transform and XGBoost to trade in the financial markets. Expert Systems with Applications, 125, 181–194. https://doi.org/10.1016/j.eswa.2019.01.083

Sharma, A., Singh, P. K., & Chandra, R. (2022). SMOTified-GAN for Class Imbalanced Pattern Classification Problems. IEEE Access, 10, 30655–30665. https://doi.org/10.1109/ACCESS.2022.3158977

Suandi, F., Anam, M. K., Firdaus, M. B., Fadli, S., Lathifah, L., Yumami, E., Saleh, A., & Hasibuan, A. Z. (2024). Enhancing Sentiment Analysis Performance Using SMOTE and Majority Voting in Machine Learning Algorithms (pp. 126–138). https://doi.org/10.2991/978-94-6463-620-8_10

Sugihartono, T., Wijaya, B., Marini, Alkayes, A. F., & Anugrah, H. A. (2025). Optimizing Stunting Detection through SMOTE and Machine Learning: a Comparative Study of XGBoost, Random Forest, SVM, and k-NN. Journal of Applied Data Sciences, 6(1), 667–682. https://doi.org/10.47738/jads.v6i1.494

Sun, J., Li, H., Fujita, H., Fu, B., & Ai, W. (2020). Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting. Information Fusion, 54, 128–144. https://doi.org/10.1016/j.inffus.2019.07.006

Ujaran, K., Ridwan, K., Hermaliani, E. H., & Ernawati, M. (2024). Penerapan Metode SMOTE Untuk Mengatasi Imbalanced Data Pada … [Application of the SMOTE Method to Address Imbalanced Data]. Computer Science (CO-SCIENCE), 4(1). http://jurnal.bsi.ac.id/index.php/co-science

Verbakel, J. Y., Steyerberg, E. W., Uno, H., De Cock, B., Wynants, L., Collins, G. S., & Van Calster, B. (2020). ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. Journal of Clinical Epidemiology, 126, 207–216. https://doi.org/10.1016/j.jclinepi.2020.01.028

Wang, C., Deng, C., & Wang, S. (2020). Imbalance-XGBoost: leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost. Pattern Recognition Letters, 136, 190–197. https://doi.org/10.1016/j.patrec.2020.05.035

Wang, S., Dai, Y., Shen, J., & Xuan, J. (2021). Research on expansion and classification of imbalanced data based on SMOTE algorithm. Scientific Reports, 11(1). https://doi.org/10.1038/s41598-021-03430-5

Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data. Journal of Biomedical Informatics, 107. https://doi.org/10.1016/j.jbi.2020.103465

Zhang, D., Qian, L., Mao, B., Huang, C., Huang, B., & Si, Y. (2018). A Data-Driven Design for Fault Detection of Wind Turbines Using Random Forests and XGboost. IEEE Access, 6, 21020–21031. https://doi.org/10.1109/ACCESS.2018.2818678


How to Cite

Siagian, N. A., Sipayung, S. P., Rikki, A., & Marbun, N. (2025). Integrating SMOTE with XGBoost for Robust Classification on Imbalanced Datasets: A Dual-Domain Evaluation. Sinkron: Jurnal dan Penelitian Teknik Informatika, 9(3), 1094–1107. https://doi.org/10.33395/sinkron.v9i3.15029