Comparing XGBoost and LightGBM for Optimizing Health Content Categories

Authors

  • Nanda Oktaviana, Sistem Informasi, Fakultas Teknologi Komunikasi dan Informatika, Universitas Nasional
  • Andrianingsih, Sistem Informasi, Fakultas Teknologi Komunikasi dan Informatika, Universitas Nasional

DOI:

10.33395/sinkron.v10i1.15545

Keywords:

Health Content Classification, IndoBERT, Indonesia Text Mining, LightGBM, XGBoost

Abstract

Indonesian social media platforms host a rapidly expanding flow of health-related information, much of it unverified and fragmented across major disease topics such as diabetes, heart disease, and hypertension. This study develops a supervised multi-class text-classification pipeline that feeds IndoBERT embeddings into LightGBM and XGBoost classifiers to identify the more effective model for disease-based health content categorization. Preprocessing includes data anonymization, normalization, tokenization, and contextual embedding extraction using pretrained IndoBERT; evaluation uses five-fold stratified cross-validation so that every fold preserves the class proportions of the full dataset. Performance is measured with accuracy, precision, recall, and macro F1-score, supported by confusion matrices. Results show that IndoBERT × LightGBM achieves the higher accuracy (0.8526) and macro F1 (0.85), outperforming IndoBERT × XGBoost (accuracy 0.8325, F1 0.81). Diagnostic results suggest that LightGBM’s leaf-wise tree growth improves generalization on short, noisy Indonesian texts. Feature-importance analysis highlights contextual terms such as “blood sugar,” “heart,” and “blood pressure” as key linguistic indicators contributing to model predictions. The workflow provides an explainable, scalable baseline for health text monitoring, misinformation detection, and public-health analytics in low-resource language settings.
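The evaluation protocol named in the abstract (stratified five-fold splitting, then accuracy and macro-averaged precision, recall, and F1 read off a confusion matrix) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the function names, the toy labels, and the toy predictions are all assumptions made for illustration, and the IndoBERT embedding step and the LightGBM/XGBoost classifiers are deliberately left out.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices with each class spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so per-class counts differ by at most one per fold.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def confusion_matrix(y_true, y_pred, classes):
    """Build a matrix whose rows are true classes and columns are predicted classes."""
    pos = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[pos[t]][pos[p]] += 1
    return matrix


def scores(matrix):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy data (hypothetical): 10 posts per disease topic.
classes = ["diabetes", "heart disease", "hypertension"]
labels = [c for c in classes for _ in range(10)]
folds = stratified_kfold(labels, k=5)
for fold_id, test_idx in enumerate(folds):
    per_class = {c: sum(labels[i] == c for i in test_idx) for c in classes}
    print(f"fold {fold_id}: {per_class}")

# Toy predictions (hypothetical) standing in for one classifier's output.
y_true = ["diabetes", "diabetes", "heart disease", "heart disease",
          "hypertension", "hypertension"]
y_pred = ["diabetes", "heart disease", "heart disease", "heart disease",
          "hypertension", "diabetes"]
toy_scores = scores(confusion_matrix(y_true, y_pred, classes))
print(toy_scores)
```

In a full pipeline, each fold's test indices would select the held-out embeddings, a classifier would be fitted on the remaining rows, and the per-fold scores would be averaged. Macro averaging weights every disease class equally, which is why it is commonly reported alongside accuracy for multi-class problems.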

How to Cite

Oktaviana, N., & Andrianingsih, A. (2026). Comparing XGBoost and LightGBM for Optimizing Health Content Categories. Sinkron: Jurnal dan Penelitian Teknik Informatika, 10(1). https://doi.org/10.33395/sinkron.v10i1.15545