Comparing XGBoost and LightGBM for Optimizing Health Content Categories
DOI: 10.33395/sinkron.v10i1.15545
Keywords: Health Content Classification, IndoBERT, Indonesia Text Mining, LightGBM, XGBoost
Abstract
Indonesian social media platforms host a rapidly expanding flow of health-related information, much of it unverified and fragmented across major disease topics such as diabetes, heart disease, and hypertension. This study develops a supervised multi-class text-classification pipeline that integrates IndoBERT embeddings with LightGBM and XGBoost to identify the most effective model for disease-based health content categorization. Preprocessing includes data anonymization, normalization, tokenization, and contextual embedding extraction using pretrained IndoBERT; evaluation employs five-fold stratified cross-validation to maintain class balance. Performance is measured through accuracy, precision, recall, and macro F1-score supported by confusion matrices. Results show that IndoBERT × LightGBM achieves the highest accuracy (0.8526) and balanced macro F1 (0.85), outperforming IndoBERT × XGBoost (accuracy 0.8325, F1 0.81). Diagnostic results indicate that LightGBM’s leaf-wise boosting structure improves generalization on short, noisy Indonesian texts. Feature-importance analysis highlights contextual terms such as “blood sugar,” “heart,” and “blood pressure” as key linguistic indicators contributing to model predictions. The workflow provides an explainable, scalable baseline for health text monitoring, misinformation detection, and public-health analytics in low-resource language settings.
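The evaluation protocol named in the abstract (five-fold stratified cross-validation scored with accuracy and macro F1) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the IndoBERT embedding step and the LightGBM and XGBoost training are omitted, the label names and toy data are hypothetical, and the round-robin fold assignment is a deterministic simplification of what a stratified splitter does.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each class,
    so every fold keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels standing in for the three disease categories.
labels = ["diabetes"] * 10 + ["heart_disease"] * 10 + ["hypertension"] * 5
folds = stratified_folds(labels, k=5)

for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [j for j in range(len(labels)) if j not in held_out]
    # A real run would fit a classifier on the embeddings at train_idx,
    # predict on test_idx, and pass the results to accuracy() and macro_f1().
    print(i, len(train_idx), dict(Counter(labels[j] for j in test_idx)))
```

In a full pipeline, each fold's held-out predictions from the two classifiers would be scored with the same two functions, and the per-fold scores averaged before comparing models.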
License
Copyright (c) 2025 Nanda Oktaviana, Andrianingsih

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

