Comparing XGBoost and LightGBM for Optimizing Health Content Categories
DOI:
10.33395/sinkron.v10i1.15545Keywords:
Health Content Classification, IndoBERT, Indonesia Text Mining, LightGBM, XGBoostAbstract
Indonesia’s social media platforms contain large amounts of unverified health information. Research on Indonesian health-text mining still rarely focuses on disease-based classification, leaving a gap compared with studies that only address sentiment or general topic categorization. This study proposes a multi-class classification approach that uses IndoBERT embeddings combined with gradient-boosting classifiers (XGBoost and LightGBM) to categorize tweets into diabetes, hypertension, and heart disease. The dataset comprises 4,075 tweets collected from platform X (Twitter). Preprocessing involves text cleaning, anonymization, normalization, and the extraction of 768-dimensional IndoBERT embeddings. Experiments are conducted in Google Colab (Intel Xeon CPU, 13 GB RAM, optional NVIDIA T4 GPU) using stratified five-fold cross-validation.The best results are obtained by the IndoBERT × LightGBM pipeline, which achieves an accuracy of 0.8526 and a macro-averaged F1-score of 0.8527, outperforming the IndoBERT × XGBoost model (accuracy 0.8325 and macro F1-score 0.8326). Feature-importance analysis shows that contextual terms related to blood sugar, the heart, and blood pressure strongly influence the predictions. Overall, the proposed method provides an effective baseline for monitoring health-related text and supporting disease-oriented analytics in Indonesian-language social media.
Downloads
References
Ahn, J. M., Kim, J., & Kim, K. (2023). Ensemble Machine Learning of Gradient Boosting (XGBoost, LightGBM, CatBoost) and Attention-Based CNN-LSTM for Harmful Algal Blooms Forecasting. Toxins, 15(10), 608. https://doi.org/10.3390/toxins15100608
Chen, M., Wu, Y., Wingerd, B., Liu, Z., Xu, J., Thakkar, S., Pedersen, T. J., Donnelly, T., Mann, N., Tong, W., Wolfinger, R. D., & Bao, W. (2024). Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1401810
Chen, T., & Guestrin, C. (2016). XGBoost. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
Demirtürk, D., Mintemur, Ö., & Arslan, A. (2025). Optimizing LightGBM and XGBoost Algorithms for Estimating Compressive Strength in High-Performance Concrete. Arabian Journal for Science and Engineering. https://doi.org/10.1007/s13369-025-10217-7
Hindarto, D., Rachmadi, R. F., Hariadi, M., & Damastuti, F. A. (2025). Contextual Awareness System for Landslide Risk Recommendation in Crypto-Spatial. 2025 International Electronics Symposium (IES), 700–706. https://doi.org/10.1109/IES67184.2025.11161195
Koto, F., Rahimi, A., Lau, J. H., & Baldwin, T. (2020). IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. Proceedings of the 28th International Conference on Computational Linguistics, 757–770. https://doi.org/10.18653/v1/2020.coling-main.66
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2(1), 56–67. https://doi.org/10.1038/s42256-019-0138-9
Suherman, E., Hindarto, D., Makmur, A., & Santoso, H. (2023). Comparison of Convolutional Neural Network and Artificial Neural Network for Rice Detection. Sinkron, 8(1), 247–255. https://doi.org/10.33395/sinkron.v8i1.11944
Ranković, N., Ranković, D., Ivanović, M., & Lukić, I. (2024). Explainable data mining model for hyperinsulinemia diagnostics. Connection Science, 36(1), 2325496. https://doi.org/10.1080/09540091.2024.2325496
Hindarto, D., Afarini, N., Informatika, P., Informasi, P. S., & Luhur, U. B. (2023). COMPARISON EFFICACY OF VGG16 AND VGG19 INSECT CLASSIFICATION. 6(3), 189–195. https://doi.org/10.33387/jiko.v6i3.7008
Hindarto, D., Rachmadi, R. F., Hariadi, M., & Damastuti, F. A. (2025). Contextual Awareness System for Landslide Risk Recommendation in Crypto-Spatial. 2025 International Electronics Symposium (IES), 700–706. https://doi.org/10.1109/IES67184.2025.11161195
Koto, F., Rahimi, A., Lau, J. H., & Baldwin, T. (2020). IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 757–770). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.66
Suherman, E., Hindarto, D., Makmur, A., & Santoso, H. (2023). Comparison of Convolutional Neural Network and Artificial Neural Network for Rice Detection. Sinkron, 8(1), 247–255. https://doi.org/10.33395/sinkron.v8i1.11944
Pringandana, C. G. L., & Kusnawi. (2025). A comparative analysis of hyperparameter-tuned XGBoost and LightGBM for multiclass rainfall classification in Jakarta. Jurnal Teknik Informatika (JUTIF), 6(4), 2467–2483. https://doi.org/10.52436/1.jutif.2025.6.4.4965
Liu, Y., & Chen, Z. (2025). LightGBM-based human action recognition using sensors. Sensors, 25(12), 3704. https://doi.org/10.3390/s25123704
Kabir, J., & Chakraborty, A. (2024). Exploring Explainable Artificial Intelligence: A Comparative Analysis of Interpretability Techniques. IJARCCE, 13(3)
Downloads
How to Cite
Issue
Section
License
Copyright (c) 2025 Nanda Oktaviana, Andrianingsih

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.






















Moraref
PKP Index
Indonesia OneSearch
OCLC Worldcat
Index Copernicus
Scilit
