Comparative Study of XGBoost, Random Forest, and Logistic Regression Models for Predicting Customer Interest in Vehicle Insurance

Authors

  • Gregorius Airlangga, Information System Study Program, Atma Jaya Catholic University of Indonesia, Indonesia

DOI:

10.33395/sinkron.v8i4.14194

Keywords:

Vehicle Insurance Prediction, Machine Learning, XGBoost, RandomForest, Logistic Regression

Abstract

In today’s competitive insurance market, accurately predicting customer interest in additional products, such as vehicle insurance, is crucial for optimizing marketing strategies and maximizing sales. This study presents a comparative analysis of three machine learning models (XGBoost, RandomForest, and Logistic Regression) for predicting customer interest in vehicle insurance from a dataset that includes demographic, vehicle, and policy-related features. The models were evaluated with five-fold cross-validation, and their performance was measured using AUC-ROC, precision, recall, and F1-score. XGBoost achieved the highest recall (0.9525) and AUC-ROC (0.7854), making it the most effective model for identifying customers interested in vehicle insurance, though at the expense of low precision (0.2585). RandomForest showed a more balanced trade-off between precision (0.3064) and recall (0.5341) but lower overall performance. Logistic Regression, while the most interpretable model, exhibited high variability in performance across folds and had the lowest average precision of the three models (0.2372). These findings indicate that XGBoost is well suited to maximizing recall in high-volume campaigns, while RandomForest may be better suited to applications that require fewer false positives. The results offer guidance on selecting a model according to business objectives and resource allocation.
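The precision and recall trade-off described in the abstract can be made concrete with a small calculation. The sketch below is illustrative only and is not the paper's code: it takes the precision and recall values quoted above and estimates how many customers each model would flag in order to reach a hypothetical pool of 1,000 genuinely interested customers. The names `ReportedScores`, `campaign_estimate`, and `INTERESTED` are assumptions introduced for this example, and the F1 value is derived from the averaged precision and recall, so it may differ from a fold-averaged F1 reported in the paper. Logistic Regression is left out because its recall is not quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass
class ReportedScores:
    """Precision and recall for one model, as quoted in the abstract."""
    precision: float
    recall: float


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def campaign_estimate(scores: ReportedScores, interested: int) -> dict:
    """Estimate campaign volume for a given number of interested customers.

    recall    = TP / interested  ->  TP = recall * interested
    precision = TP / contacted   ->  contacted = TP / precision
    """
    reached = scores.recall * interested
    contacted = reached / scores.precision
    return {
        "reached": reached,                       # interested customers flagged
        "contacted": contacted,                   # all customers flagged
        "wasted_contacts": contacted - reached,   # false positives
        "f1": f1_from(scores.precision, scores.recall),
    }


# Values taken from the abstract.
MODELS = {
    "XGBoost": ReportedScores(precision=0.2585, recall=0.9525),
    "RandomForest": ReportedScores(precision=0.3064, recall=0.5341),
}

# Hypothetical pool size, chosen only to make the numbers easy to read.
INTERESTED = 1000

if __name__ == "__main__":
    for name, scores in MODELS.items():
        est = campaign_estimate(scores, INTERESTED)
        print(
            f"{name}: reaches {est['reached']:.0f} of {INTERESTED}, "
            f"contacts {est['contacted']:.0f}, "
            f"false positives {est['wasted_contacts']:.0f}, "
            f"implied F1 {est['f1']:.3f}"
        )
```

Under these assumptions, XGBoost reaches far more of the interested customers but flags roughly twice as many customers in total, which is the pattern behind the abstract's recommendation of XGBoost for high-volume campaigns and RandomForest where false positives are costly.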

How to Cite

Airlangga, G. (2024). Comparative Study of XGBoost, Random Forest, and Logistic Regression Models for Predicting Customer Interest in Vehicle Insurance. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 8(4), 2542-2549. https://doi.org/10.33395/sinkron.v8i4.14194