Integrating Bayesian Optimization into Ensemble Logistic Regression for Explainable AI-Based Customer Behavior Analysis

Authors

  • Jeffry, Institut Teknologi Bacharuddin Jusuf Habibie
  • Azminuddin I. S. Azis, Institut Teknologi Bacharuddin Jusuf Habibie
  • Elisabeth Tri Juliana Kandakon, Institut Teknologi Bacharuddin Jusuf Habibie

DOI:

10.33395/sinkron.v9i4.15219

Keywords:

Customer Behavior, Ensemble Logistic Regression, Bayesian Optimization, Explainable AI, SHAP, Automotive Industry

Abstract

Understanding customer behavior is a strategic factor in business decision-making, particularly in the automotive sector, where competition is intense and product offerings are diverse. Whereas previous studies often rely on a limited set of demographic variables such as age and gender, this research advances the field by integrating ensemble logistic regression with Bayesian Optimization for hyperparameter tuning and SHAP-based interpretability. The proposed model incorporates features beyond demographics, including vehicle category, product type, vehicle year, dealer branch, and transaction source, to enhance predictive accuracy. The methodology involves data preprocessing through encoding and cleaning, class balancing using SMOTE combined with undersampling, and a stratified 80:20 train-test split. Baseline logistic regression achieved 80% accuracy, a ROC AUC of 0.89, precision of 0.47/0.96, recall of 0.84/0.79, and F1-scores of 0.59/0.89. Applying ensemble logistic regression with Bayesian Optimization improved performance to 84% accuracy, a ROC AUC of 0.92, precision of 0.51/0.98, recall of 0.83/0.84, and F1-scores of 0.63/0.92. SHAP analysis confirmed that the additional features contribute substantially to the prediction outcomes. The novelty of this study lies in combining ensemble logistic regression, Bayesian Optimization, and SHAP explainability in the automotive domain, offering improved accuracy as well as interpretability and fairness for business decision-making, and providing actionable insights for targeted marketing strategies and product management. Future studies may incorporate broader behavioral and transactional variables to capture more nuanced customer decision patterns.
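
For readers who want to reproduce the general workflow, the sketch below strings together the steps named in the abstract: SMOTE combined with random undersampling, a bagged logistic-regression ensemble, a Bayesian-style hyperparameter search (here via Optuna's default TPE sampler), and model-agnostic SHAP explanations. It is a minimal illustration on synthetic data, not the authors' implementation; the libraries (scikit-learn >= 1.2, imbalanced-learn, Optuna, shap), resampling ratios, and parameter ranges are assumptions.

    # Minimal sketch of the pipeline described in the abstract, on synthetic data.
    # Libraries, resampling ratios, and parameter ranges are illustrative assumptions,
    # not the authors' actual implementation.
    import optuna
    import shap
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_score, train_test_split

    # Stand-in for the encoded customer dataset (imbalanced binary target).
    X, y = make_classification(n_samples=5000, n_features=12,
                               weights=[0.85, 0.15], random_state=42)

    # Stratified 80:20 train-test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Class balancing on the training data only: SMOTE, then random undersampling.
    X_bal, y_bal = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
    X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.8, random_state=42).fit_resample(X_bal, y_bal)

    def objective(trial):
        # Tune the base learner's regularization and the ensemble size on cross-validated ROC AUC.
        model = BaggingClassifier(
            estimator=LogisticRegression(  # the 'estimator' keyword needs scikit-learn >= 1.2
                C=trial.suggest_float("C", 1e-3, 1e2, log=True), max_iter=1000),
            n_estimators=trial.suggest_int("n_estimators", 10, 100),
            random_state=42)
        return cross_val_score(model, X_bal, y_bal, cv=3, scoring="roc_auc").mean()

    # Optuna's default TPE sampler performs the Bayesian-style search over hyperparameters.
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=30)

    # Refit the tuned ensemble and evaluate it on the held-out test set.
    best = BaggingClassifier(
        estimator=LogisticRegression(C=study.best_params["C"], max_iter=1000),
        n_estimators=study.best_params["n_estimators"], random_state=42)
    best.fit(X_bal, y_bal)
    print("Test ROC AUC:", round(roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]), 3))

    # Model-agnostic SHAP values for the positive class (permutation explainer).
    explainer = shap.Explainer(lambda d: best.predict_proba(d)[:, 1], X_train[:100])
    shap_values = explainer(X_test[:100])
    shap.plots.beeswarm(shap_values)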

References

Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework (No. arXiv:1907.10902). arXiv. https://doi.org/10.48550/arXiv.1907.10902

Cendani, L. M., & Wibowo, A. (2022). Perbandingan Metode Ensemble Learning pada Klasifikasi Penyakit Diabetes. Jurnal Masyarakat Informatika, 13(1), 33–44. https://doi.org/10.14710/jmasif.13.1.42912

Chen, W., Yang, K., Yu, Z., Shi, Y., & Chen, C. L. P. (2024). A survey on imbalanced learning: Latest research, applications and future directions. Artificial Intelligence Review, 57(6), 137. https://doi.org/10.1007/s10462-024-10759-6

Danaher, D., Neale, W., McDonough, S., & Donaldson, D. (2019, April 2). Low Speed Override of Passenger Vehicles with Heavy Trucks. WCX SAE World Congress Experience. https://doi.org/10.4271/2019-01-0430

Frazier, P. I. (2018). A Tutorial on Bayesian Optimization (No. arXiv:1807.02811). arXiv. https://doi.org/10.48550/arXiv.1807.02811

GAIKINDO. (2023). Jumlah Kendaraan di Indonesia 147 Juta Unit, 60 Persen di Pulau Jawa. https://www.gaikindo.or.id/jumlah-kendaraan-di-indonesia-147-juta-unit-60-persen-di-pulau-jawa/

Ganaie, M. A., Hu, M., Malik, A. K., Tanveer, M., & Suganthan, P. N. (2022). Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, 115, 105151. https://doi.org/10.1016/j.engappai.2022.105151

González, S., García, S., Del Ser, J., Rokach, L., & Herrera, F. (2020). A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Information Fusion, 64, 205–237. https://doi.org/10.1016/j.inffus.2020.07.007

Gupta, N., Smith, J., Adlam, B., & Mariet, Z. (2022). Ensembling over Classifiers: A Bias-Variance Perspective (No. arXiv:2206.10566). arXiv. https://doi.org/10.48550/arXiv.2206.10566

Hariyanti, S., & Kristanti, D. (2024). Digital Transformation in MSMEs: An Overview of Challenges and Opportunities in Adopting Digital Technology. Jurnal Manajemen Bisnis, Akuntansi Dan Keuangan, 3(1), Article 1. https://doi.org/10.55927/jambak.v3i1.8766

He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969

Hillel, T., Bierlaire, M., Elshafie, M. Z. E. B., & Jin, Y. (2021). A systematic review of machine learning classification methodologies for modelling passenger mode choice. Journal of Choice Modelling, 38, 100221. https://doi.org/10.1016/j.jocm.2020.100221

Jafarzadeh, H., Mahdianpari, M., Gill, E., Mohammadimanesh, F., & Homayouni, S. (2021). Bagging and Boosting Ensemble Classifiers for Classification of Multispectral, Hyperspectral and PolSAR Data: A Comparative Evaluation. Remote Sensing, 13(21), Article 21. https://doi.org/10.3390/rs13214405

Jeffry, J., Usman, S., & Aziz, F. (2023). Analisis Perilaku Pelanggan menggunakan Metode Ensemble Logistic Regression. JURNAL TEKNOLOGI DAN ILMU KOMPUTER PRIMA (JUTIKOMP), 6(2), 90–97.

Ju, J., Lee, E., & Park, S. (2024). Comparative Analysis of Ensemble Machine Learning Models for Personalized In-Vehicle Infotainment Recommendation Systems. Adjunct Proceedings of the 16th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 45–50. https://doi.org/10.1145/3641308.3685021

Khan, A. A., Chaudhari, O., & Chandra, R. (2023). A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation (No. arXiv:2304.02858). arXiv. https://doi.org/10.48550/arXiv.2304.02858

Kotsiantis, S. B. (2014). Bagging and boosting variants for handling classifications problems: A survey. The Knowledge Engineering Review, 29(1), 78–100. https://doi.org/10.1017/S0269888913000313

Lee, M. (2024). Comparison of bagging, boosting, and stacking ensemble models for airline customer satisfaction analysis. FaST - Jurnal Sains Dan Teknologi (Journal of Science and Technology), 8(1), Article 1. https://doi.org/10.19166/jstfast.v8i1.8166

Levy, J. J., & O’Malley, A. J. (2020). Don’t dismiss logistic regression: The case for sensible extraction of interactions in the era of machine learning. BMC Medical Research Methodology, 20(1), 171. https://doi.org/10.1186/s12874-020-01046-3

Liu, S., Lun Ong, M., Kin Mun, K., Yao, J., & Motani, M. (2019, December 30). Early Prediction of Sepsis via SMOTE Upsampling and Mutual Information Based Downsampling. 2019 Computing in Cardiology Conference. https://doi.org/10.22489/CinC.2019.239

Lundberg, S., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions (No. arXiv:1705.07874). arXiv. https://doi.org/10.48550/arXiv.1705.07874

Manley, E., & Cheng, T. (2018). Exploring the role of spatial cognition in predicting urban traffic flow through agent-based modelling. Transportation Research Part A: Policy and Practice, 109, 14–23. https://doi.org/10.1016/j.tra.2018.01.020

Meysami, M., Kumar, V., Pugh, M., Lowery, S. T., Sur, S., Mondal, S., & Greene, J. M. (2023). Utilizing logistic regression to compare risk factors in disease modeling with imbalanced data: A case study in vitamin D and cancer incidence. Frontiers in Oncology, 13, 1227842. https://doi.org/10.3389/fonc.2023.1227842

Mohanty, P. K., Francis, S. A. J., Barik, R. K., Roy, D. S., & Saikia, M. J. (2024). Leveraging Shapley Additive Explanations for Feature Selection in Ensemble Models for Diabetes Prediction. Bioengineering, 11(12), 1215. https://doi.org/10.3390/bioengineering11121215

Mosca, E., Szigeti, F., Tragianni, S., Gallagher, D., & Groh, G. (2022). SHAP-Based Explanation Methods: A Review for NLP Interpretability. In N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, & S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 4593–4603). International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.406/

Ponce-Bobadilla, A. V., Schmitt, V., Maier, C. S., Mensing, S., & Stodtmann, S. (2024). Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clinical and Translational Science, 17(11), e70056. https://doi.org/10.1111/cts.70056

Runge, V. (2018). On the Limit Imbalanced Logistic Regression by Binary Predictors (No. arXiv:1703.08995). arXiv. https://doi.org/10.48550/arXiv.1703.08995

Salehi, F., Abbasi, E., & Hassibi, B. (2019). The Impact of Regularization on High-dimensional Logistic Regression (No. arXiv:1906.03761). arXiv. https://doi.org/10.48550/arXiv.1906.03761

Saran, N. A., & Nar, F. (2025). Fast binary logistic regression. PeerJ Computer Science, 11, e2579. https://doi.org/10.7717/peerj-cs.2579

Sen, D., Sachs, M., Lu, J., & Dunson, D. (2024). Efficient posterior sampling for high-dimensional imbalanced logistic regression (No. arXiv:1905.11232). arXiv. https://doi.org/10.48550/arXiv.1905.11232

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 25. https://papers.nips.cc/paper_files/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html

Tama, I. P., Tantrika, C. F. M., Hardiningtyas, D., & Mohamad, E. (2021). Review of Industry 4.0 Strategy and Organization Readiness Level of Automotive SME's in Indonesia. APMBA (Asia Pacific Management and Business Application), 9(3), 313–324. https://doi.org/10.21776/ub.apmba.2021.009.03.9

Tawil, A.-R., Mohamed, M., Schmoor, X., Vlachos, K., & Haidar, D. (2023). Trends and Challenges Towards an Effective Data-Driven Decision Making in UK SMEs: Case Studies and Lessons Learnt from the Analysis of 85 SMEs (No. arXiv:2305.15454). arXiv. https://doi.org/10.48550/arXiv.2305.15454

Wang, H., Liang, Q., Hancock, J. T., & Khoshgoftaar, T. M. (2024). Feature selection strategies: A comparative analysis of SHAP-value and importance-based methods. Journal of Big Data, 11(1), 44. https://doi.org/10.1186/s40537-024-00905-w

Yang, C., Fridgeirsson, E. A., Kors, J. A., Reps, J. M., & Rijnbeek, P. R. (2024). Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. Journal of Big Data, 11(1), 7. https://doi.org/10.1186/s40537-023-00857-7

Zhang, L., Geisler, T., Ray, H., & Xie, Y. (2021). Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function. Journal of Applied Statistics, 49(13), 3257–3277. https://doi.org/10.1080/02664763.2021.1939662

Zhang, L., Ray, H., Priestley, J., & Tan, S. (2020). A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data. Journal of Applied Statistics, 47(3), 568–581. https://doi.org/10.1080/02664763.2019.1643829

Zohra Sbai, T. V. (2025). Bayesian Optimized Boosted Ensemble models for HR Analytics—Adopting Green Human Resource Management Practices. International Journal of Technology, 16(2), 291–319. https://doi.org/10.14716/ijtech.v16i2.7277

How to Cite

Jeffry, J., Azis, A. I. S., & Kandakon, E. T. J. (2025). Integrating Bayesian Optimization into Ensemble Logistic Regression for Explainable AI-Based Customer Behavior Analysis. Sinkron: Jurnal dan Penelitian Teknik Informatika, 9(4), 1900-1911. https://doi.org/10.33395/sinkron.v9i4.15219