Stratified K-fold cross validation optimization on machine learning for prediction


  • Slamet Widodo Universitas Bina Sarana Informatika
  • Herlambang Brawijaya Universitas Bina Sarana Informatika
  • Samudi Samudi Universitas Nusa Mandiri




Cervical cancer is the second most common malignant tumor in women, with 341,000 deaths worldwide in 2020, almost 80% of which occurred in developing countries. One of its causes is infection with Human papillomavirus (HPV) types 16 and 18. The increasing incidence of cervical cancer in Indonesia means the disease must be treated seriously, because it is one of the main causes of death. In addition to the virus, external factors can contribute. The high mortality rate arises because patients typically become aware of cervical cancer only once it has reached the final stage. One way to reduce the number of sufferers is to implement cervical cancer detection. Early detection can also draw on external factors, such as behavior, intention, attitude, norms, perception, motivation, social support, and empowerment. However, the data used is imbalanced in the distribution of the target class, with more negative samples than positive ones. To address this, Stratified K-Fold Cross-Validation (SKCV) is used, so that each fold preserves the class proportions of the full dataset. Accuracy is evaluated with a confusion matrix to determine the performance of each model. The best accuracies of the five classification algorithms are 96 percent for Random Forest (RF), 94 percent for Logistic Regression (LR), 92 percent for XGBoost, 90 percent for K-Nearest Neighbors (KNN), and 88 percent for Naive Bayes (NB). The results show that the RF-based SKCV model achieves the highest accuracy among the models compared.
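To illustrate the idea behind stratified K-fold splitting and confusion-matrix accuracy, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (stratified_k_fold, confusion_matrix, accuracy), the round-robin assignment scheme, and the label counts (20 positive, 50 negative) are assumptions chosen only for illustration. A library such as scikit-learn would normally be used in practice.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full data."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled samples round-robin across the k folds,
    # so every fold receives its proportional share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical imbalanced labels (illustrative only): 20 positive, 50 negative.
labels = [1] * 20 + [0] * 50
splits = list(stratified_k_fold(labels, k=5))

for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {fold_number}: {len(test_idx)} test samples, {positives} positive")
```

With these hypothetical labels, every test fold keeps the same 2:5 ratio of positive to negative samples as the full set, which is the property that makes stratification useful when the target class is imbalanced. A classifier would be trained on train_idx and scored on test_idx in each round, with the per-fold accuracies then averaged.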






How to Cite

Widodo, S., Brawijaya, H., & Samudi, S. (2022). Stratified K-fold cross validation optimization on machine learning for prediction. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 7(4), 2407-2414.