Comparative analysis of resampling techniques on Machine Learning algorithm
DOI: 10.33395/sinkron.v7i2.11427
Abstract
Classification algorithms in data science generally assume that the classes in the training data are equally distributed. However, datasets from real-world problems often have an imbalanced class distribution, consisting of a majority class and a minority class. The minority class is usually of greater interest and more important to identify, so correctly classifying a minority class sample is more valuable than correctly classifying a majority class sample. An imbalanced class distribution makes it difficult for a classification algorithm to classify minority class samples correctly. When a model performs well on majority class samples but poorly on minority class samples, the imbalance becomes a crucial problem to address. Many solutions have been proposed, chiefly oversampling the minority class and/or undersampling the majority class. In this study, the authors apply various resampling techniques and test them with various machine learning classification algorithms to find the combinations of resampling technique and algorithm that achieve high recall on minority class samples while still taking majority class classification into account.
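As an illustration of the two families of techniques named in the abstract, the following is a minimal sketch of random oversampling, random undersampling, and recall on a toy binary dataset. It is not the authors' implementation: the abstract does not say which resampling variants or libraries were used, and the function names, the toy data, and the assumption of a binary problem whose minority class is the smaller one are all hypothetical.

```python
import random
from collections import Counter


def split_indices(labels, minority):
    """Return (minority_indices, majority_indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(samples, labels, minority, seed=0):
    """Duplicate randomly chosen minority samples until the classes are the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]


def random_undersample(samples, labels, minority, seed=0):
    """Keep a random subset of majority samples equal in size to the minority class."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [samples[i] for i in keep], [labels[i] for i in keep]


def recall(y_true, y_pred, positive):
    """Recall = TP / (TP + FN) for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (hypothetical): 8 majority samples (label 0), 2 minority samples (label 1).
X = list(range(10))
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y, minority=1)
X_under, y_under = random_undersample(X, y, minority=1)

print(Counter(y_over))   # 8 samples of each class
print(Counter(y_under))  # 2 samples of each class

# One of the two minority samples is found, so minority recall is 0.5.
print(recall([1, 1, 0, 0], [1, 0, 0, 0], positive=1))
```

In practice, resampling of this kind is normally applied to the training split only, so that recall on the minority class is measured against the original class distribution.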
License
Copyright (c) 2022 Tri Suci Amelia, Mila Nirmala Sari Hasibuan, Rahmadani Pane

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.