Comparative analysis of resampling techniques on Machine Learning algorithm

Authors

  • Tri Suci Amelia Universitas Labuhanbatu
  • Mila Nirmala Sari Hasibuan Universitas Labuhanbatu, Indonesia
  • Rahmadani Pane Universitas Labuhanbatu, Indonesia

DOI:

10.33395/sinkron.v7i2.11427

Abstract

Classification algorithms in data science generally assume that the classes in the training data are equally distributed. However, real-world datasets often have an unbalanced class distribution, consisting of a majority class and a minority class. The minority class is usually the one of greater interest and the more important to identify, so correctly classifying a minority class sample is more valuable than correctly classifying a majority class sample. An unbalanced class distribution makes it difficult for a classification algorithm to classify minority class samples correctly. If a model performs well on majority class samples but poorly on minority class samples, the imbalance is a crucial problem that must be addressed. Many solutions have been proposed for this problem, notably oversampling the minority class and/or undersampling the majority class. In this study, the authors applied various resampling techniques and tested them with various machine learning classification algorithms to find the combinations of resampling technique and algorithm that achieve high recall on minority class samples while still taking majority class classification into account.
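The two resampling families named in the abstract, and the recall metric used to compare them, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`random_oversample`, `random_undersample`, `recall`), the binary-label assumption, and the toy data are all hypothetical, and only the simplest random variant of each technique is shown.

```python
import random
from collections import Counter


def random_oversample(X, y, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes are equal in size.

    Assumes a binary problem: every label other than `minority` is majority.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_count = len(y) - len(minority_idx)
    # Draw (with replacement) as many extra minority indices as are missing.
    extra = [rng.choice(minority_idx)
             for _ in range(majority_count - len(minority_idx))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def random_undersample(X, y, minority, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    # Sample without replacement so no majority sample is kept twice.
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


def recall(y_true, y_pred, positive):
    """Recall for `positive`: true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 8 majority samples (label 0), 2 minority samples (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y, minority=1)
X_under, y_under = random_undersample(X, y, minority=1)

print(Counter(y_over))   # class counts after oversampling
print(Counter(y_under))  # class counts after undersampling
print(recall([1, 1, 0, 0], [1, 0, 0, 0], positive=1))
```

In practice, resampling of this kind is normally applied to the training split only, so that recall on the minority class is measured against the original, unbalanced distribution.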

How to Cite

Amelia, T. S., Hasibuan, M. N. S., & Pane, R. (2022). Comparative analysis of resampling techniques on Machine Learning algorithm. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 6(2), 628-634. https://doi.org/10.33395/sinkron.v7i2.11427
