Identification of 10 Regional Indonesian Languages Using Machine Learning

Authors

  • Azhar Baihaqi Nugraha, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia
  • Ade Romadhony, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia

DOI:

10.33395/sinkron.v8i4.12989

Keywords:

Text Classification, Language Identification, Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes Classifier (NBC)

Abstract

Language identification is an important step in processing text written in Indonesia's many regional languages, which vary widely in script and spoken form. As a task within Natural Language Processing, language identification is frequently framed as text classification. In this study, we identify 10 Indonesian languages using the NusaX dataset, with the overall objective of determining a text's language from context. We train six machine learning classifiers: Support Vector Machine, Naïve Bayes Classifier, Decision Tree, Rocchio Classification, Logistic Regression, and Random Forest. We combine these classifiers with two feature extraction approaches, N-gram and TF-IDF. The resulting models identify Indonesian languages effectively; the Naïve Bayes Classifier performs best, reaching an accuracy of 99.2% with TF-IDF and 99.4% with N-gram. An error analysis shows that misclassifications often stem from words shared across different languages. The study is motivated by the need for a robust language identification model for Indonesian regional languages, and the results are promising for automated language processing applications in this linguistically diverse setting.
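
As a rough illustration of the N-gram plus Naïve Bayes combination described above, the following is a minimal pure-Python sketch and not the authors' implementation. It assumes character n-grams of length 1 to 3 and add-one smoothing, neither of which is stated in the abstract. The names `char_ngrams` and `NgramNaiveBayes` are hypothetical, and the training phrases are invented toy examples, not NusaX data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams (n_min..n_max) of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed add-one smoothing)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each label for the given text."""
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = [g for g in char_ngrams(text) if g in self.vocab]  # skip unseen n-grams
        scores = {}
        for label, gram_counts in self.counts.items():
            score = math.log(self.doc_counts[label] / n_docs)      # log prior
            denom = self.totals[label] + vocab_size
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / denom)  # smoothed likelihood
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy phrases for illustration only (NOT taken from NusaX).
train_texts = [
    "saya mau makan nasi", "kamu pergi ke mana",           # Indonesian
    "aku arep mangan sega", "kowe lunga menyang endi",     # Javanese
    "abdi bade tuang sangu", "anjeun bade angkat ka mana", # Sundanese
]
train_labels = ["indonesian", "indonesian", "javanese",
                "javanese", "sundanese", "sundanese"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("aku arep lunga"))
print(model.predict("saya pergi makan"))
```

Character n-grams are used here because they need no tokenizer and still capture spelling patterns that differ between closely related languages. Words shared across languages, such as "mana" in the toy data, contribute similar evidence to several classes, which is the kind of overlap the error analysis points to.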

How to Cite

Nugraha, A. B., & Romadhony, A. (2023). Identification of 10 Regional Indonesian Languages Using Machine Learning. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 7(4), 2203-2214. https://doi.org/10.33395/sinkron.v8i4.12989