Identification of 10 Regional Indonesian Languages Using Machine Learning

Azhar Baihaqi Nugraha; Ade Romadhony

doi:10.33395/sinkron.v8i4.12989

Authors

Azhar Baihaqi Nugraha, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia
Ade Romadhony, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia

Keywords:

Text Classification, Language Identification, Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes Classifier (NBC)

Abstract

Language identification is an essential step in processing Indonesia's many regional languages, which vary widely in both script and spoken form. As a component of Natural Language Processing, it is commonly framed as a text classification problem. In this study, we identify 10 Indonesian languages using the NusaX dataset, with the aim of determining a text's language from its context. We compare six machine learning methods (Support Vector Machine, Naïve Bayes Classifier, Decision Tree, Rocchio Classification, Logistic Regression, and Random Forest), each combined with two feature extraction approaches: N-gram and TF-IDF. The resulting models identify Indonesian languages effectively. The Naïve Bayes Classifier performs best, reaching an accuracy of 99.2% with TF-IDF and 99.4% with N-gram. An error analysis shows that misclassifications often stem from words shared across different languages. The work is motivated by the need for a robust language identification model for Indonesian regional languages, and the results are promising for automated language processing applications in this linguistically diverse setting.
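To make the best-performing combination in the abstract more concrete, the following is a minimal sketch of an N-gram Naïve Bayes language identifier. It is not the authors' implementation. The abstract does not state whether the N-grams are character-level or word-level, which value of N was used, or how smoothing was done, so character trigrams and add-one smoothing are assumptions here. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the six toy training sentences are hypothetical illustrations and are not drawn from NusaX. TF-IDF weighting is not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing).

    Hypothetical illustration; not the model evaluated in the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram -> count
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (assumed for illustration only, not taken from NusaX).
train_texts = [
    "saya mau makan nasi", "saya tidak tahu",
    "aku arep mangan sega", "aku ora ngerti",
    "abdi bade tuang sangu", "abdi henteu terang",
]
train_labels = [
    "indonesian", "indonesian",
    "javanese", "javanese",
    "sundanese", "sundanese",
]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("aku arep mangan"))
```

Because every N-gram contributes to the score of every language, a sentence made up mostly of words common to several languages yields nearly equal scores, which is consistent with the error pattern reported in the abstract.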
