Identification of 10 Regional Indonesian Languages Using Machine Learning


  • Azhar Baihaqi Nugraha, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia
  • Ade Romadhony, School of Computing, Study Program of Informatics, Telkom University, Bandung, Indonesia




Text Classification, Language Identification, Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes Classifier (NBC)


Language identification is an essential first step in processing Indonesia's diverse regional languages, which span a wide range of scripts and spoken forms. It is a core task in Natural Language Processing and is frequently framed as text classification. In this study, we identify 10 Indonesian languages using the NusaX dataset, with the goal of determining the language of a text from its content. We apply six machine learning techniques: Support Vector Machine, Naïve Bayes Classifier, Decision Tree, Rocchio Classification, Logistic Regression, and Random Forest. Each is combined with one of two feature extraction approaches, N-gram and TF-IDF. The resulting models identify Indonesian languages accurately. The Naïve Bayes Classifier performs best, achieving an accuracy of 99.2% with TF-IDF and 99.4% with N-gram. Error analysis shows that misclassifications often stem from words shared across different languages. These results are promising for automated language processing and understanding of Indonesian regional languages.
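
To make the pairing of N-gram features with a Naïve Bayes classifier concrete, the following is a minimal sketch and not the authors' implementation. It assumes character trigrams and add-one (Laplace) smoothing, neither of which is specified in the abstract. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the six toy sentences are hypothetical illustrations and are not taken from NusaX.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                      # n-gram length (assumed, not from the paper)
        self.alpha = alpha              # additive smoothing constant (assumed)
        self.ngram_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training sentences written for this sketch; they are NOT NusaX samples.
train_texts = [
    "saya mau makan nasi", "saya tidak tahu",
    "aku arep mangan sega", "aku ora ngerti",
    "abdi hoyong tuang sangu", "abdi henteu terang",
]
train_labels = [
    "indonesian", "indonesian",
    "javanese", "javanese",
    "sundanese", "sundanese",
]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("saya mau tidur"))
print(model.predict("aku arep turu"))
```

Because the classifier scores a text by the n-grams it shares with each language's training data, words common to several languages contribute similar evidence to each of them, which is consistent with the error pattern reported above.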

Nugraha, A. B., & Romadhony, A. (2023). Identification of 10 Regional Indonesian Languages Using Machine Learning. Sinkron: Jurnal Dan Penelitian Teknik Informatika, 8(4), 2203-2214.