Leveraging Label Preprocessing for Effective End-to-End Indonesian Automatic Speech Recognition
DOI: 10.33395/sinkron.v9i1.14257

Keywords: Automatic Speech Recognition (ASR), Label Preprocessing, Low-Resource Language, Self-Supervised Speech Representation Learning, wav2vec 2.0

Abstract
This research explores the potential of improving low-resource Automatic Speech Recognition (ASR) performance by leveraging label preprocessing techniques in conjunction with the wav2vec2-large Self-Supervised Learning (SSL) model. ASR technology plays a critical role in enhancing educational accessibility for children with disabilities in Indonesia, yet its development faces challenges due to limited labeled datasets. SSL models like wav2vec 2.0 have shown promise by learning rich speech representations from raw audio with minimal labeled data. Still, their dependence on large training corpora and significant computational resources limits their application in low-resource settings. This study introduces a label preprocessing technique to address these limitations, comparing three scenarios: training without preprocessing, with the proposed preprocessing method, and with an alternative method. Using only 16 hours of labeled data, the proposed preprocessing approach achieves a Word Error Rate (WER) of 15.83%, significantly outperforming the baseline scenario (33.45% WER) and the alternative preprocessing method (19.62% WER). Training for additional epochs with the proposed preprocessing technique further reduces the WER to 14.00%. These results highlight the effectiveness of label preprocessing in reducing data dependency while enhancing model performance. The findings demonstrate the feasibility of developing robust ASR models for low-resource languages, offering a scalable solution for advancing ASR technology and improving educational accessibility, particularly for underrepresented languages.
References
Abidin, T. F., Misbullah, A., Ferdhiana, R., Aksana, M. Z., & Farsiah, L. (2020). Deep Neural Network for Automatic Speech Recognition from Indonesian Audio using Several Lexicon Types. Proceedings of the International Conference on Electrical Engineering and Informatics (ICELTICs), IEEE, 1–5. https://doi.org/10.1109/ICELTICs50595.2020.9315538
Abidin, T. F., Misbullah, A., Ferdhiana, R., Farsiah, L., Aksana, M. Z., & Riza, H. (2022). Acoustic Model with Multiple Lexicon Types for Indonesian Speech Recognition. Applied Computational Intelligence and Soft Computing. https://doi.org/10.1155/2022/3227828
Aji, A. F., Winata, G. I., Koto, F., Cahyawijaya, S., Romadhony, A., Mahendra, R., … Ruder, S. (2022). One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 1, 7226–7249. https://doi.org/10.48550/arXiv.2203.13357
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., … Weber, G. (2020). Common Voice: A Massively-Multilingual Speech Corpus. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), 4218–4222. https://doi.org/10.48550/arXiv.1912.06670
Arisaputra, P., & Zahra, A. (2022). Indonesian Automatic Speech Recognition with XLSR-53. Ingénierie des Systèmes d'Information, 27(6), 973–982. https://doi.org/10.18280/isi.270614
Ashshidiqi, M. H., & Wijiastuti, A. (2020). Teknologi Asistif Text to Speech (TTS) pada Kemampuan Membaca Pemahaman Anak Disleksia [Text-to-Speech (TTS) assistive technology for the reading comprehension of children with dyslexia]. Jurnal Pendidikan Khusus, 15(1).
Baevski, A., Hsu, W. N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. Proceedings of the 39th International Conference on Machine Learning, PMLR, 162, 1298–1312. https://doi.org/10.48550/arXiv.2202.03555
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449–12460. https://doi.org/10.48550/arXiv.2006.11477
Chen, Y.-C., Shen, C.-H., Huang, S.-F., Lee, H., & Lee, L. (2018). Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data. arXiv preprint arXiv:1810.12566. https://doi.org/10.48550/arXiv.1810.12566
Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., … Bapna, A. (2023). FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. 2022 IEEE Spoken Language Technology Workshop (SLT), 798–805. https://doi.org/10.1109/SLT54892.2023.10023141
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171–4186. https://doi.org/10.48550/arXiv.1810.04805
Dubey, P., & Shah, B. (2022). Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents. arXiv preprint arXiv:2204.00977. https://doi.org/10.48550/arXiv.2204.00977
Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460. https://doi.org/10.1109/TASLP.2021.3122291
Laptev, A., Majumdar, S., & Ginsburg, B. (2022). CTC Variations Through New WFST Topologies. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 1041–1045. https://doi.org/10.21437/Interspeech.2022-10854
Malik, M., Malik, M. K., Mehmood, K., & Makhdoom, I. (2021). Automatic speech recognition: a survey. Multimedia Tools and Applications, 80(6), 9411–9457. https://doi.org/10.1007/s11042-020-10073-7
McFee, B., McVicar, M., Faronbi, D., Roman, I., Gover, M., Balke, S., … Pimenta, W. (2024). librosa/librosa: 0.10.2.post1. Zenodo. https://doi.org/10.5281/zenodo.11192913
Suyanto, S., Arifianto, A., Sirwan, A., & Rizaendra, A. P. (2020). End-to-End Speech Recognition Models for a Low-Resourced Indonesian Language. 2020 8th International Conference on Information and Communication Technology (ICoICT), IEEE, 1–6. https://doi.org/10.1109/ICoICT49345.2020.9166346
Tawaqal, B., & Suyanto, S. (2021). Recognizing Five Major Dialects in Indonesia Based on MFCC and DRNN. Journal of Physics: Conference Series, 1844(1). https://doi.org/10.1088/1742-6596/1844/1/012003
Vāravs, A., & Salimbajevs, A. (2018). Restoring Punctuation and Capitalization Using Transformer Models. Statistical Language and Speech Processing: 6th International Conference, 91–102. https://doi.org/10.1007/978-3-030-00810-9_9
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
Zeyer, A., Schlüter, R., & Ney, H. (2021). Why does CTC result in peaky behavior? arXiv preprint arXiv:2105.14849. https://doi.org/10.48550/arXiv.2105.14849
License
Copyright (c) 2025 Mohammad Noval Althoff, Affandy Affandy, Ardytha Luthfiarta, Mohammad Wahyu Bagus Dwi Satya, Halizah Basiron

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.