Leveraging Label Preprocessing for Effective End-to-End Indonesian Automatic Speech Recognition
DOI: 10.33395/sinkron.v9i1.14257

Keywords: Automatic Speech Recognition (ASR), Label Preprocessing, Low-Resource Language, Self-Supervised Speech Representation Learning, wav2vec 2.0

Abstract
This research explores the potential of improving low-resource Automatic Speech Recognition (ASR) performance by leveraging label preprocessing techniques in conjunction with the wav2vec2-large Self-Supervised Learning (SSL) model. ASR technology plays a critical role in enhancing educational accessibility for children with disabilities in Indonesia, yet its development faces challenges due to limited labeled datasets. SSL models like wav2vec 2.0 have shown promise by learning rich speech representations from raw audio with minimal labeled data. Still, their dependence on large training corpora and significant computational resources limits their application in low-resource settings. This study introduces a label preprocessing technique to address these limitations, comparing three scenarios: training without preprocessing, with the proposed preprocessing method, and with an alternative method. Using only 16 hours of labeled data, the proposed preprocessing approach achieves a Word Error Rate (WER) of 15.83%, significantly outperforming the baseline scenario (33.45% WER) and the alternative preprocessing method (19.62% WER). Training for additional epochs with the proposed preprocessing technique further reduces the WER to 14.00%. These results highlight the effectiveness of label preprocessing in reducing data dependency while enhancing model performance. The findings demonstrate the feasibility of developing robust ASR models for low-resource languages, offering a scalable solution for advancing ASR technology and improving educational accessibility, particularly for underrepresented languages.
References
Abidin, T. F., Misbullah, A., Ferdhiana, R., Aksana, M. Z., & Farsiah, L. (2020). Deep Neural Network for Automatic Speech Recognition from Indonesian Audio using Several Lexicon Types. Proceedings of the International Conference on Electrical Engineering and Informatics (ICELTICs), IEEE, 1–5. https://doi.org/10.1109/ICELTICs50595.2020.9315538
Abidin, T. F., Misbullah, A., Ferdhiana, R., Farsiah, L., Aksana, M. Z., & Riza, H. (2022). Acoustic Model with Multiple Lexicon Types for Indonesian Speech Recognition. Applied Computational Intelligence and Soft Computing. https://doi.org/10.1155/2022/3227828
Aji, A. F., Winata, G. I., Koto, F., Cahyawijaya, S., Romadhony, A., Mahendra, R., … Ruder, S. (2022). One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 1, 7226–7249. https://doi.org/10.48550/arXiv.2203.13357
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., … Weber, G. (2020). Common Voice: A Massively-Multilingual Speech Corpus. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), 4218–4222. https://doi.org/10.48550/arXiv.1912.06670
Arisaputra, P., & Zahra, A. (2022). Indonesian Automatic Speech Recognition with XLSR-53. Ingénierie des Systèmes d'Information, 27(6), 973–982. https://doi.org/10.18280/isi.270614
Ashshidiqi, M. H., & Wijiastuti, A. (2020). Teknologi Asistif Text to Speech (TTS) pada Kemampuan Membaca Pemahaman Anak Disleksia [Text-to-Speech (TTS) assistive technology for the reading comprehension of children with dyslexia]. Jurnal Pendidikan Khusus, 15(1).
Baevski, A., Hsu, W. N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. Proceedings of the 39th International Conference on Machine Learning, PMLR, 162, 1298–1312. https://doi.org/10.48550/arXiv.2202.03555
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449–12460. https://doi.org/10.48550/arXiv.2006.11477
Chen, Y.-C., Shen, C.-H., Huang, S.-F., Lee, H., & Lee, L. (2018). Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data. arXiv preprint arXiv:1810.12566. https://doi.org/10.48550/arXiv.1810.12566
Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., … Bapna, A. (2023). FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. 2022 IEEE Spoken Language Technology Workshop (SLT), 798–805. https://doi.org/10.1109/SLT54892.2023.10023141
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171–4186. https://doi.org/10.48550/arXiv.1810.04805
Dubey, P., & Shah, B. (2022). Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents. arXiv preprint arXiv:2204.00977. https://doi.org/10.48550/arXiv.2204.00977
Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460. https://doi.org/10.1109/TASLP.2021.3122291
Laptev, A., Majumdar, S., & Ginsburg, B. (2022). CTC Variations Through New WFST Topologies. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 1041–1045. https://doi.org/10.21437/Interspeech.2022-10854
Malik, M., Malik, M. K., Mehmood, K., & Makhdoom, I. (2021). Automatic speech recognition: a survey. Multimedia Tools and Applications, 80(6), 9411–9457. https://doi.org/10.1007/s11042-020-10073-7
McFee, B., McVicar, M., Faronbi, D., Roman, I., Gover, M., Balke, S., … Pimenta, W. (2024). librosa/librosa: 0.10.2.post1. Zenodo. https://doi.org/10.5281/zenodo.11192913
Suyanto, S., Arifianto, A., Sirwan, A., & Rizaendra, A. P. (2020). End-to-End Speech Recognition Models for a Low-Resourced Indonesian Language. 2020 8th International Conference on Information and Communication Technology (ICoICT), IEEE, 1–6. https://doi.org/10.1109/ICoICT49345.2020.9166346
Tawaqal, B., & Suyanto, S. (2021). Recognizing Five Major Dialects in Indonesia Based on MFCC and DRNN. Journal of Physics: Conference Series, 1844(1). https://doi.org/10.1088/1742-6596/1844/1/012003
Vāravs, A., & Salimbajevs, A. (2018). Restoring Punctuation and Capitalization Using Transformer Models. Statistical Language and Speech Processing: 6th International Conference, 91–102. https://doi.org/10.1007/978-3-030-00810-9_9
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
Zeyer, A., Schlüter, R., & Ney, H. (2021). Why does CTC result in peaky behavior? arXiv preprint arXiv:2105.14849. https://doi.org/10.48550/arXiv.2105.14849
License
Copyright (c) 2025 Mohammad Noval Althoff, Affandy Affandy, Ardytha Luthfiarta, Mohammad Wahyu Bagus Dwi Satya, Halizah Basiron

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.