Indonesian Spelling Error Detection and Type Identification Using Bigram Vector and Minimum Edit Distance Based Probabilities

Authors

  • Emmy Erwina, Universitas Harapan Medan, Indonesia
  • Tommy Tommy, Universitas Harapan Medan, Indonesia
  • Mayasari Mayasari, Universitas Harapan Medan, Indonesia

DOI:

10.33395/sinkron.v6i1.11224

Keywords:

bigram; minimum edit distance; probabilities; spelling; vector

Abstract

Spelling errors are increasingly common, as seen in word usage that tends to follow trends or culture, especially among the younger generation. This study aims to develop and test a model for spelling error detection and error type identification that combines Bigram Vector and Minimum Edit Distance Based Probabilities. The correct word for each error word is obtained through a candidate search and probability calculations that adopt the concept of minimum edit distance. Each detected error is then classified into one of three types, namely vowel, consonant, or diphthong errors, according to the characters that writers tend to use as a result of phonemic rendering at the time of writing. The detection and type identification results are quite good: most of the error test data can be detected and identified according to the type of error, although some detections are ambiguous and return more than one correct word because those words have the same probability value.
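
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon, the cosine similarity threshold `min_sim`, the scoring weight 1/(1 + d), and the rules in `error_type` are all assumptions made for illustration, since the abstract does not give these details. The sketch builds character bigram vectors, selects candidates by vector similarity, turns minimum edit distances into normalised probabilities, and labels the differing characters as a vowel, consonant, or diphthong error.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon; the dictionary used in the paper is not reproduced here.
LEXICON = ["saya", "kamu", "pantai", "pandai", "sungai", "santai"]
VOWELS = set("aiueo")
DIPHTHONGS = ("ai", "au", "ei", "oi")

def bigram_vector(word):
    """Count character bigrams, using '#' to mark the word boundaries."""
    padded = f"#{word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine(u, v):
    """Cosine similarity between two sparse bigram count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def correct(word, lexicon=LEXICON, min_sim=0.3):
    """Return (is_error, [(candidate, probability), ...]) for the top-scoring candidates."""
    if word in lexicon:
        return False, [(word, 1.0)]
    wv = bigram_vector(word)
    candidates = [w for w in lexicon if cosine(wv, bigram_vector(w)) >= min_sim]
    if not candidates:
        return True, []
    # Assumed scoring: weight 1 / (1 + distance), normalised over all candidates.
    weights = {w: 1.0 / (1 + edit_distance(word, w)) for w in candidates}
    total = sum(weights.values())
    probs = {w: weights[w] / total for w in candidates}
    best = max(probs.values())
    # Several words are returned when they share the highest probability.
    return True, sorted((w, p) for w, p in probs.items() if p == best)

def error_type(wrong, right):
    """Label the differing span as 'diphthong', 'vowel' or 'consonant' (assumed rules)."""
    limit = min(len(wrong), len(right))
    i = 0
    while i < limit and wrong[i] == right[i]:
        i += 1  # length of the common prefix
    j = 0
    while j < limit - i and wrong[-1 - j] == right[-1 - j]:
        j += 1  # length of the common suffix
    changed = right[i:len(right) - j] or wrong[i:len(wrong) - j]
    window = right[max(i - 1, 0):len(right) - j + 1]  # changed span plus its neighbours
    if any(d in window for d in DIPHTHONGS):
        return "diphthong"
    if any(c in VOWELS for c in changed):
        return "vowel"
    return "consonant"

print(correct("pantay"))               # a single best candidate
print(correct("panai"))                # two candidates tie on probability
print(error_type("pantay", "pantai"))  # the changed character belongs to "ai"
```

With this toy lexicon, "panai" lies at edit distance 1 from both "pandai" and "pantai", so both receive the same probability and both are returned. This is the kind of tie that the abstract reports as a source of detection errors.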

How to Cite

Erwina, E., Tommy, T., & Mayasari, M. (2021). Indonesian Spelling Error Detection and Type Identification Using Bigram Vector and Minimum Edit Distance Based Probabilities. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 5(2B), 183-190. https://doi.org/10.33395/sinkron.v6i1.11224