Improving Tesseract OCR Accuracy Using SymSpell Algorithm on Passport Data
DOI:
10.33395/sinkron.v9i1.14395
Keywords:
Corpus, Optical Character Recognition, Passport, SymSpell, Tesseract
Abstract
Optical Character Recognition (OCR) is a technology for recognizing text in images or digital documents, such as passports. Tesseract is a popular OCR tool because it offers high accuracy. However, OCR accuracy is often degraded by factors such as image noise and non-text elements. This article discusses applying the SymSpell algorithm as a post-processing step to improve OCR accuracy on standard Indonesian passports. The OCR focuses on the Visual Inspection Zone, specifically the Place of Birth and Issuing Office values. Unlike the Machine Readable Zone, which consists of individual codes on a clear background, the Visual Inspection Zone is prone to OCR errors because holograms obscure the text and the layout is spaced out. SymSpell is an edit-distance-based spelling correction algorithm designed to process data quickly and efficiently, even on very large datasets. In this study, SymSpell detects and corrects errors in OCR results by comparing them against a corpus word list. Experimental results on 10 passport scans and photos showed that integrating SymSpell, following the Research and Development methodology, improved the OCR accuracy rate by 21.43% for certain Place of Birth and Issuing Office data from the Visual Inspection Zone. With this approach, OCR systems can provide more reliable results in practical applications.
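The corpus-based correction described in the abstract can be illustrated with a minimal sketch of the symmetric-delete idea behind SymSpell. This is not the authors' implementation: the corpus entries, the function names, and the distance threshold of 2 are assumptions chosen for illustration, and plain Levenshtein distance is used to keep the code short, whereas the SymSpell project describes its distance as Damerau-Levenshtein.

```python
from itertools import combinations


def deletes(word, max_distance):
    """Return the word plus every string made by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def build_index(corpus, max_distance=2):
    """Map every delete-variant of each corpus term back to that term."""
    index = {}
    for term in corpus:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def correct(token, index, max_distance=2):
    """Return the closest corpus term within max_distance, or the token unchanged."""
    candidates = set()
    for variant in deletes(token, max_distance):
        candidates |= index.get(variant, set())
    scored = [(edit_distance(token, c), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] <= max_distance]
    return min(scored)[1] if scored else token


# Hypothetical word list; a real corpus would hold the expected
# Place of Birth and Issuing Office values.
CORPUS = ["JAKARTA", "BANDUNG", "SURABAYA", "DEPOK", "MEDAN"]
INDEX = build_index(CORPUS)

if __name__ == "__main__":
    for raw in ["JAKARIA", "DEP0K", "XYZ"]:
        print(raw, "->", correct(raw, INDEX))
```

Because deletions are precomputed for the corpus and generated only for the input token, each lookup touches a small number of dictionary keys instead of comparing the token against every corpus entry. A token with no candidate inside the threshold is returned unchanged rather than forced onto a corpus term.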
License
Copyright (c) 2025 Iqbaluddin Syam Had, Wiga Maulana Baihaqi, Dwi Putriana Nuramanah Kinding

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.