Improving Tesseract OCR Accuracy Using SymSpell Algorithm on Passport Data

Authors

  • Iqbaluddin Syam Had, Amikom Purwokerto University
  • Wiga Maulana Baihaqi, Amikom Purwokerto University
  • Dwi Putriana Nuramanah Kinding, Jenderal Soedirman University

DOI:

10.33395/sinkron.v9i1.14395

Keywords:

Corpus, Optical Character Recognition, Passport, SymSpell, Tesseract

Abstract

Optical Character Recognition (OCR) is a technology for recognizing text in images or digital documents, such as passports. Tesseract is a popular OCR tool because it offers high accuracy. However, OCR accuracy is often degraded by factors such as image noise and non-text elements. This article discusses the application of the SymSpell algorithm as a post-processing step to improve OCR accuracy on standard Indonesian passports. The OCR focuses on the Visual Inspection Zone, specifically the Place of Birth and Issuing Office values. Unlike the Machine Readable Zone, which consists of individual codes on a clear background, the Visual Inspection Zone often suffers OCR errors because holograms obscure the text and the layout is widely spaced. SymSpell is an edit-distance-based spelling correction algorithm designed to process data quickly and efficiently, even on very large datasets. In this study, SymSpell detects and corrects errors in OCR results by comparing them against a corpus word list. Experimental results on 10 tested passport scans and photos showed that integrating SymSpell, following the Research and Development methodology, improved the OCR accuracy rate by 21.43% for certain Place of Birth and Issuing Office data from the Visual Inspection Zone. With this approach, OCR systems can provide more reliable results for practical applications.
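The correction step described in the abstract can be illustrated with a minimal sketch of the symmetric-delete idea behind SymSpell: deletion variants of every corpus term are indexed ahead of time, and an OCR token is matched by generating its own deletion variants and verifying candidates with an edit distance. This is not the authors' implementation and does not use the SymSpell library itself. The corpus entries, the function names (deletes, build_index, correct), and the maximum distance of 2 are assumptions chosen for illustration, and plain Levenshtein distance stands in for whichever distance measure the study used.

```python
def deletes(word, max_distance):
    """Return word plus every string made by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def build_index(corpus, max_distance=2):
    """Map each deletion variant to the corpus terms that produce it."""
    index = {}
    for term in corpus:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def correct(token, index, max_distance=2):
    """Return the closest corpus term within max_distance, else the token unchanged."""
    candidates = set()
    for variant in deletes(token, max_distance):
        candidates |= index.get(variant, set())
    scored = [(edit_distance(token, c), c) for c in candidates]
    scored = [s for s in scored if s[0] <= max_distance]
    return min(scored)[1] if scored else token


# Hypothetical corpus of place names; the study's actual word list is not given here.
CORPUS = ["PURWOKERTO", "JAKARTA", "BANDUNG", "SURABAYA", "DEPOK"]
index = build_index(CORPUS)

print(correct("PURW0KERT0", index))  # PURWOKERTO (two zeros misread for the letter O)
print(correct("DEP0K", index))       # DEPOK
```

Because only deletions are generated on both sides, lookups stay cheap even for large word lists; the edit-distance check is run only on the few candidates that share a deletion variant with the OCR token.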




How to Cite

Had, I. S., Maulana Baihaqi, W., & Putriana Nuramanah Kinding, D. (2025). Improving Tesseract OCR Accuracy Using SymSpell Algorithm on Passport Data. Sinkron: Jurnal Dan Penelitian Teknik Informatika, 9(1), 374–381. https://doi.org/10.33395/sinkron.v9i1.14395