Adaptive Learning System Based on Human-in-the-Loop for PDF Template Data Extraction

Authors

  • Moh Syaiful Rahman, Magister Teknologi Informasi, Fakultas Teknologi Komunikasi dan Informatika, Universitas Nasional
  • Andrianingsih, Magister Teknologi Informasi, Fakultas Teknologi Komunikasi dan Informatika, Universitas Nasional

DOI:

10.33395/sinkron.v10i1.15598

Keywords:

Adaptive Learning, Conditional Random Fields, Human-in-the-Loop, Hybrid Architecture, Incremental Learning, PDF Data Extraction, Template Processing

Abstract

Extracting data from PDF templates remains difficult because the documents are semi-structured and their layouts vary. Large pre-trained models achieve high accuracy but require extensive computational resources and labeled datasets, which makes them impractical in resource-constrained environments. Rule-based approaches, conversely, are efficient but rigid. This research addresses the gap with an adaptive learning system that combines rule-based extraction with Conditional Random Fields (CRF) in a hybrid framework designed for data-scarce scenarios. The system runs extraction strategies in parallel, selects among their outputs by confidence, and uses Human-in-the-Loop (HITL) feedback for incremental learning: pattern learning updates the rule-based strategies, while the CRF models are retrained incrementally. Evaluated on synthetically generated documents across diverse template types, the system achieves 98.61% accuracy with minimal training data and a 7% user correction rate, demonstrating high learning efficiency (1.88 corrections per percentage point). The improvement is statistically significant (paired t-test, p < 0.001, Cohen’s d = 8.95). The system operates on CPU-only hardware with a 50-100 MB footprint and a processing time of 0.1-0.5 seconds. This work fills a practical gap in document extraction by providing a middle-ground solution that balances high accuracy, minimal data requirements, low resource consumption, and real-time adaptability, making it suitable for small organizations and rapid deployments where large models are impractical. The evaluation uses synthetic data to ensure reproducibility and controlled assessment; validation on real-world documents would strengthen its practical applicability.
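
The confidence-based selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` type, the two toy strategies, the hard-coded confidence scores, the "Invoice No:" label, and the 0.8 review threshold are all hypothetical and chosen only for illustration. In particular, `model_stub` is a simple heuristic standing in for a trained CRF. The sketch runs each strategy on the same text, keeps the most confident result, and flags low-confidence results for human review.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    value: Optional[str]   # extracted field value, or None if nothing was found
    confidence: float      # score in [0, 1]; hard-coded here for illustration
    strategy: str          # name of the strategy that produced the value


def rule_based(text: str) -> Candidate:
    # Hypothetical rule: the field follows the literal label "Invoice No:".
    match = re.search(r"Invoice No:\s*(\S+)", text)
    if match:
        return Candidate(match.group(1), 0.95, "rule")
    return Candidate(None, 0.0, "rule")


def model_stub(text: str) -> Candidate:
    # Stand-in for a sequence model such as a CRF (an assumption, not a real
    # model): return the first token that contains a digit.
    for token in text.split():
        if any(ch.isdigit() for ch in token):
            return Candidate(token, 0.60, "model_stub")
    return Candidate(None, 0.0, "model_stub")


def select(
    text: str,
    strategies: List[Callable[[str], Candidate]],
    review_threshold: float = 0.8,
) -> Tuple[Candidate, bool]:
    # Run every strategy on the same input and keep the most confident result.
    candidates = [strategy(text) for strategy in strategies]
    best = max(candidates, key=lambda c: c.confidence)
    # Route the result to a human reviewer when no strategy is confident enough.
    needs_review = best.value is None or best.confidence < review_threshold
    return best, needs_review


strategies = [rule_based, model_stub]
confident, review_a = select("Invoice No: INV-001 Date: 2024-01-05", strategies)
uncertain, review_b = select("Ref INV-002 issued", strategies)
```

In the first call the rule matches and its result is accepted without review. In the second call only the stub produces a value, its score falls below the threshold, and the result is flagged. In a system like the one described, a flagged value would be the point at which a user correction feeds back into pattern learning and model retraining.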

How to Cite

Rahman, M. S., & Andrianingsih, A. (2026). Adaptive Learning System Based on Human-in-the-Loop for PDF Template Data Extraction. Sinkron: Jurnal dan Penelitian Teknik Informatika, 10(1), 145-160. https://doi.org/10.33395/sinkron.v10i1.15598