Adaptive Learning System Based on Human-in-the-Loop for PDF Template Data Extraction
DOI: 10.33395/sinkron.v10i1.15598
Keywords: Adaptive Learning, Conditional Random Fields, Human-in-the-Loop, Hybrid Architecture, Incremental Learning, PDF Data Extraction, Template Processing
Abstract
Extracting data from PDF templates remains a substantial challenge because the documents are semi-structured and their formats vary. Large pre-trained models achieve high accuracy but require extensive computational resources and labeled datasets, which makes them impractical in resource-constrained environments. Rule-based approaches, conversely, are efficient but rigid. This research addresses the gap with an adaptive learning system that combines rule-based extraction with Conditional Random Fields (CRF) in a hybrid framework designed for data-scarce scenarios. The system runs extraction strategies in parallel, selects among their outputs by confidence, and uses Human-in-the-Loop (HITL) feedback for incremental learning: pattern learning updates the rule-based strategies, while the CRF models are retrained incrementally. Evaluated on synthetically generated documents across diverse template types, the system achieves 98.61% accuracy with minimal training data and a 7% user correction rate, demonstrating high learning efficiency (1.88 corrections per percentage point). The improvement is statistically significant (paired t-test, p < 0.001, Cohen’s d = 8.95). The system operates on CPU-only hardware with a 50-100 MB footprint and a processing time of 0.1-0.5 seconds. This work offers a middle-ground solution that balances high accuracy, minimal data requirements, low resource consumption, and real-time adaptability, making it suitable for small organizations and rapid deployments where large models are impractical. The evaluation uses synthetic data to ensure reproducibility and controlled assessment; real-world validation would strengthen the case for practical applicability.
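The abstract does not give implementation details, so the following is only a minimal sketch of what confidence-based selection among parallel extraction strategies with a human-review fallback might look like. Every name here (`Candidate`, `rule_based_invoice_no`, `stub_model_invoice_no`, `select_best`, the `review_threshold` of 0.8, and the fixed confidence values) is a hypothetical chosen for illustration, and the stub model is a simple stand-in for a CRF rather than the authors' method.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    strategy: str          # which extractor produced the value
    value: Optional[str]   # extracted field value, or None if nothing was found
    confidence: float      # score in [0, 1]; the scoring scheme is an assumption


def rule_based_invoice_no(text: str) -> Candidate:
    # Hypothetical rule: a labelled invoice number such as "Invoice No: INV-0042".
    match = re.search(r"Invoice No:\s*([A-Z]+-\d+)", text)
    if match:
        return Candidate("rule", match.group(1), 0.95)
    return Candidate("rule", None, 0.0)


def stub_model_invoice_no(text: str) -> Candidate:
    # Stand-in for a learned sequence labeller (the paper uses a CRF).
    # It takes the first token containing a digit and reports a fixed,
    # moderate confidence, purely for illustration.
    for token in text.split():
        if any(ch.isdigit() for ch in token):
            return Candidate("model", token.strip(".,;"), 0.60)
    return Candidate("model", None, 0.0)


def select_best(
    text: str,
    strategies: List[Callable[[str], Candidate]],
    review_threshold: float = 0.8,
) -> Tuple[Candidate, bool]:
    # Run every strategy on the same text and keep the most confident result.
    candidates = [strategy(text) for strategy in strategies]
    best = max(candidates, key=lambda c: c.confidence)
    # Flag the field for human correction when no value was found or the
    # winning strategy is not confident enough.
    needs_review = best.value is None or best.confidence < review_threshold
    return best, needs_review


strategies = [rule_based_invoice_no, stub_model_invoice_no]
confident, review_a = select_best("Invoice No: INV-0042 dated 2024-01-05", strategies)
uncertain, review_b = select_best("Ref 7781 issued", strategies)
```

In this sketch the first document is resolved by the rule with no review, while the second falls back to the stub model and is flagged for a user correction. In a system like the one described, such corrections would then feed pattern learning and incremental CRF retraining, which this sketch does not attempt to model.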
Copyright (c) 2025 Moh Syaiful Rahman, Andrianingsih

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

