Implementation of Semantic Search in an Academic Repository Using Sentence-BERT and FAISS

Authors

  • Ihsan Lubis, Information System, Faculty of Engineering and Computer Science, Universitas Harapan Medan, Indonesia
  • Husni Lubis, Information System, Faculty of Engineering and Computer Science, Universitas Harapan Medan, Indonesia
  • Inaya Nur Wahidah, Information System, Faculty of Engineering and Computer Science, Universitas Harapan Medan, Indonesia

DOI:

10.33395/sinkron.v10i2.15940

Keywords:

Academic Repository, Semantic Search, Sentence-BERT, FAISS, Information Retrieval

Abstract

Academic repositories serve as centralized platforms for storing and managing scientific documents, including research papers, reports, and administrative records. Traditional keyword-based search systems, however, often fail to deliver relevant results because they do not capture the contextual meaning of user queries, so relevant documents are missed when the query terms differ from the wording in the documents. To overcome this limitation, this study introduces a semantic search approach for academic repositories that combines Sentence-BERT as the text embedding model with FAISS as the vector similarity search engine. In the proposed system, documents stored in a MySQL database are first preprocessed to remove HTML tags and then converted into semantic vector representations using Sentence-BERT. These vectors are indexed with FAISS, which enables fast similarity search based on meaning rather than exact keyword matching. The system architecture uses FastAPI as the backend service for indexing, searching, and evaluation, while CodeIgniter 4 serves as the frontend framework for document management by administrators and end users. Evaluation was carried out using three test sets, each containing ten queries. Performance was measured using Recall@K, normalized Discounted Cumulative Gain (nDCG), Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and search latency. Experimental results show that the system achieved an average Recall@K of 0.61, a MAP of 0.39, and a No-Hit rate of 0.033 (about one query in thirty), meaning nearly all queries retrieved results. Although the nDCG value declined in the third test set, the system consistently returned relevant documents.
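
The abstract names the evaluation metrics but does not give their formulas. The following is a minimal sketch of the standard textbook definitions, assuming binary relevance judgments (a document is either relevant to a query or not). The function names and the sample document IDs are hypothetical and are not taken from the authors' FastAPI evaluation service.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant document, over |relevant|."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the top-k divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

def no_hit_rate(runs, k):
    """Share of queries whose top-k results contain no relevant document."""
    misses = sum(1 for ranked, relevant in runs
                 if not any(doc in relevant for doc in ranked[:k]))
    return misses / len(runs)

# Hypothetical single query: ranked result IDs and the set judged relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, 3))      # one of two relevant docs in the top 3
print(reciprocal_rank(ranked, relevant))     # first relevant doc is at rank 2
print(average_precision(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 3), 4))
```

MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries in a test set. Under this definition, a query counts as a no-hit only when none of its top-k results is relevant.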

How to Cite

Lubis, I., Lubis, H., & Nur Wahidah, I. (2026). Implementation of Semantic Search in an Academic Repository Using Sentence-BERT and FAISS. Sinkron : Jurnal Dan Penelitian Teknik Informatika, 10(2), 1060-1069. https://doi.org/10.33395/sinkron.v10i2.15940