A proposed approach for plagiarism detection in Article documents
DOI: 10.33395/sinkron.v7i2.11381
Keywords: Plagiarism, Plagiarism Detection, Clustering, TFIDF, Cosine similarity
Abstract
Scientific institutions define plagiarism as claiming someone else's ideas or work as one's own without citing the sources. Plagiarism detection systems typically apply a text similarity algorithm to find sentences shared between source and suspicious documents, either by matching the sentences directly or by embedding them as vectors, using TF-IDF or a similar method, and then calculating the distance or similarity between the source and suspect sentence vectors. Cosine similarity is one method for measuring that distance. To restrict detection to related documents, an unsupervised machine learning technique such as K-means can be used to cluster the documents. In this paper, a plagiarism detection application was created and tested on several text document types, including DOC, DOCX, and PDF; research papers collected from the web were used to build the source corpus. To calculate the level of similarity between the suspicious article and the corpus of source articles, the TF-IDF text encoding approach is used together with NLP, K-means clustering, and cosine similarity. The proposed application was run on five different documents and produced different plagiarism ratios: 0.27 for the first document, 0.15 for the second, 0.19 for the third, 0.42 for the fourth, and 0.37 for the fifth. The generated detailed report presents the percentage of plagiarism in the suspicious article. Depending on a threshold value, the application decides whether the suspicious document is acceptable.
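The sentence-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the smoothed IDF formula, the sentence threshold of 0.8, and the document threshold of 0.3 are all assumptions chosen for illustration, and the K-means step that narrows the corpus to related documents is omitted for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def plagiarism_ratio(source_sentences, suspect_sentences, sentence_threshold=0.8):
    """Fraction of suspect sentences whose best source match reaches the threshold.

    sentence_threshold is a hypothetical value, not taken from the paper.
    """
    vectors = tfidf_vectors(source_sentences + suspect_sentences)
    src = vectors[:len(source_sentences)]
    sus = vectors[len(source_sentences):]
    flagged = sum(
        1 for s in sus
        if max((cosine(s, r) for r in src), default=0.0) >= sentence_threshold
    )
    return flagged / len(sus) if sus else 0.0


def is_acceptable(ratio, document_threshold=0.3):
    """Accept the document when its ratio does not exceed the (hypothetical) threshold."""
    return ratio <= document_threshold


if __name__ == "__main__":
    source = [
        "The cosine similarity measures the angle between two vectors.",
        "K-means groups documents into clusters.",
    ]
    suspect = [
        "The cosine similarity measures the angle between two vectors.",
        "Our garden has many red tulips this spring.",
        "Bananas are rich in potassium.",
        "Cats sleep most of the day.",
    ]
    ratio = plagiarism_ratio(source, suspect)
    print(ratio)                 # 0.25: one of four suspect sentences is copied
    print(is_acceptable(ratio))  # True under the assumed 0.3 threshold
```

In a fuller pipeline, a clustering step would presumably run first so that the suspect document is compared only against source documents in its own cluster, which keeps the number of sentence pairs manageable as the corpus grows.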
License
Copyright (c) 2022 Ayoub Ali M. Saeed, Alaa Yaseen Taqa
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.