The effect of Chi-Square Feature Selection on Question Classification using Multinomial Naïve Bayes

Novi Yusliani; Syechky Al Qodrin Aruda; Mastura Diana Marieska; Danny Mathew Saputra; Abdiansah Abdiansah

doi:10.33395/sinkron.v7i4.11788

Authors

Novi Yusliani Universitas Sriwijaya, Indonesia
Syechky Al Qodrin Aruda Universitas Sriwijaya, Indonesia
Mastura Diana Marieska Universitas Sriwijaya, Indonesia
Danny Mathew Saputra Universitas Sriwijaya, Indonesia
Abdiansah Universitas Sriwijaya, Indonesia


Abstract

Question classification is an essential task in a question answering system: it determines the expected answer type (EAT) of a question given to the system. Multinomial Naïve Bayes is one of the learning algorithms that can be used to classify questions. At the classification stage, the algorithm uses the set of features stored in the knowledge model. When the feature space is high-dimensional, the number of features can lead to the curse of dimensionality. Feature selection can reduce the feature dimension and may improve system performance. The Chi-Square algorithm can be used to select the features that best describe each category. In this research, Multinomial Naïve Bayes is used to classify question sentences and the Chi-Square algorithm is used for feature selection. The dataset is a set of Indonesian question sentences consisting of 519 labeled factoid, 491 labeled non-factoid, and 185 labeled other. The test results show that feature selection increases accuracy by 0.1: with feature selection the system reaches an accuracy of 0.87 using 248 features, whereas without feature selection the accuracy is 0.77 using 1374 features.
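To make the pipeline in the abstract concrete, the following is a minimal pure-Python sketch of Chi-Square feature selection followed by Multinomial Naïve Bayes. It is not the authors' implementation. The function names (`chi_square_scores`, `select_top_k`, `train_mnb`, `predict`), the six toy questions, the use of document-presence counts, the choice to take the maximum Chi-Square score over classes, and the add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def chi_square_scores(docs, labels):
    """Score each term by its largest 2x2 chi-square value over all classes.
    Counts are document-presence counts (an assumption of this sketch)."""
    n = len(docs)
    class_size = Counter(labels)
    df_by_class = defaultdict(Counter)  # term -> class -> docs containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df_by_class[term][label] += 1
    scores = {}
    for term, per_class in df_by_class.items():
        df = sum(per_class.values())
        best = 0.0
        for cls, size in class_size.items():
            a = per_class[cls]   # in class, contains term
            b = df - a           # other classes, contains term
            c = size - a         # in class, lacks term
            d = n - df - c       # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])

def train_mnb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected vocab."""
    class_size = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for cls, size in class_size.items():
        total = sum(term_counts[cls].values())
        log_prior = math.log(size / len(docs))
        log_like = {t: math.log((term_counts[cls][t] + 1) / (total + len(vocab)))
                    for t in vocab}
        model[cls] = (log_prior, log_like)
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    def score(cls):
        log_prior, log_like = model[cls]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)

# Toy, invented data: tokenized Indonesian questions (not the paper's dataset).
docs = [
    ["siapa", "penemu", "telepon"],
    ["siapa", "presiden", "indonesia"],
    ["kapan", "indonesia", "merdeka"],
    ["mengapa", "langit", "biru"],
    ["mengapa", "indonesia", "merdeka"],
    ["bagaimana", "cara", "membuat", "kue"],
]
labels = ["factoid"] * 3 + ["non-factoid"] * 3

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, k=2)
model = train_mnb(docs, labels, selected)
print(sorted(selected))
print(predict(model, ["siapa", "penemu", "radio"]))
```

On this toy data the two question words that occur in only one class receive the highest scores, so the classifier trained on the reduced vocabulary still separates the two example categories. The paper's reported reduction from 1374 to 248 features follows the same idea at a larger scale.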


