JUCS - Journal of Universal Computer Science 26(6): 671-697, doi: 10.3897/jucs.2020.036

Question to Question Similarity Analysis Using Morphological, Syntactic, Semantic, and Lexical Features

Mahmoud M. Hammad^‡, Mohammad Al-Smadi^‡, Qanita Bani Baker^‡, Muntaha Al-asa D^‡, Nour Al-Khdour^‡, Mutaz Bni Younes^‡, Enas Khwaileh^§

‡ Jordan University of Science and Technology, Irbid, Jordan
§ Jordan University of Science and Technology, Irbid

Corresponding author: Mahmoud Hammad ( m-hammad@just.edu.jo )

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY-ND 4.0). This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.

Citation: Hammad MM, Al-Smadi M, Baker QB, D MA-a, Al-Khdour N, Younes MB, Khwaileh E (2020) Question to Question Similarity Analysis Using Morphological, Syntactic, Semantic, and Lexical Features. JUCS - Journal of Universal Computer Science 26(6): 671-697. https://doi.org/10.3897/jucs.2020.036

Abstract

In today's digitally connected world, people expect immediate answers to their questions. This expectation has increased the burden on question-answering platforms such as Stack Overflow. A promising solution is to detect whether a newly asked question is similar to a question already in the database and, if so, to present the answer to the matched question to the user. To address this challenge, we propose a novel Natural Language Processing (NLP) approach that detects whether two Arabic questions are similar using their extracted morphological, syntactic, semantic, lexical, overlapping, and semantic lexical features. Our approach involves several phases, including Arabic text processing, novel feature extraction, and text classification. Moreover, we compared seven machine learning classifiers: Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), Extreme Gradient Boosting (XGB), Random Forest (RF), Adaptive Boosting (AdaBoost), and Multilayer Perceptron (MLP). In our experiments, we used a real-world dataset of 19,136 questions (9,568 question pairs), on which our approach achieved 82.93% accuracy using our XGB model with the best features selected by the Random Forest feature selection technique. This accuracy shows the ability of our approach to correctly detect similar Arabic questions and hence to increase user satisfaction.

Keywords

Arabic language, NLP, Semantic Text Similarity (STS), machine learning, text classification, lexical features, XGB, LDA