Efficient Detection of Duplicate Question Pairs Using Machine Learning and NLP Techniques
Ms. Vaishali Bajpai
Abstract
This paper presents the development and implementation of a system for detecting duplicate question pairs using Machine Learning and Natural Language Processing (NLP) techniques. With the proliferation of forums and Q&A sites, efficient detection of duplicate questions is crucial to the quality and usability of such platforms. The goal of the paper is to devise a model that correctly identifies whether two input questions are semantically equivalent. Several NLP techniques are applied to preprocess the text data, including tokenization, stemming, and lemmatization, followed by vectorization using methods such as TF-IDF. Beyond basic text preprocessing, more advanced features are extracted, including n-grams, cosine similarity, and extracted keywords. The feature set is further enriched with similarity ratios for each question pair, computed using the FuzzyWuzzy library. Models are then built with Logistic Regression, Support Vector Machines, Random Forest, and Gradient Boosting, and the paper compares them in detail to identify the best performer. The evaluation metrics are accuracy, precision, recall, and F1-score. Hyperparameter tuning and cross-validation are used throughout to optimize model performance.
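The pairwise features named in the abstract (TF-IDF cosine similarity, n-gram overlap, and a fuzzy similarity ratio) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the smoothed IDF formula, the bigram Jaccard overlap, and the example questions are all assumptions made for illustration, and the standard library's `difflib.SequenceMatcher` is used as a stand-in for the FuzzyWuzzy ratio so that the block runs without third-party packages.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(tokens, n):
    """Return the set of contiguous word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fit_idf(tokenized_docs):
    """Compute smoothed inverse document frequencies (assumed formula)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen when fitting are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term]
            for term, count in counts.items() if term in idf}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuzzy_ratio(a, b):
    """0-100 character-level similarity; a difflib stand-in for FuzzyWuzzy."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def pair_features(q1, q2, idf):
    """Compute illustrative similarity features for one question pair."""
    t1, t2 = tokenize(q1), tokenize(q2)
    b1, b2 = word_ngrams(t1, 2), word_ngrams(t2, 2)
    union = b1 | b2
    return {
        "tfidf_cosine": cosine_similarity(tfidf_vector(t1, idf),
                                          tfidf_vector(t2, idf)),
        "bigram_jaccard": len(b1 & b2) / len(union) if union else 0.0,
        "fuzzy_ratio": fuzzy_ratio(q1, q2),
    }


# Hypothetical example questions, not taken from the paper's dataset.
questions = [
    "How can I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "What is the capital of France?",
]
idf = fit_idf([tokenize(q) for q in questions])
print(pair_features(questions[0], questions[1], idf))
print(pair_features(questions[0], questions[2], idf))
```

In a pipeline like the one the abstract describes, feature dictionaries of this kind would be computed for every labeled pair and passed to a classifier such as Logistic Regression or Gradient Boosting. Stemming, lemmatization, and keyword extraction are omitted here for brevity.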
Copyright
Copyright © 2024 Ms. Vaishali Bajpai. This is an open access article distributed under the Creative Commons Attribution License.