Natural Language Understanding with the Quora Question Pairs Dataset
Lakshay Sharma, Laura Graesser, Nikita Nangia, Utku Evci

TL;DR
This paper investigates duplicate question detection on the Quora Question Pairs dataset, finding that a simple Continuous Bag of Words (CBOW) neural network outperforms more complicated recurrent and attention-based models, and that the dataset's labels show some subjectivity.
Contribution
It demonstrates that a basic CBOW model can outperform recurrent and attention-based neural architectures on duplicate question detection.
Findings
CBOW model achieved best performance
Simple models can outperform complex ones
Subjectivity affects dataset labeling
Abstract
This paper explores the task of Natural Language Understanding (NLU) by looking at duplicate question detection in the Quora dataset. We conducted extensive exploration of the dataset and used various machine learning models, including linear and tree-based models. Our final finding was that a simple Continuous Bag of Words neural network model had the best performance, outdoing more complicated recurrent and attention-based models. We also conducted error analysis and found some subjectivity in the labeling of the dataset.
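To make the CBOW idea concrete, here is a minimal sketch of a bag-of-words pair scorer. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe their architecture, so the random embeddings, the feature combination (both vectors, their absolute difference, and their element-wise product), the single logistic output, and all function names are hypothetical choices made for this example.

```python
import math
import random


def tokenize(text):
    # Lowercase, drop question marks, split on whitespace (assumed preprocessing).
    return text.lower().replace("?", " ").split()


def build_embeddings(vocab, dim, seed=0):
    # Random stand-in embeddings; a real model would learn or load these.
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}


def cbow_encode(text, emb, dim):
    # CBOW sentence vector: the mean of the word vectors, ignoring word order.
    vecs = [emb[t] for t in tokenize(text) if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def pair_features(u, v):
    # Assumed pair representation: [u, v, |u - v|, u * v].
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


def duplicate_probability(q1, q2, emb, dim, weights, bias):
    # Logistic layer over the pair features; weights would come from training.
    feats = pair_features(cbow_encode(q1, emb, dim), cbow_encode(q2, emb, dim))
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Because the encoder averages word vectors, two questions containing the same words in a different order get the same representation. That simplicity is what distinguishes CBOW from the recurrent and attention-based models the abstract compares it against.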
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
