Natural Language Understanding with the Quora Question Pairs Dataset

Lakshay Sharma; Laura Graesser; Nikita Nangia; Utku Evci

arXiv:1907.01041·cs.CL·July 3, 2019·57 cites

TL;DR

This paper investigates duplicate question detection in the Quora dataset, finding that a simple CBOW neural network outperforms complex models, and highlights subjectivity issues in dataset labeling.

Contribution

It demonstrates that a basic CBOW model can outperform advanced neural architectures in duplicate question detection tasks.

Findings

01

CBOW model achieved best performance

02

Simple models can outperform complex ones

03

Subjectivity affects dataset labeling

Abstract

This paper explores the task of Natural Language Understanding (NLU) through duplicate question detection on the Quora dataset. We explored the dataset extensively and evaluated a range of machine learning models, including linear and tree-based models. Our main finding is that a simple Continuous Bag of Words (CBOW) neural network performed best, outperforming more complicated recurrent and attention-based models. We also conducted an error analysis and found some subjectivity in the labeling of the dataset.
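
To make the CBOW idea concrete, the following is a minimal sketch of a bag-of-words pair classifier's forward pass. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the architecture, so the random embeddings, the feature combination (both vectors, their absolute difference, and their element-wise product), the single logistic output, and all function names here are hypothetical choices.

```python
import math
import random


def make_embeddings(vocab, dim, seed=0):
    """Assign each word a random vector (stand-in for learned or pretrained embeddings)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}


def cbow_encode(question, embeddings, dim):
    """Encode a question as the average of its word vectors; word order is ignored."""
    tokens = question.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector.
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def pair_features(u, v):
    """Combine two question vectors into one feature vector (assumed combination)."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


def duplicate_probability(features, weights, bias):
    """Logistic output: estimated probability that the pair is a duplicate."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    dim = 8
    q1 = "how do i learn python"
    q2 = "how do i learn python quickly"
    vocab = sorted(set(q1.split()) | set(q2.split()))
    emb = make_embeddings(vocab, dim)
    feats = pair_features(cbow_encode(q1, emb, dim), cbow_encode(q2, emb, dim))
    # Untrained (all-zero) weights; a real model would learn these from labeled pairs.
    print(duplicate_probability(feats, [0.0] * len(feats), 0.0))
```

Because the encoder only averages word vectors, two questions containing the same words in a different order receive identical encodings. That is the sense in which this kind of model is simpler than recurrent or attention-based ones.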

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems