Pointwise Paraphrase Appraisal is Potentially Problematic
Hannah Chen, Yangfeng Ji, David Evans

TL;DR
This paper critiques the standard pointwise evaluation of paraphrase identification models. It shows that models scoring well under that evaluation may fail on simple tasks and produce counterintuitive results, which motivates the search for better evaluation strategies.
Contribution
The paper demonstrates limitations of pointwise evaluation for paraphrase identification models and argues for alternative evaluation methods.
Findings
A BERT model fine-tuned in the standard way, with state-of-the-art pointwise performance, may perform poorly at identifying pairs of two identical sentences.
Models may assign higher scores to random sentence pairs than identical ones.
Pointwise evaluation may not reflect real-world paraphrase detection performance.
Abstract
The prevailing approach for training and evaluating paraphrase identification models is constructed as a binary classification problem: the model is given a pair of sentences, and is judged by how accurately it classifies pairs as either paraphrases or non-paraphrases. This pointwise evaluation method does not match the objective of most real-world applications well, so the goal of our work is to understand how models that perform well under pointwise evaluation may fail in practice and to find better methods for evaluating paraphrase identification models. As a first step towards that goal, we show that although the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance, that model may perform poorly on simple tasks like identifying pairs with two identical sentences. Moreover, we…
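The difference between pointwise accuracy and the identical-pair check described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_score` is a hypothetical token-overlap scorer standing in for a real model (it is not the paper's fine-tuned BERT), and the 0.5 threshold and the example sentences are invented.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Scorer = Callable[[str, str], float]


def pointwise_accuracy(score: Scorer, pairs: List[Pair], labels: List[int],
                       threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded score matches the gold label."""
    correct = sum(
        int(score(a, b) >= threshold) == y
        for (a, b), y in zip(pairs, labels)
    )
    return correct / len(pairs)


def identical_pair_rate(score: Scorer, sentences: List[str],
                        threshold: float = 0.5) -> float:
    """Fraction of (s, s) pairs that the scorer accepts as paraphrases."""
    accepted = sum(score(s, s) >= threshold for s in sentences)
    return accepted / len(sentences)


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Invented example data: one paraphrase pair (label 1), one unrelated pair (label 0).
pairs = [
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("he bought a new car", "the weather is cold today"),
]
labels = [1, 0]

print("pointwise accuracy:", pointwise_accuracy(toy_score, pairs, labels))
print("identical-pair rate:",
      identical_pair_rate(toy_score, [a for a, _ in pairs]))
```

The two numbers are computed independently, so a scorer can do well on the first and still do badly on the second. To probe a real model this way, one would wrap it as a `score(a, b)` function returning a paraphrase probability and pass it to both helpers.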
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Linear Layer · Weight Decay · Softmax · Adam · Multi-Head Attention · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections
