A MISMATCHED Benchmark for Scientific Natural Language Inference

Firoz Shaik; Mobashir Sadat; Nikita Gautam; Doina Caragea; Cornelia Caragea

arXiv:2506.04603·cs.CL·June 6, 2025

TL;DR

This paper introduces MISMATCHED, an evaluation benchmark for scientific NLI covering three non-CS domains. The best baseline reaches a Macro F1 of only 78.17%, leaving substantial headroom for future models.

Contribution

The paper presents MISMATCHED, a benchmark of 2,700 human-annotated sentence pairs for scientific NLI in psychology, engineering, and public health, together with baselines from both small and large language models.

Findings

01

The best-performing baseline achieves a Macro F1 of only 78.17% on MISMATCHED.

02

Including implicit NLI relations in training improves model performance.

03

Scientific NLI models still have significant room for improvement on non-CS domains.
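
Because the headline number above is a Macro F1, a minimal sketch of how that metric is computed may be useful. It is not the authors' evaluation code: `macro_f1` is a hypothetical helper, and the label names in the usage example are assumptions, since this page does not list the benchmark's classes.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is non-zero
        # because each label occurs at least once in gold or pred.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom)
    return sum(scores) / len(scores)


# Illustrative labels only (assumed, not taken from this page).
gold = ["entailment", "entailment", "contrasting", "neutral"]
pred = ["entailment", "contrasting", "contrasting", "neutral"]
print(round(macro_f1(gold, pred), 4))
```

Every class contributes equally to the average regardless of how many examples it has, so a model cannot score well by favoring the most frequent class.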

Abstract

Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best-performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an…

Code & Models

Repositories

fshaik8/mismatched (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Mental Health via Writing