Improving the OOD Performance of Closed-Source LLMs on NLI Through Strategic Data Selection

Joe Stacey; Lisa Alazraki; Aran Ubhi; Beyza Ermis; Aaron Mueller; Marek Rei

arXiv:2505.20209·cs.CL·January 21, 2026

TL;DR

This paper explores how strategic data selection, including prioritizing complex examples and generating synthetic data, can enhance the out-of-distribution robustness of fine-tuned large language models on NLI tasks, especially for closed-source models.

Contribution

It introduces a data selection strategy that improves OOD robustness of LLMs on NLI without altering the fine-tuning process or requiring large-scale data augmentation.

Findings

01. Prioritizing complex examples improves performance on challenging OOD datasets.

02. Synthetic data generation enhances robustness on easier OOD datasets.

03. Autoregressive LLMs are more robust to distributional shifts than encoder models.

Abstract

We investigate the robustness of fine-tuned Large Language Models (LLMs) for the task of Natural Language Inference (NLI), finding that the in-distribution gains from fine-tuning correspond to a large drop in out-of-distribution (OOD) performance. Despite the widespread use of closed-source LLMs, there are no robustness mitigation methods that work under their API fine-tuning constraints. Existing methods to improve robustness typically require changing the fine-tuning process or large-scale data augmentation, methods that are infeasible or cost prohibitive for closed-source models. To address this, we propose strategically selecting the NLI fine-tuning data, prioritising more complex examples or replacing existing training examples with LLM-generated data. Prioritising more complex training examples improves performance on challenging OOD NLI datasets, while training with synthetic…
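The selection idea in the abstract (keeping the fine-tuning procedure fixed and changing only which examples are sent to it) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not say how complexity is measured, so `complexity_proxy` uses a simple combined token count as a stand-in, and `NLIExample`, `select_complex_examples`, and `budget` are hypothetical names rather than anything from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


def complexity_proxy(example: NLIExample) -> int:
    """Hypothetical complexity score: combined whitespace-token count.

    ASSUMPTION: this is an illustrative stand-in, not the paper's measure.
    """
    return len(example.premise.split()) + len(example.hypothesis.split())


def select_complex_examples(pool, budget, score_fn=complexity_proxy):
    """Return the `budget` highest-scoring examples from `pool`."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # sorted() is stable, so tied examples keep their original pool order.
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:budget]


pool = [
    NLIExample("A man sleeps.", "A man rests.", "entailment"),
    NLIExample(
        "Two children are playing football in a park while it rains.",
        "Nobody is outside.",
        "contradiction",
    ),
    NLIExample("A dog runs.", "A cat runs.", "neutral"),
]

# Keep only the single most complex example under the proxy score.
subset = select_complex_examples(pool, budget=1)
print(subset[0].hypothesis)
```

Because the scoring function is passed in as `score_fn`, a different complexity measure could be substituted without touching the selection logic, and the selected subset could then be submitted to a fine-tuning API unchanged.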
