MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?

Kai Yan; Zhan Ling; Kang Liu; Yifan Yang; Ting-Han Fan; Lingfeng Shen; Zhengyin Du; Jiecao Chen

arXiv:2502.09933 · cs.AI · October 24, 2025

TL;DR

MIR-Bench is a new benchmark designed to evaluate large language models' ability to perform complex, many-shot in-context reasoning across diverse data formats, revealing insights into their reasoning capabilities and limitations.

Contribution

This paper introduces MIR-Bench, the first benchmark for many-shot in-context reasoning in pattern recognition, addressing a gap in existing evaluations for long-context, complex reasoning tasks.

Findings

01. Scaling effect observed in model performance
02. Robustness of models varies with task complexity
03. Inductive vs. transductive reasoning shows different strengths

Abstract

The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure this ability in Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and do not evaluate how well models aggregate many pieces of information from long contexts. Meanwhile, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from…
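To make the many-shot setting in the abstract concrete, here is a minimal sketch of how a many-shot pattern-recognition prompt could be assembled from input-output examples of a hidden function. Everything in it is an illustrative assumption: the `hidden_rule` function, the `build_many_shot_prompt` helper, the instruction wording, and the shot count are hypothetical and are not taken from MIR-Bench's actual prompt format or data.

```python
from typing import Callable, List


def hidden_rule(xs: List[int]) -> int:
    """Hypothetical target function the model must infer: sum of the even entries."""
    return sum(x for x in xs if x % 2 == 0)


def build_many_shot_prompt(
    rule: Callable[[List[int]], int],
    shots: List[List[int]],
    query: List[int],
) -> str:
    """Format every (input, output) pair as a shot, then append the unanswered query."""
    blocks = [
        "Infer the rule from the examples, then give the output for the final input."
    ]
    for xs in shots:
        # Each shot exposes one input together with the rule's output for it.
        blocks.append(f"Input: {xs}\nOutput: {rule(xs)}")
    # The query is left without an output; the model is expected to complete it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# 200 shots: well beyond the few-shot (<10) regime described in the abstract.
shots = [[i, i + 1, i + 2] for i in range(200)]
prompt = build_many_shot_prompt(hidden_rule, shots, [7, 8, 10])
```

Varying the length of `shots` from a handful to hundreds or thousands is the kind of knob a many-shot evaluation would turn; the resulting `prompt` string would then be sent to whichever model is being tested.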

Code & Models

Datasets

kaiyan289/MIR-Bench
dataset · 397 downloads

Videos

MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? · SlidesLive

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Focus