LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Shawn Gavin; Tuney Zheng; Jiaheng Liu; Quehry Que; Noah Wang; Jian Yang; Chenchen Zhang; Wenhao Huang; Ge Zhang

arXiv:2406.17588·cs.CL·August 14, 2025

TL;DR

The paper introduces LongIns, a new benchmark for evaluating long-context reasoning in LLMs, revealing limitations in current models' performance with extended contexts and multi-hop tasks.

Contribution

It presents the LongIns benchmark with three evaluation settings, highlighting the gap between claimed and actual context handling abilities of LLMs.

Findings

01. Even GPT-4, with a 128k context window, performs poorly on tasks at a 16k context length.

02. Most LLMs struggle with multi-hop reasoning even at short context lengths (under 4k).

Abstract

The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, so they only partially represent how well LLMs reason over large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built on existing instruction datasets. Specifically, in our LongIns, we introduce three evaluation settings: Global…

Taxonomy

Topics: Higher Education Learning Practices · Information Systems Education and Curriculum Development · Open Education and E-Learning

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer