GPT-5 vs Other LLMs in Long Short-Context Performance

Nima Esmi (1, 2), Maryam Nezhad-Moghaddam (3), Fatemeh Borhani (3), Asadollah Shahbahrami (2, 3), Amin Daemdoost (3), Georgi Gaydadjiev (4)

(1) Bernoulli Institute, RUG, Groningen, Netherlands; (2) ISRC, Khazar University, Baku, Azerbaijan; (3) Department of Computer Engineering, University of Guilan, Rasht, Iran; (4) QCE Department, TU Delft, Delft, Netherlands

arXiv:2602.14188 · cs.CL · February 25, 2026

Open Access

TL;DR

This paper evaluates the long-context processing capabilities of GPT-5 and three other leading LLMs (Grok-4, GPT-4, and Gemini 2.5). Performance drops significantly at large input volumes, yet GPT-5 retains high precision, which highlights the gap between theoretical context capacity and practical use.

Contribution

The paper provides a comparative analysis of GPT-5 and other models on long-context tasks, showing that newer models have largely resolved the 'lost in the middle' problem even though performance still degrades at large input volumes.

Findings

01 Performance degrades significantly beyond 5K posts

02 GPT-5 maintains high precision (~95%) despite accuracy drop

03 The 'lost in the middle' problem is largely resolved in newer models
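
Finding 02 can look counterintuitive, so here is a minimal sketch of how precision can stay near 95% while accuracy falls. The confusion counts below are hypothetical and chosen only for illustration; they are not taken from the paper, and the function names are likewise assumptions of this example.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all posts that were labeled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of posts flagged as positive that really are positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts (NOT from the paper): a short input and a long input,
# each with 100 positive and 100 negative posts.
short_input = {"tp": 90, "fp": 5, "tn": 95, "fn": 10}
long_input = {"tp": 38, "fp": 2, "tn": 98, "fn": 62}

for name, c in (("short", short_input), ("long", long_input)):
    acc = accuracy(c["tp"], c["fp"], c["tn"], c["fn"])
    prec = precision(c["tp"], c["fp"])
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f}")
```

In this made-up scenario the model misses many more positive posts on the long input (false negatives rise), which pulls accuracy down, but the posts it does flag are still almost always correct, so precision barely moves.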

Abstract

With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models…

Taxonomy

Topics: Mental Health via Writing · Digital Mental Health Interventions · Sentiment Analysis and Opinion Mining