LongFuncEval: Measuring the effectiveness of long context models for function calling

Kiran Kate; Tejaswini Pedapati; Kinjal Basu; Yara Rizk; Vijil Chenthamarakshan; Subhajit Chaudhury; Mayank Agarwal; Ibrahim Abdelaziz

arXiv:2505.10570·cs.SE·May 19, 2025

Open Access

TL;DR

This paper evaluates how well large language models call external tools in long-context conversations. It finds significant performance drops as context length and complexity increase, and highlights the need for further improvement.

Contribution

This is the first comprehensive study of long-context understanding in LLMs specifically for tool calling. It includes new benchmarks and an analysis of performance degradation.

Findings

01. Performance drops by up to 85% as the number of tools increases

02. Answer retrieval degrades by up to 91% as tool responses get longer

03. Multi-turn conversation accuracy decreases by up to 40%

Abstract

Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another interesting problem: LLMs' abilities to accurately perform function calls in long context settings. Particularly, when calling tools, LLMs are encumbered by three predominant challenges: (1) a large catalog of tools, (2) long responses from the tool APIs, and (3) long multi-turn conversations. These challenges are particularly relevant to enterprise applications of LLMs which engage in multi-turn conversations with users to complete complex tasks that require a large catalog of complex tools. The literature contains multiple investigations of long context challenges such as lost in the middle or needle in the haystack for natural language tasks. In this…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Personal Information Management and User Behavior