ODIA: Oriented Distillation for Inline Acceleration of LLM-based Function Calling
Hanlong Zhang, Jingsheng Yang, Hao Li, Yuhao He, Franck Gong

TL;DR
ODIA introduces an online distillation technique that leverages user interaction data to significantly reduce latency in LLM-based function calling, enabling faster responses with minimal accuracy loss in real-world applications.
Contribution
The paper proposes a novel online distillation method that automatically identifies simple queries and distills knowledge from larger models to smaller ones for inline acceleration.
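The inline acceleration idea can be pictured as a router placed in front of the two models. The sketch below is a minimal illustration, not the authors' implementation: the `route` function, the `is_simple` predicate, the `FunctionCall` type, and the toy models are all hypothetical names, and this summary does not say how ODIA actually identifies simple queries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class FunctionCall:
    """A function name plus its arguments, as a model would emit it."""
    name: str
    arguments: Dict[str, str]


# Hypothetical model interfaces: each maps a user query to a function call.
# The small model may return None to signal that it declines to answer.
SmallModel = Callable[[str], Optional[FunctionCall]]
LargeModel = Callable[[str], FunctionCall]


def route(
    query: str,
    is_simple: Callable[[str], bool],
    small: SmallModel,
    large: LargeModel,
) -> Tuple[FunctionCall, str]:
    """Try the small model on simple queries; otherwise use the large model."""
    if is_simple(query):
        call = small(query)
        if call is not None:
            return call, "small"
    # Fallback path: non-simple queries, or the small model declined.
    return large(query), "large"


# Toy stand-ins for illustration only (assumptions, not from the paper).
def toy_is_simple(query: str) -> bool:
    return query.lower().startswith("play ")


def toy_small(query: str) -> Optional[FunctionCall]:
    return FunctionCall("play_song", {"title": query[5:]})


def toy_large(query: str) -> FunctionCall:
    return FunctionCall("search", {"query": query})
```

Calling `route("play Yesterday", toy_is_simple, toy_small, toy_large)` takes the small-model path, while a longer free-form request falls through to the large model. Keeping the fallback inside the router means that a misjudged "simple" query costs latency, not correctness.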
Findings
Reduces response latency by 45% in expectation and 78% at the median.
Handles 60% of traffic with a smaller model in a real-world music app.
Requires minimal human intervention and improves through automated updates.
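A median reduction can exceed the mean (expected) reduction whenever more than half of the traffic takes the fast path, because the median request is then a fast one while the mean still includes every slow request. The sketch below illustrates this with made-up latencies of 1000 ms for the large model and 220 ms for the small one. These values are assumptions chosen for illustration, not measurements from the paper; only the 60% traffic share comes from the findings above.

```python
from statistics import mean, median

# Hypothetical per-request latencies in milliseconds (not from the paper).
LARGE_MS = 1000
SMALL_MS = 220

# Ten requests; 6 of them (60%) are served by the small model after routing.
before = [LARGE_MS] * 10
after = [SMALL_MS] * 6 + [LARGE_MS] * 4


def reduction(before_ms, after_ms, stat):
    """Relative latency reduction under the given summary statistic."""
    return 1 - stat(after_ms) / stat(before_ms)


mean_reduction = reduction(before, after, mean)
median_reduction = reduction(before, after, median)
```

With these numbers the mean falls by about 47% while the median falls by 78%, since the median request now runs on the small model.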
Abstract
Function Calling is a crucial technique that enables Large Language Models (LLMs) to interact with external systems through APIs. However, the high latency associated with LLM-based Function Calling significantly impacts user experience. This paper presents a novel approach called Oriented Distillation for Inline Acceleration (ODIA) that leverages online user interaction data to accelerate Function Calling. By automatically identifying "simple queries" from production traffic and distilling knowledge from larger models to smaller ones, our method reduces response latency by 45% (expected) and 78% (median) while maintaining accuracy. We demonstrate the effectiveness of our approach through real-world deployment in a music application, where the smaller model successfully handles 60% of traffic with negligible accuracy loss. Our method requires minimal human intervention and continuously…
Taxonomy
Topics: Target Tracking and Data Fusion in Sensor Networks · Magnetic confinement fusion research · Speech Recognition and Synthesis
