TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

Mohd Ariful Haque (1); Fahad Rahman (2); Kishor Datta Gupta (1); Khalil Shujaee (1); Roy George (1) ((1) Clark Atlanta University; (2) United International University)

arXiv:2511.22138·cs.LG·December 1, 2025

Open Access

TL;DR

This paper evaluates and optimizes small language models for agentic tasks on edge devices, demonstrating that hybrid optimization strategies significantly improve their accuracy and stability for autonomous, privacy-preserving AI applications.

Contribution

It introduces a comprehensive evaluation framework and hybrid optimization methods for small language models, enabling effective agentic tasks on edge devices.

Findings

1. Medium-sized models outperform ultra-compact models in accuracy.
2. Hybrid optimization achieves up to 65.74% overall accuracy.
3. Small models can be effective for autonomous AI on edge devices.

Abstract

This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling), with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), in both live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED),…
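
For readers unfamiliar with the preference-alignment step named in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's training pipeline: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are hypothetical, chosen for illustration.
    Inputs are sequence log-probabilities (sums over response tokens).
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors "chosen".
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives the neutral loss log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# A policy that raises the chosen response and lowers the rejected one
# relative to the reference is rewarded with a smaller loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a function-calling setting, the chosen response would typically be a correct tool call and the rejected one an incorrect call for the same prompt; how the paper builds such pairs from AgentBank data is not shown in the truncated abstract and is not assumed here.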

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning and Data Classification · Big Data and Digital Economy