Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models

Yichao Wu; Penghao Liang; Yafei Xiang; Mengwei Yuan; Jianan Liu; Jing Yang; Xianyou Li; Weiran Yan

arXiv:2603.00846 · cs.IR · March 3, 2026

Open Access

TL;DR

Tiny-Critic RAG introduces a cost-effective, low-latency approach for agentic fallback in retrieval-augmented generation by using a parameter-efficient small language model as a deterministic gatekeeper, reducing reliance on large models.

Contribution

The paper presents Tiny-Critic RAG, a novel framework that employs a small language model with LoRA for efficient evaluation and routing, replacing large models in agentic RAG systems.

Findings

01. Achieves routing accuracy comparable to GPT-4o-mini.
02. Reduces latency by an order of magnitude.
03. Demonstrates high cost-effectiveness for agent deployment.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) to mitigate factual hallucinations. Recent paradigms shift from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning or self-correction. However, current reflective RAG heavily relies on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes for billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, decoupling evaluation by deploying a parameter-efficient Small Language Model (SLM) via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper,…
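
The gatekeeper idea in the abstract can be illustrated with a minimal Python sketch. The abstract is truncated here and gives no implementation details, so everything below is an assumption made for illustration: the names `RouteDecision`, `route`, and `toy_critic` are hypothetical, and the keyword-overlap check is only a stand-in for the LoRA-adapted SLM, not the paper's model. The sketch shows only the control flow: a cheap binary critic runs first, and the expensive fallback path is taken only when the critic rejects the retrieved passages.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RouteDecision:
    passed: bool   # True if the critic accepted the retrieved passages
    action: str    # "generate" or "fallback"


def toy_critic(query: str, passages: List[str]) -> bool:
    """Hypothetical stand-in for the small critic model.

    A real system would call a fine-tuned SLM here; this toy version
    only checks whether any longer query term appears in a passage.
    """
    terms = {w.strip("?.,!").lower() for w in query.split() if len(w) > 3}
    return any(t in p.lower() for p in passages for t in terms)


def route(
    query: str,
    passages: List[str],
    critic: Callable[[str, List[str]], bool],
) -> RouteDecision:
    """Binary routing: generate directly, or trigger the fallback path."""
    ok = critic(query, passages)
    return RouteDecision(passed=ok, action="generate" if ok else "fallback")


if __name__ == "__main__":
    query = "What is LoRA rank?"
    print(route(query, ["LoRA adds low-rank matrices"], toy_critic).action)
    print(route(query, ["Bananas are yellow"], toy_critic).action)
```

Because the critic is passed in as a plain callable, the toy heuristic could be swapped for an actual model call without changing the routing logic.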

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis