Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems

Hengfan Zhang; Yueqian Lin; Hai Helen Li; Yiran Chen

arXiv:2601.15676·cs.SD·January 23, 2026

TL;DR

This paper introduces CoFi-Agent, a hybrid edge-cloud system that pairs a fast local perception pass with cloud-guided refinement triggered only for uncertain cases, improving audio reasoning accuracy while limiting the latency, bandwidth cost, and privacy risk of indiscriminate cloud offloading.

Contribution

It presents a novel lightweight hybrid architecture that performs local perception and selectively triggers cloud-based forensic refinement for edge audio systems.

Findings

01 Improves accuracy on the MMAR benchmark from 27.20% to 53.60%.

02 Achieves a better accuracy-efficiency trade-off than always-on investigation pipelines.

03 Demonstrates effective edge-cloud collaboration under practical system constraints.

Abstract

Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception (generic summaries that miss the subtle evidence required for multi-step audio reasoning), while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single pass on a local 7B Audio-LLM; a cloud controller then gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from…
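The control flow described in the abstract (one local pass, a gate on difficult cases, then a short plan of on-device tools) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `CoarseResult`, `coarse_to_fine`, `gate`, `plan_tools`, and `refine`, the `confidence` field, and the final refinement step are all assumptions made for illustration, and every model or tool is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CoarseResult:
    """Output of the single local pass (hypothetical structure)."""
    answer: str
    confidence: float  # assumed uncertainty proxy in [0, 1]; not from the paper


def coarse_to_fine(
    audio: str,
    question: str,
    local_model: Callable[[str, str], CoarseResult],
    gate: Callable[[str, CoarseResult], bool],
    plan_tools: Callable[[str, CoarseResult], List[str]],
    tools: Dict[str, Callable[[str], str]],
    refine: Callable[[str, CoarseResult, List[str]], str],
) -> str:
    # Coarse stage: one pass of the local model over the audio.
    coarse = local_model(audio, question)

    # Gate: the controller decides whether this case needs refinement.
    if not gate(question, coarse):
        return coarse.answer

    # Fine stage: the controller issues a short plan naming on-device tools
    # (for example re-listening to a segment, or local ASR).
    plan = plan_tools(question, coarse)

    # Tools run locally on the audio; unknown tool names are skipped.
    evidence = [tools[name](audio) for name in plan if name in tools]

    # Assumed final step: combine the coarse answer with the new evidence.
    return refine(question, coarse, evidence)
```

In this sketch only the question and the coarse result reach `gate` and `plan_tools`, while the tools themselves operate on the audio locally. That split is one plausible reading of how such a design could limit bandwidth and privacy exposure; the abstract does not spell out what data is sent to the cloud.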

Taxonomy

Topics: Music and Audio Processing · Digital Media Forensic Detection · Speech Recognition and Synthesis