FOCAL: A Novel Benchmarking Technique for Multi-modal Agents
Anupam Purwar, Aditya Choudhary

TL;DR
FOCAL is a benchmarking framework for multi-modal voice and text agents. It evaluates end-to-end reasoning, error propagation, and conversation quality, and introduces two new metrics, the Reasoning and Semantic scores, for assessing agent efficacy.
Contribution
The paper introduces FOCAL, a benchmarking framework with two new metrics, the Reasoning and Semantic scores, for analyzing reasoning and semantic quality in multi-modal agents.
Findings
Effective end-to-end reasoning evaluation
Component-wise error analysis capabilities
Novel Reasoning and Semantic scores for conversation quality
Abstract
With recent advances in reasoning capabilities, tool calling via MCP servers, and Audio Language Models (ALMs), the development and integration of multi-modal agents (with voice and text support) has come to the forefront of the industry. Cascading pipelines for voice agents still play a central role in the industry owing to their superior reasoning capabilities facilitated by LLMs. However, cascading pipelines often propagate errors from one stage to the next. We propose a framework, FOCAL, to benchmark end-to-end reasoning, component-wise error propagation, and error analysis for automated as well as human-assisted testing of multi-modal agents (voice-to-voice plus text input). We also share two novel metrics, the Reasoning and Semantic scores, to evaluate the efficacy of the agent in having meaningful conversations in voice mode.
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis
