FUSE : Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation

Tushar Vatsa; Vibha Belavadi; Priya Shanmugasundaram; Suhas Suresha; Dewang Sultania

arXiv:2601.02365·cs.IR·January 7, 2026

TL;DR

FUSE is a failure-aware multimodal search and recommendation system for creative assistants. It replaces most raw-image prompting with a compact JSON context representation and evaluates retrieval systematically, improving intent accuracy and recall.

Contribution

The paper proposes FUSE, a framework that replaces most raw-image prompting with a compact JSON representation of the canvas (the Grounded Design Representation, GDR) and implements seven context budgeting strategies to improve multimodal retrieval performance.

Findings

01 Context compression achieves 93.3% intent accuracy

02 Recall rate of 99.4% across queries

03 Outperforms comprehensive and minimal contextualization strategies

Abstract

Multimodal creative assistants decompose user goals and route tasks to subagents for layout, styling, retrieval, and generation. Retrieval quality is pivotal, yet failures can arise at several stages: understanding user intent, choosing content types, finding candidates (recall), or ranking results. Meanwhile, sending and processing images is costly, making naive multimodal approaches impractical. We present FUSE: Failure-aware Usage of Subagent Evidence for MultiModal Search and Recommendation. FUSE replaces most raw-image prompting with a compact Grounded Design Representation (GDR): a selection-aware JSON of canvas elements (image, text, shape, icon, video, logo), structure, styles, salient colors, and user selection, provided by the Planner team. FUSE implements seven context budgeting strategies: comprehensive baseline prompting, context compression, chain-of-thought reasoning,…
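
To make the idea of a selection-aware JSON representation concrete, the following is a minimal sketch of what a GDR-like payload might look like. The abstract names the kinds of information involved (element types, structure, styles, salient colors, user selection) but does not give a schema, so every field name, the `build_gdr` and `to_prompt_context` helpers, and the sample canvas are assumptions made for illustration only, not the authors' format.

```python
import json

# Element types listed in the abstract.
ALLOWED_TYPES = {"image", "text", "shape", "icon", "video", "logo"}


def build_gdr(elements, selected_ids, salient_colors):
    """Assemble a compact, selection-aware canvas description.

    Hypothetical schema: the field names below are illustrative assumptions.
    """
    items = []
    for el in elements:
        if el["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unsupported element type: {el['type']}")
        items.append({
            "id": el["id"],
            "type": el["type"],
            "style": el.get("style", {}),
            # Mark whether the user currently has this element selected.
            "selected": el["id"] in selected_ids,
        })
    return {
        "elements": items,
        # Structure is reduced here to the order of elements on the canvas.
        "structure": {"order": [el["id"] for el in elements]},
        "salient_colors": list(salient_colors),
        "selection": sorted(selected_ids),
    }


def to_prompt_context(gdr):
    # Compact separators keep the serialized context small.
    return json.dumps(gdr, separators=(",", ":"), sort_keys=True)


canvas = [
    {"id": "e1", "type": "image", "style": {"width": 800, "height": 600}},
    {"id": "e2", "type": "text", "style": {"font": "serif", "size": 32}},
]
gdr = build_gdr(canvas, selected_ids={"e2"}, salient_colors=["#1A73E8", "#FFFFFF"])
context = to_prompt_context(gdr)
print(context)
```

In a sketch like this, the serialized string would stand in for the raw image in a text prompt, which is the cost saving the abstract describes; how the real GDR encodes structure and styles may differ.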

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques