TL;DR
FusionAgent is a dynamic, sample-specific model selection framework that uses a multimodal large language model trained with reinforcement fine-tuning to improve the accuracy and efficiency of whole-body human recognition.
Contribution
The paper presents an agentic framework that adaptively selects which expert models to invoke for each sample, addressing the limitations of static score-fusion and improving recognition performance.
Findings
Outperforms state-of-the-art methods on biometric benchmarks.
Achieves higher efficiency with fewer model invocations.
Demonstrates robustness and explainability in model fusion.
Abstract
Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the…
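The contrast between static score-fusion and sample-specific selection can be illustrated with a minimal sketch. This is not the paper's implementation: the expert names, the `quality` field, the threshold rule in `toy_selector` (a stand-in for the MLLM agent), and mean fusion are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical expert "tools": each maps a sample to a match score in [0, 1].
Expert = Callable[[dict], float]

# Assumed toy experts that simply read a precomputed score from the sample.
EXPERTS: Dict[str, Expert] = {
    "face": lambda s: s["scores"]["face"],
    "gait": lambda s: s["scores"]["gait"],
    "body": lambda s: s["scores"]["body"],
}


def fuse_selected(sample: dict, experts: Dict[str, Expert], selected: List[str]) -> float:
    """Invoke only the selected experts and average their scores (mean fusion)."""
    if not selected:
        raise ValueError("at least one expert must be selected")
    scores = [experts[name](sample) for name in selected]
    return sum(scores) / len(scores)


def static_fusion(sample: dict, experts: Dict[str, Expert]) -> float:
    """Static baseline: every expert is invoked for every sample."""
    return fuse_selected(sample, experts, list(experts))


def toy_selector(sample: dict, threshold: float = 0.5) -> List[str]:
    """Stand-in for the agent: keep modalities whose assumed quality passes a threshold."""
    quality = sample["quality"]
    chosen = [name for name, q in quality.items() if q >= threshold]
    # Fall back to the single most reliable modality if none passes.
    return chosen or [max(quality, key=quality.get)]


# A sample with a poor face crop but usable gait and body cues.
sample = {
    "quality": {"face": 0.1, "gait": 0.9, "body": 0.8},
    "scores": {"face": 0.2, "gait": 0.9, "body": 0.7},
}

selected = toy_selector(sample)
dynamic_score = fuse_selected(sample, EXPERTS, selected)
static_score = static_fusion(sample, EXPERTS)
print(selected, dynamic_score, static_score)
```

In this toy case the selector skips the unreliable face expert, so only two of the three models are invoked. In the framework described in the abstract, the selection is learned by the agent through RFT with a metric-based reward rather than set by a hand-written threshold.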
