MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal

TL;DR
MMFactory is a versatile framework that searches for optimal vision-language solutions by combining models, considering user constraints, and generating diverse, executable programs tailored to specific visual tasks.
Contribution
It introduces a universal solution search engine for vision-language tasks that incorporates model routing, resource constraints, and multi-agent solution synthesis, outperforming existing methods.
Findings
Delivers state-of-the-art solutions that outperform existing methods.
Provides diverse, resource-aware programmatic solutions.
Enables user-specific customization for visual tasks.
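The resource-aware selection mentioned in the findings can be illustrated with a small sketch. This is not the paper's implementation: the `Candidate` type, the `select_solution` function, the solution names, and every accuracy and cost figure are hypothetical placeholders, not results from the paper. The sketch assumes each candidate solution comes with an estimated accuracy and compute cost, and picks the most accurate one that fits a user-supplied budget.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate solution with estimated metrics."""
    name: str
    accuracy: float  # estimated task accuracy in [0, 1] (made-up values below)
    cost: float      # estimated compute cost in arbitrary units (made-up values below)


def select_solution(candidates: Sequence[Candidate], max_cost: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose cost fits the budget.

    Ties on accuracy are broken in favour of the cheaper candidate.
    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c.cost <= max_cost]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.accuracy, -c.cost))


# Illustrative pool only; names and numbers are assumptions, not measurements.
candidates = [
    Candidate("small_vlm_only", accuracy=0.60, cost=1.0),
    Candidate("detector_plus_vlm", accuracy=0.70, cost=3.0),
    Candidate("large_mllm_pipeline", accuracy=0.80, cost=10.0),
]

choice = select_solution(candidates, max_cost=5.0)
print(choice.name if choice else "no feasible solution")
```

A stricter budget changes the choice or yields no feasible solution at all, which is the kind of trade-off a user constraint is meant to expose.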
Abstract
With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance or computational needs), produce test-time, sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics…
Taxonomy
Topics: Semantic Web and Ontologies · Constraint Satisfaction and Optimization · Natural Language Processing Techniques
