TL;DR
MARVIS is a system that converts data from many modalities into visualizations, enabling vision-language models (VLMs) to perform well across diverse domains without domain-specific training.
Contribution
It introduces a modality-adaptive reasoning approach that pairs visualizations with VLMs, achieving competitive results across multiple domains with a single model.
Findings
Outperforms Gemini 2.0 by 16% on average across domains.
Achieves competitive performance in vision, audio, biological, and tabular data.
Reduces the gap between generalist models and specialized domain methods.
Abstract
Predictive applications of machine learning often rely on small (sub-1B-parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance but lack flexibility. LLMs and VLMs offer versatility, but typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a system that transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret the visualizations and use them to make predictions. MARVIS achieves competitive performance across vision, audio, biological, and tabular domains using a single 3B-parameter model, yielding results that beat Gemini 2.0 by 16% on average. MARVIS drastically reduces the gap between…
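The core idea in the abstract, turning an embedding space into a picture a VLM can read, can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the authors' implementation: the abstract does not say which projection or plot style MARVIS uses, so a two-component PCA (computed by power iteration) stands in for the projection, an SVG scatter plot stands in for the visualization, and the names `top_component`, `project_2d`, and `render_svg` are hypothetical. The VLM call itself is left out.

```python
import math
import random

PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_component(rows, iters=200):
    """Approximate leading principal direction of centered rows (power iteration)."""
    dim = len(rows[0])
    rng = random.Random(0)
    v = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(dot(w, w)) or 1.0
        v = [x / norm for x in w]
    return v


def project_2d(embeddings):
    """Project embeddings onto two principal directions (assumed stand-in)."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    centered = [[e[j] - mean[j] for j in range(dim)] for e in embeddings]
    pc1 = top_component(centered)
    s1 = [dot(r, pc1) for r in centered]
    # Remove the first direction, then find the next one in the remainder.
    deflated = [[r[j] - s * pc1[j] for j in range(dim)]
                for r, s in zip(centered, s1)]
    pc2 = top_component(deflated)
    s2 = [dot(r, pc2) for r in deflated]
    return list(zip(s1, s2))


def render_svg(points, labels, size=256, margin=16):
    """Draw labeled points as colored dots; a label of None marks the query."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def scale(value, lo, hi):
        span = (hi - lo) or 1.0
        return margin + (value - lo) / span * (size - 2 * margin)

    classes = sorted({label for label in labels if label is not None})
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{size}" height="{size}">']
    for (x, y), label in zip(points, labels):
        cx = scale(x, min(xs), max(xs))
        cy = scale(y, min(ys), max(ys))
        if label is None:
            parts.append(f'<circle class="query" cx="{cx:.1f}" cy="{cy:.1f}" '
                         f'r="6" fill="none" stroke="black" stroke-width="2"/>')
        else:
            color = PALETTE[classes.index(label) % len(PALETTE)]
            parts.append(f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="3" '
                         f'fill="{color}"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Toy data: two well-separated clusters in a 5-D "embedding space".
rng = random.Random(1)
train = ([[rng.gauss(0.0, 0.1) for _ in range(5)] for _ in range(10)]
         + [[rng.gauss(3.0, 0.1) for _ in range(5)] for _ in range(10)])
train_labels = ["A"] * 10 + ["B"] * 10
query = [2.9] * 5  # unlabeled point to be classified

points = project_2d(train + [query])
svg = render_svg(points, train_labels + [None])
```

In a pipeline of this shape, the rendered image would be passed to a VLM along with a question such as which colored cluster the ringed point belongs to. Keeping the projection and rendering separate from the model call is what would let one model serve several modalities: only the embedding step changes per domain.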
