Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Franz Louis Cesista

arXiv:2406.11403·cs.CV·February 6, 2025

TL;DR

This paper introduces Multimodal Structured Generation, a framework that enforces structured outputs in multimodal foundation models, improving performance and interpretability without extensive fine-tuning, demonstrated through the CVPR MMFM Challenge.

Contribution

The paper presents a novel method to produce structured, parseable outputs from frozen multimodal models using hard constraints, reducing the need for costly fine-tuning.

Findings

01. Structured generation improves downstream API integration.

02. The approach outperforms complex models with lightweight engineering.

03. Significant performance gains achieved without fine-tuning.

Abstract

Multimodal Foundation Models (MMFMs) have demonstrated strong performance in both computer vision and natural language processing tasks. However, their performance diminishes in tasks that require a high degree of integration between these modalities, such as document understanding. Moreover, fine-tuning and deploying these models requires significantly more compute and engineering effort than unimodal models do. In this work, we present Multimodal Structured Generation, a framework that forces (frozen) MMFMs to produce outputs in a strictly structured format by applying hard constraints directly to the output logits. This approach not only ensures that the model generates parseable outputs that downstream APIs can easily ingest but also allows us to force the model to reason before answering, which significantly boosts performance without the need for expensive fine-tuning. We…
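To make the phrase "hard constraints directly to the output logits" concrete, here is a minimal sketch of logit masking during greedy decoding. It is not the author's implementation. The vocabulary, the hand-written state machine `DFA`, the preference table `PREFS`, and the function `toy_logits` (a stand-in for a frozen MMFM) are all hypothetical and exist only for illustration. The state machine also places a "reasoning" field before the "answer" field, which is one plausible way to force reasoning before answering.

```python
import json
import math

# Toy vocabulary (hypothetical); a real model has tens of thousands of tokens.
VOCAB = ['{"reasoning": "', "total", " is", " 42", '", "answer": "', "42", '"}', "<eos>"]
TOK = {t: i for i, t in enumerate(VOCAB)}

# Hand-written state machine: state -> {allowed token id: next state}.
# The reasoning field must be filled before the answer field can open.
FREE_TEXT = [TOK["total"], TOK[" is"], TOK[" 42"]]
DFA = {
    "start": {TOK['{"reasoning": "']: "reasoning"},
    "reasoning": {**{t: "reasoning" for t in FREE_TEXT}, TOK['", "answer": "']: "answer"},
    "answer": {TOK["42"]: "close", TOK["total"]: "close"},
    "close": {TOK['"}']: "done"},
    "done": {TOK["<eos>"]: "end"},
}

# Stand-in for a frozen model: scores depend only on the last token.
# Left unconstrained, this toy model blurts out "42" and stops.
PREFS = {
    None: {"42": 3.0, "<eos>": 1.0},
    '{"reasoning": "': {"total": 2.0},
    "total": {" is": 2.0},
    " is": {" 42": 2.0},
    " 42": {"<eos>": 3.0, '", "answer": "': 1.0},
    '", "answer": "': {"<eos>": 3.0, "42": 2.0, "total": 1.0},
    "42": {"<eos>": 3.0},
    '"}': {"<eos>": 3.0},
}


def toy_logits(prefix):
    """Return one score per vocabulary entry given the token ids so far."""
    last = VOCAB[prefix[-1]] if prefix else None
    prefs = PREFS.get(last, {})
    return [prefs.get(tok, 0.0) for tok in VOCAB]


def mask_logits(logits, allowed):
    """Hard constraint: disallowed tokens get -inf, so they can never be chosen."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_decode(constrained, max_steps=32):
    state, prefix = "start", []
    for _ in range(max_steps):
        logits = toy_logits(prefix)
        if constrained:
            logits = mask_logits(logits, DFA[state])
        tok = max(range(len(logits)), key=logits.__getitem__)
        if VOCAB[tok] == "<eos>":
            break
        prefix.append(tok)
        if constrained:
            state = DFA[state][tok]
    return "".join(VOCAB[t] for t in prefix)


if __name__ == "__main__":
    print(greedy_decode(constrained=False))
    structured = greedy_decode(constrained=True)
    print(structured)
    print(json.loads(structured)["answer"])
```

The model weights are never touched: the constraint only removes tokens from consideration at each step, which is why the approach works with a frozen model. Production systems typically compile a JSON schema or regular expression into a token-level automaton instead of writing the transitions by hand.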

Code & Models

Repositories

leloykun/mmfm-challenge (official)

Taxonomy

Topics: Speech and dialogue systems · Civil and Geotechnical Engineering Research

Methods: Sparse Evolutionary Training