Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

Herman Bergström; Aditya Mehrotra; Rahul G. Krishnan

arXiv:2605.20674 · cs.LG · May 21, 2026

TL;DR

CoMET is a simple, out-of-the-box multimodal classification method that combines frozen pre-trained backbones, PCA, and a tabular foundation model, achieving state-of-the-art results without training.

Contribution

The paper introduces CoMET, a compositional approach that feeds PCA-compressed embeddings from frozen backbone encoders into a tabular foundation model, enabling multimodal classification without fine-tuning.

Findings

01. Achieves state-of-the-art results across diverse benchmarks.
02. Handles large-scale hierarchical classification with over 500,000 samples.
03. Does not require any training or fine-tuning of backbone models.

Abstract

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate as input into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices to act as an adaptor yielding strong, robust performance across modalities. When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with…
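
The pipeline described in the abstract (frozen embeddings per modality, PCA compression, concatenation, tabular classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are synthetic stand-ins for frozen backbone outputs, the per-modality PCA with 8 components is an arbitrary choice, and the `NearestCentroid` class is only a placeholder occupying the slot where a tabular foundation model would go. All function and variable names are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Fit PCA on training rows; return (mean, components)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def pca_transform(X, mean, components):
    """Project rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T


def compose_features(train_embs, test_embs, n_components=8):
    """PCA-compress each modality separately, then concatenate column-wise."""
    train_parts, test_parts = [], []
    for X_tr, X_te in zip(train_embs, test_embs):
        # PCA is fitted on training embeddings only and reused for test rows.
        mean, comps = pca_fit(X_tr, n_components)
        train_parts.append(pca_transform(X_tr, mean, comps))
        test_parts.append(pca_transform(X_te, mean, comps))
    return np.hstack(train_parts), np.hstack(test_parts)


class NearestCentroid:
    """Placeholder classifier (assumption): NOT a tabular foundation model."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack(
            [X[y == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, X):
        # Squared Euclidean distance from every row to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]


# Synthetic stand-ins for frozen backbone outputs (assumption: two modalities).
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)


def fake_embeddings(y, dim):
    # Class-dependent mean shift so the toy task is learnable.
    return rng.normal(size=(len(y), dim)) + y[:, None] * 1.0


train_embs = [fake_embeddings(y_train, 64), fake_embeddings(y_train, 32)]
test_embs = [fake_embeddings(y_test, 64), fake_embeddings(y_test, 32)]

Z_train, Z_test = compose_features(train_embs, test_embs, n_components=8)
preds = NearestCentroid().fit(Z_train, y_train).predict(Z_test)
accuracy = float((preds == y_test).mean())
print(Z_train.shape, Z_test.shape, accuracy)
```

In this sketch nothing upstream of the final classifier is trained with gradients: the only fitted quantities are the PCA means and directions. Swapping the placeholder for an actual tabular foundation model would only change the last two lines, provided it exposes a comparable fit/predict interface.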
