Cross-Modal Fine-Tuning: Align then Refine
Junhong Shen, Liam Li, Lucio M. Dery, Corey Staten, Mikhail Khodak, Graham Neubig, Ameet Talwalkar

TL;DR
ORCA is a cross-modal fine-tuning framework that first aligns embedded target data with the pretraining modality and then fine-tunes the pretrained model, achieving state-of-the-art results on 3 benchmarks covering 12 modalities.
Contribution
ORCA introduces an align-then-refine workflow that lets a single pretrained model adapt to diverse target modalities.
Findings
State-of-the-art results on 3 benchmarks with 60+ datasets from 12 modalities
Effective data alignment improves performance, especially in data-limited scenarios
Outperforms various hand-designed, AutoML, and task-specific methods
Abstract
Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range…
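The align-then-refine workflow described in the abstract can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation. The abstract does not say which distance ORCA uses to compare feature distributions, so the sketch substitutes a simple mean-matching gap. The linear embedder, the linear stand-in for the pretrained model, the choice to keep the embedder fixed during the second stage, and all names and data (mean_gap, align, refine, X, S, w_pre) are hypothetical.

```python
import numpy as np

def mean_gap(W, X, S):
    """Squared distance between the mean embedded target and the mean source feature.
    A stand-in alignment objective; the abstract does not name the metric ORCA uses."""
    diff = (X @ W).mean(axis=0) - S.mean(axis=0)
    return float(diff @ diff)

def align(X, S, W, steps=50):
    """Stage 1 (align): gradient descent on mean_gap with respect to the embedder W."""
    x_bar, s_bar = X.mean(axis=0), S.mean(axis=0)
    lr = 0.25 / float(x_bar @ x_bar)  # with this step size each update halves the gap vector
    W = W.copy()
    for _ in range(steps):
        diff = x_bar @ W - s_bar
        W -= lr * 2.0 * np.outer(x_bar, diff)  # gradient of ||x_bar W - s_bar||^2
    return W

def mse(Z, y, w):
    """Mean squared error of the linear model w on embedded data Z."""
    return float(np.mean((Z @ w - y) ** 2))

def refine(Z, y, w_pre, steps=200):
    """Stage 2 (refine): fine-tune 'pretrained' weights on the embedded data Z."""
    n = Z.shape[0]
    lr = n / (2.0 * np.linalg.norm(Z, 2) ** 2)  # 1/L for this quadratic loss
    w = w_pre.copy()
    for _ in range(steps):
        w -= lr * (2.0 / n) * Z.T @ (Z @ w - y)  # gradient of the MSE loss
    return w

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=(64, 5))    # hypothetical target-modality inputs
y = X @ rng.normal(size=5)               # hypothetical target labels
S = rng.normal(loc=0.5, size=(128, 4))   # hypothetical pretraining-modality features
w_pre = rng.normal(size=4)               # hypothetical "pretrained" weights

W0 = rng.normal(scale=0.1, size=(5, 4))  # embedder before alignment
W = align(X, S, W0)
gap_before, gap_after = mean_gap(W0, X, S), mean_gap(W, X, S)

Z = X @ W                                # embedded target data
w = refine(Z, y, w_pre)
loss_before, loss_after = mse(Z, y, w_pre), mse(Z, y, w)
```

The two stages are kept as separate functions to mirror the order in the abstract: the embedder is fitted to the pretraining features first, and only afterwards are the pretrained weights updated on the task loss. A real system would use neural networks and a richer distribution distance in place of these linear stand-ins.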
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
