TL;DR
This paper systematically evaluates whether adding language supervision to vision models improves their generalization. It finds that multimodal training offers no advantage over standard supervised visual training in unsupervised clustering, few-shot learning, transfer learning, or adversarial robustness.
Contribution
The study systematically compares several multimodal architectures with vision-only models and shows that the multimodal approaches tested yield no additional generalization capability.
Findings
Multimodal training does not outperform standard supervised vision-only training in unsupervised clustering, few-shot learning, transfer learning, or adversarial robustness.
Semantic grounding alone does not enhance vision model generalization.
Further work is needed before semantic grounding can improve vision models.
Abstract
Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.
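To make the few-shot setting mentioned in the abstract concrete, here is a minimal sketch of one common way to score frozen visual features: a nearest-centroid classifier fit on a handful of labeled examples per class. This is not the paper's protocol; the function names, the synthetic features, and the choice of nearest-centroid classification are all assumptions made for illustration. In practice the feature arrays would come from the multimodal and vision-only encoders being compared.

```python
import numpy as np


def few_shot_accuracy(features, labels, n_shots, rng):
    """Nearest-centroid accuracy on frozen features (hypothetical helper).

    n_shots examples per class form the support set; every remaining
    example is a query that is assigned to the closest class centroid.
    """
    classes = np.unique(labels)

    # Sample the support set: n_shots indices per class, without replacement.
    support_idx = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        support_idx.extend(rng.choice(idx, size=n_shots, replace=False))
    support_idx = np.array(support_idx)

    # Everything not in the support set is used as a query.
    query_mask = np.ones(len(labels), dtype=bool)
    query_mask[support_idx] = False

    # One centroid (prototype) per class, computed from the support set only.
    support_feats = features[support_idx]
    support_labels = labels[support_idx]
    centroids = np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in classes]
    )

    # Squared Euclidean distance from each query to each centroid.
    queries = features[query_mask]
    dists = ((queries[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == labels[query_mask]).mean())


def make_synthetic_features(n_classes, n_per_class, dim, spread, rng):
    """Stand-in for encoder outputs: Gaussian blobs around random class centers."""
    centers = rng.normal(size=(n_classes, dim)) * 5.0
    feats = np.concatenate(
        [center + spread * rng.normal(size=(n_per_class, dim)) for center in centers]
    )
    labels = np.repeat(np.arange(n_classes), n_per_class)
    return feats, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical feature spaces with different class separation.
    for name, spread in [("well-separated", 0.5), ("poorly-separated", 8.0)]:
        feats, labels = make_synthetic_features(5, 40, 16, spread, rng)
        acc = few_shot_accuracy(feats, labels, n_shots=5, rng=rng)
        print(f"{name}: 5-shot accuracy = {acc:.3f}")
```

Running the same scoring function on features from each encoder, with the same support/query splits, gives a like-for-like comparison of the kind the abstract describes.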