Asymmetric Idiosyncrasies in Multimodal Models
Muzi Tao, Chufan Shi, Huijuan Wang, Shengbang Tong, Xuezhe Ma

TL;DR
This paper investigates the stylistic signatures of caption models and their influence on text-to-image models, revealing that while captions retain distinctive styles, these signatures largely vanish in generated images, highlighting cross-modal discrepancies.
Contribution
The study introduces a systematic classification framework to quantify stylistic signatures in caption models and their transfer (or loss) in text-to-image generation.
Findings
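Caption models embed distinctive stylistic signatures: a text classifier identifies the originating caption model with 99.70% accuracy.
Generated images retain little of these signatures: classification accuracy drops to at most 50%, even for the state-of-the-art Flux model.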
Generated images often fail to preserve key variations present in captions.
Abstract
In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our…
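The caption-side experiment described in the abstract (predict which caption model produced a given caption) can be illustrated with a small sketch. This is not the authors' setup: the paper trains neural networks, whereas the code below swaps in a multinomial naive Bayes classifier over word counts, written with only the Python standard library. The labels `captioner_a` and `captioner_b`, the toy captions, and all function names are hypothetical, and the toy result says nothing about the accuracies reported in the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: captioner_a writes terse captions,
# captioner_b writes verbose ones with a fixed opening phrase.
TRAIN = [
    ("a dog on grass", "captioner_a"),
    ("a cat on a sofa", "captioner_a"),
    ("a red car", "captioner_a"),
    ("the image shows a brown dog standing on green grass in bright sunlight", "captioner_b"),
    ("the image shows a grey cat resting on a soft sofa near a window", "captioner_b"),
    ("the image shows a shiny red car parked on a quiet street", "captioner_b"),
]
TEST = [
    ("a bird on a branch", "captioner_a"),
    ("the image shows a small bird in bright sunlight", "captioner_b"),
]


def tokenize(text):
    """Lowercase whitespace tokenization; enough for a toy example."""
    return text.lower().split()


def train(samples):
    """Collect per-label word counts, label counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for caption, label in samples:
        tokens = tokenize(caption)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, caption):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(caption):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of samples whose predicted source matches the true source."""
    return sum(predict(model, c) == y for c, y in samples) / len(samples)


model = train(TRAIN)
print(accuracy(model, TEST))
```

The image-side experiment follows the same pattern with images in place of captions: each generated image is labeled with the caption model whose caption prompted it, and an image classifier is trained to recover that label. The abstract reports that this second classifier does far worse than the text one.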
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques
