MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
Sonali Godavarthy, Matthias Neuwirth-Trapp, Tim-Felix Faasch, Maarten Bieshaar, Michael Moeller, and Danda Pani Paudel

TL;DR
This paper introduces MULTI, a method for disentangling imaging factors such as camera lens, sensor, viewpoint, and domain in text-to-image models, enabling more precise control over generation and novel combinations of these factors.
Contribution
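It proposes a two-stage approach based on textual inversion: the first stage learns general imaging factors, and the second extracts dataset-specific ones. This addresses the limitations of current content-focused methods and allows existing datasets to be extended with novel factor combinations.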
Findings
MULTI effectively disentangles imaging factors in generated images.
The method improves control over image styles and viewpoints.
Evaluation on the DF-RICO benchmark shows significant performance gains.
Abstract
Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in this regime. We therefore propose a new method, Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing…
