BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models
Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, Ron Mokady

TL;DR
BBQ introduces a large-scale text-to-image model that enables precise control over object placement, size, and color through numeric annotations, enhancing fine-grained image generation without architectural changes.
Contribution
It presents a novel structured-text framework allowing direct numeric conditioning in text-to-image models, improving control and fidelity without modifying the model architecture.
Findings
Achieves strong box alignment in generated images.
Improves RGB color fidelity over baselines.
Enables intuitive user interfaces for image editing.
Abstract
Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis
