BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Eliran Kachlon; Alexander Visheratin; Nimrod Sarid; Tal Hacham; Eyal Gutflaish; Saar Huberman; Hezi Zisman; David Ruppin; Ron Mokady

arXiv:2602.20672 · cs.CV · February 25, 2026

TL;DR

BBQ is a large-scale text-to-image model that gives precise control over object placement, size, and color through numeric annotations, with no architectural changes.

Contribution

The paper presents a novel structured-text framework that lets text-to-image models condition directly on numeric values, improving control and fidelity without modifying the model architecture.

Findings

01. Achieves strong box alignment in generated images.

02. Improves RGB color fidelity over baselines.

03. Enables intuitive user interfaces for image editing.
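
This page does not say how box alignment or color fidelity are measured. Purely as an illustration, two common choices would be intersection-over-union (IoU) between the requested box and a box detected in the generated image, and the Euclidean distance between the requested RGB triplet and the color observed on the object. The choice of metrics and the function names `box_iou` and `rgb_error` are assumptions made here, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def rgb_error(target, observed):
    """Euclidean distance between two RGB triplets (0-255 per channel)."""
    return sum((t - o) ** 2 for t, o in zip(target, observed)) ** 0.5
```

A higher `box_iou` means the generated object sits closer to the requested box; a lower `rgb_error` means the rendered color is closer to the requested triplet.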

Abstract

Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across…
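
To make the idea of a structured-text caption concrete, here is a minimal sketch of how numeric boxes and RGB triplets might be serialized into a prompt, and how an object-dragging interaction could reduce to editing those numbers. The JSON layout, the field names `bbox` and `rgb`, the normalized [0, 1] coordinate convention, and the helpers `build_caption` and `drag_object` are all hypothetical; the abstract does not specify the caption schema BBQ actually uses.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ObjectSpec:
    description: str
    bbox: tuple  # assumed layout: (x_min, y_min, x_max, y_max), normalized to [0, 1]
    rgb: tuple   # assumed layout: (r, g, b), each 0-255


def build_caption(scene, objects):
    """Serialize a scene description plus per-object numeric annotations as JSON text."""
    payload = {"scene": scene, "objects": [asdict(o) for o in objects]}
    return json.dumps(payload)


def drag_object(obj, dx, dy):
    """Return a copy of obj shifted by (dx, dy), clamped so the box stays in the frame."""
    x0, y0, x1, y1 = obj.bbox
    dx = max(-x0, min(dx, 1.0 - x1))
    dy = max(-y0, min(dy, 1.0 - y1))
    moved = tuple(round(v, 4) for v in (x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return ObjectSpec(obj.description, moved, obj.rgb)


mug = ObjectSpec("ceramic mug", (0.1, 0.2, 0.4, 0.6), (200, 30, 30))
prompt = build_caption("a kitchen counter", [drag_object(mug, 0.3, 0.0)])
```

Under these assumptions, a drag gesture or a color-picker change only rewrites a few numbers in the caption text, which is then passed to the model like any other prompt.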

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis