TL;DR
This paper introduces BOB, a novel fine-tuning strategy for text-to-image models that enhances synthetic data quality for low-shot fine-grained classification by explicitly conditioning on class-agnostic attributes, resulting in state-of-the-art performance.
Contribution
The paper proposes BOB, a fine-tuning method that conditions on class-agnostic attributes to improve synthetic data generation for fine-grained classification, reducing overfitting and inter-class confusion.
Findings
BOB outperforms DataDream by 7.4% on the Aircraft dataset.
Augmenting real training data with BOB-generated synthetic images improves low-shot classification accuracy.
BOB achieves better results than using more real images in most benchmarks.
Abstract
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations.…
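The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Attributes` type, the function names, and the prompt template are all hypothetical. The attribute strings stand in for the output of an attribute-extraction step that is not shown. Sampling (background, pose) pairs uniformly from a bank pooled across classes is an assumption about how the marginalization might be approximated.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributes:
    """Class-agnostic attributes of one real image (hypothetical schema)."""
    background: str
    pose: str


def training_caption(class_name: str, attrs: Attributes) -> str:
    # Fine-tuning step: the caption names the class AND its context, so the
    # model need not bind the background or pose to the class token.
    # The template itself is an assumption.
    return f"a photo of a {class_name}, {attrs.pose}, {attrs.background}"


def build_attribute_bank(examples):
    # Pool attributes over ALL classes; class labels are deliberately dropped.
    return [attrs for _, attrs in examples]


def generation_prompts(class_name, bank, n, rng):
    # Generation step: draw context independently of the class, which
    # approximates marginalizing the attributes out by Monte Carlo sampling.
    return [training_caption(class_name, rng.choice(bank)) for _ in range(n)]


if __name__ == "__main__":
    examples = [
        ("Boeing 737", Attributes("on a runway", "side view")),
        ("Airbus A320", Attributes("in a clear sky", "taking off")),
    ]
    bank = build_attribute_bank(examples)
    for prompt in generation_prompts("Boeing 737", bank, 3, random.Random(0)):
        print(prompt)
```

Because the bank is shared across classes, a context seen only with one class during fine-tuning can appear with any class at generation time, which is the intuition behind reducing unintended inter-class associations.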
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well-motivated. - The paper is readable. - Numerous experiments demonstrate that this method is effective.
1. The BOB method is too simple and lacks novelty. Obtaining more detailed captions and combining image elements to generate new images is rather trivial. 2. If the T2I model directly generates synthetic data without fine-tuning—by simply adjusting prompts or random seeds—to augment real data, how would the downstream classification performance compare? 3. In Table 1, are the training epochs for the “Read only” method and the “BOB” model the same? If so, it means that “Read only” was trained f
1. It presents a clear causal framing for removing spurious context that integrates with existing SD backbones. 2. The experiment is conducted on multiple datasets and backbones to demonstrate its good performance.
1. BOB relies on captions extracted from only 5–10 real exemplars per class using a VLM (Qwen-VL-7B) to describe background and pose attributes. However, since modern large language models already possess rich world knowledge about object appearances and environments, this dependence on limited real images could be avoided. By prompting an LLM directly (e.g., “Describe possible backgrounds and poses for an aircraft image”), one could construct a large and diverse attribute bank without relying o
1. The paper addresses an important and practical problem, reducing spurious correlations in few-shot text-to-image (T2I) data generation, and demonstrates clear performance gains on fine-grained and long-tail classification tasks. 2. The proposed two-stage framework (context preservation + context marginalization) is conceptually simple, easy to implement, and yields consistent improvements across datasets and backbones.
1. The method feels largely like an enhanced prompt optimization pipeline. Although the authors combine LoRA fine-tuning (context preservation) with dataset-level randomization (context marginalization), the overall novelty is limited. 2. The causal explanation between foreground and background is not new: similar causal interpretations (e.g., back-door adjustment) have appeared in previous few-shot or domain generalization literature. 3. It would be more convincing if the proposed method were d
S1) The results show consistent improvement. S2) A wide range of settings is considered, which provides good coverage of base model and several numbers of shots. S3) The methodology is straightforward and well-explained. S4) The paper is written clearly.
W1) (related to Q1) There are specific downsides to the chosen evaluation setting, particularly in the choice of datasets compared to other work (e.g. Kim 2024, also [A]). The five datasets in the chosen setting are all fine-grained; this is much less comprehensive than the 10 datasets used in the other setting, which also includes more general datasets (e.g. ImageNet, Caltech101) and some out-of-distribution datasets (EuroSAT, DTD). W2) The work see
