TL;DR
Gen-n-Val is an agentic framework that combines a large language model, a vision large language model, and Layer Diffusion to generate high-quality, diverse synthetic data for object detection and instance segmentation, cutting invalid synthetic data from 50% to 7% and improving mAP on rare-class and open-vocabulary detection.
Contribution
It introduces a new agentic data generation system leveraging LLMs and VLLMs, reducing invalid data and enhancing detection accuracy on rare classes.
Findings
Reduces invalid synthetic data from 50% to 7%.
Improves rare class detection by 7.6% mAP on LVIS.
Achieves 7.1% mAP improvement in open-vocabulary detection.
Abstract
Data scarcity, label noise, and long-tailed category imbalance remain important and unresolved challenges in many computer vision tasks, such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), a Large Language Model (LLM), and a Vision Large Language Model (VLLM) to produce high-quality and diverse instance masks and images for object detection and instance segmentation. Gen-n-Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes prompts to encourage LD to…
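The generate-then-filter idea behind the framework can be illustrated with a minimal sketch. The abstract shown here is truncated, so everything below is an assumption rather than the authors' implementation: the function names (`generate_and_validate`, `demo_refine_prompt`, `demo_generate`, `demo_is_valid`) are hypothetical, the three stages are plain Python callables standing in for the LLM prompt agent, Layer Diffusion, and a VLLM-based check, and the validation role of the VLLM is inferred only from the framework's name and the reported drop in invalid data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """One synthetic instance: category label, prompt used, and generated output."""
    category: str
    prompt: str
    image: str  # placeholder; a real pipeline would hold image and mask arrays


def generate_and_validate(
    categories: List[str],
    refine_prompt: Callable[[str], str],
    generate: Callable[[str], str],
    is_valid: Callable[[Sample], bool],
) -> Tuple[List[Sample], float]:
    """Generate one sample per category and keep only those that pass validation.

    Returns the kept samples and the fraction of generated samples rejected.
    """
    kept: List[Sample] = []
    rejected = 0
    for category in categories:
        prompt = refine_prompt(category)  # hypothetical stand-in for the LLM prompt agent
        image = generate(prompt)          # hypothetical stand-in for Layer Diffusion
        sample = Sample(category, prompt, image)
        if is_valid(sample):              # hypothetical stand-in for a VLLM check
            kept.append(sample)
        else:
            rejected += 1
    total = len(kept) + rejected
    invalid_rate = rejected / total if total else 0.0
    return kept, invalid_rate


# Deterministic stubs so the sketch runs without any model.
def demo_refine_prompt(category: str) -> str:
    return f"a single {category}, isolated foreground object"


def demo_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def demo_is_valid(sample: Sample) -> bool:
    # Simulate one failed generation; a real validator would inspect the image.
    return sample.category != "ferret"


kept, invalid_rate = generate_and_validate(
    ["cat", "dog", "ferret", "zebra"],
    demo_refine_prompt,
    demo_generate,
    demo_is_valid,
)
print(len(kept), invalid_rate)
```

Passing the three stages as callables keeps the loop independent of any particular model API, so each stub could be swapped for a real model call without changing the filtering logic.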
