From Pixels to Prose: A Large Dataset of Dense Image Captions
Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein

TL;DR
PixelProse is a large, high-quality dataset of over 16 million detailed image captions generated using advanced vision-language models, designed to improve training for vision-language tasks.
Contribution
The paper introduces PixelProse, a comprehensive dataset of synthetically generated dense image captions, addressing the lack of detailed image descriptions in existing datasets.
Findings
Dataset contains over 16 million captions.
Rigorous analysis ensures data quality and safety.
Includes metadata for dataset filtering.
Abstract
Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose
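As an illustration of the metadata-based filtering the abstract mentions, here is a minimal sketch in Python. It runs on a few made-up records instead of the real dataset. The field names (`watermark_score`, `aesthetic_score`), the thresholds, and the sample values are all assumptions for illustration, not the confirmed PixelProse schema.

```python
# Hypothetical records. Field names and values are illustrative assumptions,
# not the actual PixelProse columns or data.
SAMPLE_ROWS = [
    {"url": "https://example.com/a.jpg", "caption": "A red bicycle leaning against a brick wall.",
     "watermark_score": 0.05, "aesthetic_score": 6.1},
    {"url": "https://example.com/b.jpg", "caption": "A stock photo of a handshake in an office.",
     "watermark_score": 0.92, "aesthetic_score": 6.8},
    {"url": "https://example.com/c.jpg", "caption": "A blurry picture of a parking lot at night.",
     "watermark_score": 0.10, "aesthetic_score": 3.2},
    {"url": "https://example.com/d.jpg", "caption": "A bowl of ramen with a soft-boiled egg.",
     "watermark_score": 0.30, "aesthetic_score": 5.0},
]


def filter_rows(rows, max_watermark=0.5, min_aesthetic=5.0):
    """Keep rows that are unlikely to be watermarked and meet an aesthetic floor.

    Both thresholds are arbitrary example values (assumptions).
    """
    return [
        row
        for row in rows
        if row["watermark_score"] <= max_watermark
        and row["aesthetic_score"] >= min_aesthetic
    ]


kept = filter_rows(SAMPLE_ROWS)
for row in kept:
    print(row["url"], "|", row["caption"])
```

The same predicate could be applied to the published dataset once its actual column names are checked on the dataset page. Suitable thresholds depend on the downstream task.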
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
1. Despite being AI-generated, PixelProse captions are well curated through a pipeline that is quite robust and safe. 2. A remarkable effort toward providing higher-quality captions for web-crawled datasets to the community.
1. The dataset contribution is made less valuable by the fact that there is no major use case showing significant improvements. In more detail, the authors propose an empirical comparison by fine-tuning PaliGemma, or by pre-training CLIP-Gemma adapters at small scale. While the results are positive in these two use cases, both assume models pre-trained on other (usually proprietary) datasets, which are much larger than the 16M of PixelProse. The provided evidence only partially suppor
- The paper highlights a critical issue in current vision-language datasets—data quality. By addressing the noisiness and lack of detail in traditional datasets, this work aims to provide a more reliable data source for training. - To enhance data quality, the authors employ diverse prompting strategies to capture varied image details, include negative descriptions to explicitly identify absent objects, and integrate OCR to accurately capture and enhance text elements within images. - To ensure
- The primary issue with this paper is insufficient experimentation. The authors only evaluate their dataset by fine-tuning a pre-trained PaliGemma model, using a randomly selected subset of 2M samples. This experimental setup introduces significant bias, making it difficult to effectively demonstrate the dataset's validity, generalizability, and scalability. To strengthen the evaluation, additional experiments are necessary, including the following: - Training on a broader range of multimoda
+ A new dataset with 16 million image-text pairs, potentially useful for finetuning models pre-trained on larger but noisier and lower quality data. + Extra safety precautions in the collection of images compared to similar datasets and considerations with respect to privacy and toxicity. + Summarizes well the literature on the state of image-caption paired datasets and identifies convincingly some gaps and opportunities for improvement in current practices.
* The dataset is automatically generated; it doesn't seem like it would be difficult to reproduce by collecting a similar dataset with the characteristics described here. * There is some empirical contribution in the construction of the dataset, but there is no single clearly identifiable contribution. Looking at the experiments, it is hard to guess which factors contributed to the improved performance. There are no ablations that offer any insight into this. * Results: GPT4V-100k o
+ The paper is well written and easy to follow. The authors have done a good job in providing motivation for their paper. + The consideration of ethical aspects is important, especially on large-scale datasets. + The analysis of the dataset statistics and comparison to other datasets is insightful.
- Novelty - Recaptioning image-text datasets is not new; several such works exist [1,2,3]. Unfortunately, I do not see significant improvements or new techniques compared to such works. - Negative description - the authors claim that TTI and VLMs struggle with understanding negation and thus include such phrases in the dataset; however, the effect of this is not demonstrated. - Repurposing captions into VQA pairs - while the motivation for doing so is clear, the authors do not show any
- The writing is clear. - Great care was taken to filter the dataset from NSFW content as well as personally identifiable content.
- Comparison of unique nouns is only provided with respect to other datasets of smaller size. However, the proposed innovation in this work is the generative caption enhancement. Therefore, a more apt comparison is between the original dataset and the enhanced one. This is especially the case since the authors claim that alt-text is a limited source of data. - Results are only provided over a 2M subset of the data. This is a very limited comparison because it is possible that gains are only re
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
