Gamified crowd-sourcing of high-quality data for visual fine-tuning
Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan

TL;DR
This paper presents a gamified crowdsourcing framework called GAP that collects high-quality visual instruction data, significantly improving the performance of small multimodal models and demonstrating cross-model benefits.
Contribution
The paper introduces a scalable, engaging platform for crowd-sourcing targeted visual question-answer pairs to enhance multimodal model training, addressing model weaknesses effectively.
Findings
Raised the GPT score of MiniCPM-Llama3-V-2.5-8B from 0.147 to 0.477.
Collected data from over 50,000 participants in just a few weeks.
Enhanced performance across multiple models and benchmarks.
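To put the reported score change in perspective, the short sketch below computes the absolute and relative gain from the two GPT scores quoted above. The helper name `score_gain` is hypothetical and only illustrates the arithmetic; it is not part of the paper or its code.

```python
def score_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, ratio after/before) for two scores.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if before <= 0:
        raise ValueError("before must be positive to compute a ratio")
    return after - before, after / before


# GPT scores reported in the findings above.
abs_gain, ratio = score_gain(0.147, 0.477)
print(f"absolute gain: {abs_gain:.3f}")
print(f"ratio: {ratio:.2f}x")
```

The absolute gain is about 0.33, which means the fine-tuned score is a little more than three times the baseline score.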
Abstract
This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge. Our contributions include (1) an approach to capture question-answer pairs from humans that directly address weaknesses in a model's knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeds in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B, increasing its GPT…
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper presents an innovative Gamified Adversarial Prompting (GAP) framework that effectively integrates human involvement with multimodal learning strategies, significantly enhancing the performance of multimodal AI models in visual question answering and paving a new research direction.
2. The GAP framework underscores the vital role of human cognition and diverse perspectives in the model enhancement process, effectively mitigating biases and errors commonly associated with traditional
1. Although the GAP-VQA dataset has been filtered to ensure a high proportion of adversarial examples, the diversity and representativeness of its samples still require further validation. The selected 3,683 question-image pairs may not adequately cover the diverse scenarios encountered in real-world applications. A lack of diversity could lead to suboptimal model performance on unseen tasks or images.
2. The evaluation of the model primarily relies on GPT-4 as the evaluator. While it can provid
- The paper is based on a very interesting idea: using gamification to collect data for fine-tuning large multimodal models.
- The experiments demonstrate that the proposed approach improves the performance of a model, i.e., MiniCPM-Llama3-V-2.5-8B.
- The proposed system was used by several participants, and a detailed analysis of users' participation is shown in the Appendix.
- The writing of the paper needs significant improvement. The description of the method is confusing, with some details discussed only in the supplementary material (see A.3 PLAYER INTERACTION MODEL). A lot of space is dedicated to related work, while some of that space in the main text should have been dedicated to describing GAP and the final system.
- The description in L 337 of intrinsic and extrinsic factors is very high level and details on how this is integrated in th
1. By gamifying the process, GAP keeps players motivated and engaged, potentially leading to high-quality data collection.
2. By automatically evaluating and rewarding player submissions, this approach can effectively scale up the data.
3. The framework has demonstrated significant improvements in model accuracy on VQA benchmarks, indicating its effectiveness.
1. Too few baseline models are used in the experiments, and the numerical results presented in Table 5 do not show significant changes.
2. Quality control: while the framework aims for high-quality data, there may still be variability in the accuracy of player-generated content.
3. The specific category of each question asked by a player cannot be determined, making it difficult to balance the number of different types of questions.
Taxonomy
Topics: Data Visualization and Analytics · Educational Games and Gamification · Virtual Reality Applications and Impacts
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Residual Connection · Weight Decay · Cosine Annealing · Dropout · Byte Pair Encoding · Softmax
