Comparison Visual Instruction Tuning
Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

TL;DR
This paper introduces CaD-VI, a new method for generating visual instructions and a large dataset to enhance large multimodal models' ability to compare images, significantly improving their visual reasoning capabilities.
Contribution
The paper presents a novel two-phase approach CaD-VI for synthetic visual instruction collection and a large dataset CaD-Inst, advancing image comparison skills in multimodal models.
Findings
Improves CaD spotting in LMMs by up to 17.5%.
Enhances existing difference-only datasets by up to 10%.
Provides a new benchmark with 7.5K open-ended questions.
Abstract
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks…
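As a rough illustration of what caption-based collection of CaD instructions could look like, the sketch below builds a text-only prompt from two image descriptions and stores the result in a record. It is not the authors' pipeline: the `CaDSample` fields, the `build_cad_prompt` helper, and the prompt wording are all hypothetical, and the LLM call that would fill in the response is left out.

```python
from dataclasses import dataclass


@dataclass
class CaDSample:
    """One hypothetical instruction-tuning record for an image pair."""
    image_a: str      # path or ID of the first image (assumed field)
    image_b: str      # path or ID of the second image (assumed field)
    instruction: str  # prompt asking for commonalities and differences
    response: str     # CaD summary; would be produced by an LLM, empty here


def build_cad_prompt(caption_a: str, caption_b: str) -> str:
    """Assemble a text-only prompt asking an LLM to compare two captions.

    The wording is an assumption made for illustration only.
    """
    return (
        f"Image 1 description: {caption_a.strip()}\n"
        f"Image 2 description: {caption_b.strip()}\n"
        "List the commonalities and the differences between the two images."
    )


# Usage: build a prompt from two captions and wrap it in a record.
prompt = build_cad_prompt(" a red car on a street ", "a blue car on a street")
sample = CaDSample("a.jpg", "b.jpg", prompt, response="")
```

Keeping the prompt builder separate from the record makes it easy to swap in a different generator, for instance one that looks at the images directly rather than at their captions, as one reviewer suggests below.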
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper introduces CaD-VI, a two-phase approach for enhancing visual instruction tuning in large multimodal models (LMMs), with a specific focus on comparing commonalities and differences (CaD) between image pairs. This work addresses an underexplored area in LMMs and provides valuable insights for visual reasoning. The paper contributes a dataset, CaD-Inst, containing 349K image pairs for CaD instruction tuning, and CaD-QA, a benchmark of 7.5K open-ended questions designed to evaluate CaD cap
The paper uses an open-source LLM or MLLM to generate the training data. What if GPT-4 were used to generate data directly from the images rather than from the image descriptions? A less costly approach would be to test on, say, 5 data points and see how large the gap is, although a more comprehensive way is to replace the open-source component with, e.g., GPT-4 in the proposed pipeline. In this way, it may also work to directly use the image rather than the caption to generate the CaD da
- The concept of recognizing commonalities and differences aligns well with human visual perception and design principles, bringing an innovative perspective to visual data processing.
- The two-phase strategy is technically sound and shows promise for enhancing LMM capabilities.
- Comprehensive statistics and data sources are provided, which strengthens the transparency of the data collection process.
- Implementation details are thoroughly documented, offering useful insights for replication.

- The paper offers limited algorithmic innovation, as much of the method depends on synthetic data generation without addressing potential pitfalls.
- The quality of generated data is uncertain, particularly regarding hallucinations or inaccuracies in synthetic outputs. More clarity on quality control measures or validation processes is needed to ensure data reliability.
- The effectiveness of the commonalities and differences data remains ambiguous. In Table 7, results are presented with varied
1. This paper addresses an overlooked sub-problem in multi-image question answering. The proposed new dataset and benchmark may support the development of versatile multimodal large models and facilitate comprehensive evaluations.
2. Detailed ablation experiments on the data construction method provide valuable insights, guiding future training data development for multimodal models.

1. The paper’s scope is limited, focusing narrowly on comparing two images in terms of commonalities and differences (CaD), a sub-capability within the broader multi-image question-answering domain. Its main contribution lies in using LLMs to generate data for CaD, fine-tuning existing VLMs on this data, and performing in-domain evaluations, which constrains its general applicability.
2. The two-phase pipeline is similar to the multi-stage annotation process introduced by Kirillov et al.[1] Howe

1. The paper introduces a unique two-phase methodology, CaD-VI, specifically designed for training LMMs on commonality and difference (CaD) tasks. This novel approach effectively fills a gap in multimodal AI, where CaD reasoning has received limited focus.
2. By creating and releasing the CaD-Inst dataset with 349K image pairs, the authors provide a valuable resource for training and evaluating LMMs on nuanced visual reasoning tasks. This large dataset enhances model robustness in spotting both
1. The paper’s experiments mainly focus on benchmark performance gains without a clear demonstration of how the model performs on real-world, uncurated image pairs. General LVLM benchmarks such as MME, HallusionBench, MMMU, and MMC-Benchmark could be considered; HallusionBench, MMMU, and MMC-Benchmark also take multiple images as input.
2. The authors mention that they use Mixtral 8x7B as the evaluator. How does it compare with human evaluation and GPT-4? Is it possible to use a non-LLM method?
3. Is
Taxonomy
Topics: Education and Technology Integration · Educational Environments and Student Outcomes
Methods: Sparse Evolutionary Training
