Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Minheng Ni; Yutao Fan; Lei Zhang; Wangmeng Zuo

arXiv:2410.03321 · cs.CV · October 7, 2024

TL;DR

Visual-O1 introduces a multi-modal, multi-turn reasoning framework that enhances large models' ability to interpret ambiguous instructions by simulating human-like reasoning, improving performance across various datasets.

Contribution

The paper presents a novel multi-modal, multi-turn chain-of-thought reasoning framework that effectively disambiguates instructions without high computational costs.

Findings

01. Significantly improves model performance on ambiguous instructions

02. Improves performance on general datasets

03. Works effectively across models of different intelligence levels

Abstract

As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak reasoning abilities of disambiguation can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to…

Videos

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques