InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Rongyao Fang; Shilin Yan; Zhaoyang Huang; Jingqiu Zhou; Hao Tian; Jifeng Dai; Hongsheng Li

arXiv:2311.18835 · cs.CV · December 1, 2023 · 2 cites

TL;DR

InstructSeq is a multimodal transformer framework that unifies diverse vision tasks through natural language instructions, enabling flexible, instruction-driven visual task execution without task-specific tuning.

Contribution

The paper introduces a unified, instruction-conditioned multimodal model, built on a transformer architecture, that handles multiple vision tasks through natural language control.

Findings

01. Achieves strong performance on semantic segmentation, referring expression tasks, and image captioning.

02. Operates effectively without task-specific fine-tuning.

03. Provides an intuitive natural language interface for diverse vision tasks.

Abstract

Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive…
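
The data flow described in the abstract (encode the image, encode the instruction, then generate the output sequence token by token) can be illustrated with a minimal toy sketch. This is not the authors' implementation. The names `encode_image`, `encode_instruction`, `generate`, `toy_decoder`, and the `<eos>` marker are hypothetical, the encoders are trivial stand-ins, fusion is assumed to be plain concatenation, and the "decoder" replays a canned answer instead of running a trained transformer.

```python
from typing import Callable, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker


def encode_image(patches: Sequence[str]) -> List[str]:
    """Stand-in visual encoder: tags each image patch as a visual feature."""
    return [f"img:{p}" for p in patches]


def encode_instruction(instruction: str) -> List[str]:
    """Stand-in text encoder: lowercases the instruction and splits it into tokens."""
    return [f"txt:{w}" for w in instruction.lower().split()]


def generate(
    context: Sequence[str],
    next_token: Callable[[Sequence[str], Sequence[str]], str],
    max_len: int = 16,
) -> List[str]:
    """Greedy autoregressive loop: each step sees the fused context and all prior outputs."""
    output: List[str] = []
    for _ in range(max_len):
        token = next_token(context, output)
        if token == EOS:
            break
        output.append(token)
    return output


def toy_decoder(context: Sequence[str], prefix: Sequence[str]) -> str:
    """Toy stand-in for a trained decoder: replays a canned answer chosen by the instruction."""
    if "txt:caption" in context:
        answer = ["a", "dog", "on", "grass"]
    else:
        answer = ["unknown", "task"]
    return answer[len(prefix)] if len(prefix) < len(answer) else EOS


# Fusion by concatenation is an assumption made for this sketch only.
context = encode_image(["p0", "p1"]) + encode_instruction("Caption this image")
print(generate(context, toy_decoder))
```

The point of the sketch is the interface: a single generation loop serves every task, and only the instruction in the context changes what sequence comes out.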

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling