Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li; Yuanhan Zhang; Liangyu Chen; Jinghao Wang; Fanyi Pu; Joshua Adrian Cahyono; Jingkang Yang; Ziwei Liu

arXiv:2305.03726·cs.CV·July 29, 2025·87 cites

TL;DR

Otter is a multi-modal model that leverages both text and visual in-context examples through instruction tuning, significantly improving multi-modal understanding and generalization in tasks involving images and videos.

Contribution

The paper introduces Otter, a multi-modal model built on Flamingo with Perceiver architecture, and the MIMIC-IT dataset with over 3 million instruction-response pairs for enhanced instruction tuning.

Findings

01. Otter demonstrates improved instruction following and generalization.

02. MIMIC-IT dataset enables better multi-modal understanding.

03. Model excels in complex video and multi-image tasks.

Abstract

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction-following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction-tuned to serve as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal…

Code & Models

Repositories

luodian/otter (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Dropout · Byte Pair Encoding · Weight Decay · Multi-Head Attention · Attention Dropout