S3: A Simple Strong Sample-effective Multimodal Dialog System

Elisei Rykov, Egor Malkershin, and Alexander Panchenko

arXiv:2406.18305·cs.CL·June 27, 2024

TL;DR

This paper introduces S3, a simple yet effective multimodal dialog system built on a pre-trained large language model and pre-trained modality encoders. It achieves near state-of-the-art results while training on only a small amount of multimodal data.

Contribution

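The paper proposes a multimodal dialog architecture that combines a pre-trained large language model, pre-trained image and audio encoders, and a trainable modality projector. Together with a training data mixture, this architecture performs strongly with limited multimodal training data.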

Findings

01 Achieves near state-of-the-art results on the MMMU and AI Journey Contest 2023 leaderboards

02 Trains effectively on a small amount of multimodal data

03 Combines a pre-trained large language model with pre-trained image and audio encoders

Abstract

In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
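
The architecture described in the abstract (modality encoder, trainable projector, language model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the projector is modeled as a single linear layer, the projected modality tokens are prepended to the text token embeddings, and all function names and dimensions are hypothetical. Pure Python lists stand in for tensors so that the example runs without dependencies.

```python
import random


def make_linear_projector(in_dim, out_dim, seed=0):
    """Create weights and bias for a hypothetical linear modality projector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, weights, bias):
    """Map each encoder feature vector (in_dim) to the LLM embedding size (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weights, bias)]
        for vec in features
    ]


def build_llm_input(text_embeddings, modality_features, weights, bias):
    """Prepend projected modality tokens to the text embeddings (assumed ordering)."""
    return project(modality_features, weights, bias) + text_embeddings


# Toy sizes, chosen only for illustration.
ENCODER_DIM, LLM_DIM = 4, 6

weights, bias = make_linear_projector(ENCODER_DIM, LLM_DIM)

# Stand-ins for outputs of a pre-trained image encoder (2 feature vectors)
# and for the language model's embeddings of 3 text tokens.
image_features = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.1, 0.0, 0.3]]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

llm_input = build_llm_input(text_embeddings, image_features, weights, bias)
print(len(llm_input), len(llm_input[0]))  # 5 6
```

In this sketch only `weights` and `bias` would be updated during training, which mirrors the abstract's description of the projector as the trainable component. How the actual system trains or arranges its inputs is not specified here.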

Code & Models

Repositories

s-nlp/s3
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Advanced Text Analysis Techniques