S3: A Simple Strong Sample-effective Multimodal Dialog System
Elisei Rykov, Egor Malkershin, and Alexander Panchenko

TL;DR
This paper introduces S3, a simple yet effective multimodal dialog system that leverages pre-trained language and modality encoders, achieving near state-of-the-art results with minimal multimodal data.
Contribution
The paper proposes a novel multimodal dialog system architecture combining pre-trained models and a data mixture strategy, demonstrating high performance with limited training data.
Findings
Achieves near state-of-the-art results on MMMU and AI Journey Contest 2023
Effective training with a small amount of multimodal data
Utilizes a combination of pre-trained language and modality encoders
Abstract
In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
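The architecture described in the abstract (pre-trained modality encoders, a trainable projector, and a pre-trained language model) can be illustrated with a minimal data-flow sketch. This is not the authors' implementation. The dimensions, the single linear layer used as the projector, the `LinearProjector` and `build_llm_input` names, and the choice to prepend modality tokens to the text tokens are all assumptions made here for illustration. The encoder and language model are replaced by random stand-in vectors so that only the projection step is shown.

```python
import random

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
ENCODER_DIM = 8   # width of the frozen modality encoder's output features
LLM_DIM = 16      # width of the language model's token embeddings


class LinearProjector:
    """Trainable map from encoder feature space to LLM embedding space.

    A single linear layer is an assumption made here for brevity.
    """

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # weight[i][j] maps input feature i to output feature j
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)]
                       for _ in range(in_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        # features: list of vectors, each of length in_dim
        out_dim = len(self.bias)
        return [
            [sum(x * self.weight[i][j] for i, x in enumerate(vec)) + self.bias[j]
             for j in range(out_dim)]
            for vec in features
        ]


def build_llm_input(text_embeddings, modality_features, projector):
    """Prepend projected modality tokens to the text token embeddings.

    The ordering (modality tokens first) is an assumption of this sketch.
    """
    return projector(modality_features) + text_embeddings


# Random stand-ins for a frozen image encoder's output (4 feature vectors)
# and for the LLM's embeddings of a 6-token prompt.
rng = random.Random(1)
image_features = [[rng.random() for _ in range(ENCODER_DIM)] for _ in range(4)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(6)]

projector = LinearProjector(ENCODER_DIM, LLM_DIM)
llm_input = build_llm_input(text_embeddings, image_features, projector)
print(len(llm_input), len(llm_input[0]))  # prints "10 16"
```

In a setup of this kind, only the projector's weights would be updated during training while the encoders stay frozen, which is one plausible reason a small amount of multimodal data can suffice.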
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Advanced Text Analysis Techniques
