Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond

Soyeon Caren Han; Feiqi Cao; Josiah Poon; Roberto Navigli

arXiv:2410.05608·cs.CL·October 10, 2024

TL;DR

This tutorial reviews recent developments in multimodal large models that integrate diverse data types such as vision, language, audio, and sensor data, highlighting the datasets, models, and tuning strategies used in practical applications.

Contribution

It provides a comprehensive overview of multimodal pretrained models, datasets, and instruction tuning techniques, including hands-on labs for real-world multimodal AI applications.

Findings

1. Advancements in multimodal datasets and pretrained models.

2. Effective instruction tuning strategies for multimodal tasks.

3. Practical demonstrations of multimodal applications like visual storytelling.

Abstract

This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the foundational concepts of multimodality, the evolution of multimodal research, and the key technical challenges addressed by these models. We will cover the latest multimodal datasets and pretrained models, including those beyond vision and language. Additionally, the tutorial will delve into the intricacies of multimodal large models and instruction tuning strategies to optimise performance for specific tasks. Hands-on laboratories will offer practical experience with state-of-the-art multimodal models, demonstrating real-world applications like visual storytelling and visual question answering. This tutorial aims to equip researchers, practitioners, and…

Code & Models

Repositories

adlnlp/MultimodalLLM (Official)

Taxonomy

Topics: Speech and dialogue systems