Merlin: Empowering Multimodal LLMs with Foresight Minds
En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao

TL;DR
This paper introduces Merlin, a multimodal large language model enhanced with foresight capabilities through novel training methods, enabling better future prediction and reasoning about multiple objects and actions.
Contribution
The paper proposes Foresight Pre-Training and Foresight Instruction-Tuning to incorporate future modeling into MLLMs, creating a unified model with advanced foresight abilities.
Findings
Merlin demonstrates strong performance in future reasoning tasks.
Merlin effectively analyzes multi-image inputs for potential future actions.
Foresight training methods improve MLLMs' predictive capabilities.
Abstract
Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term foresight minds. However, this capability remains largely underexplored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
