When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang, Zhi Gao, Zilong Zheng, Lei Liu, Bin Li, Qing Li

TL;DR
This paper introduces a new benchmark for evaluating large multimodal models' ability to incorporate evolving knowledge and proposes methods to improve knowledge injection and retention.
Contribution
It presents MMEVOKE, a benchmark for multimodal evolving knowledge, and explores new techniques for knowledge augmentation and retention in LMMs.
Findings
Knowledge-aware augmentation improves injection performance.
Data Replay and MoE methods reduce capability degradation.
Existing methods face challenges in dynamic knowledge injection.
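As a rough illustration of the Data Replay finding above, the sketch below mixes a small share of earlier training samples into the new-knowledge training set, so the model keeps seeing old data while it is fine-tuned on new data. This is a minimal sketch, not the paper's implementation: the function name `mix_with_replay`, the `replay_ratio` parameter, and the default of 0.1 are assumptions made for illustration, and this page does not describe the replay configuration the authors actually used.

```python
import random


def mix_with_replay(new_samples, replay_pool, replay_ratio=0.1, seed=0):
    """Return new samples plus a random subset of old samples, shuffled.

    Hypothetical helper for illustration only. `replay_ratio` is the number
    of replayed old samples relative to the number of new samples.
    """
    if replay_ratio < 0:
        raise ValueError("replay_ratio must be non-negative")

    rng = random.Random(seed)  # local RNG so the mix is reproducible
    pool = list(replay_pool)

    # Never request more replayed samples than the pool contains.
    n_replay = min(len(pool), round(len(new_samples) * replay_ratio))
    replayed = rng.sample(pool, n_replay)

    mixed = list(new_samples) + replayed
    rng.shuffle(mixed)  # interleave old and new samples
    return mixed


# Toy data: 10 "new knowledge" samples and 50 earlier samples.
new_data = [("new", i) for i in range(10)]
old_data = [("old", i) for i in range(50)]
train_set = mix_with_replay(new_data, old_data, replay_ratio=0.2)
```

With `replay_ratio=0.2` and 10 new samples, two old samples are drawn, so `train_set` has 12 items. Choosing the ratio is a trade-off: a larger share of replayed data protects existing capabilities but dilutes the new knowledge in each epoch.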
Abstract
Large Multimodal Models (LMMs) store vast amounts of pretrained knowledge but struggle to remain aligned with real-world updates, making it difficult to avoid capability degradation when acquiring evolving knowledge. Furthermore, most current work focuses on exploring static textual knowledge injection, neglecting dynamic multimodal evolving knowledge injection, leaving the potential of LMMs for multimodal knowledge injection as an open question. To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection. MMEVOKE contains 9,422 samples spanning 159 subtypes. Then, based on extensive experiments with MMEVOKE, we reveal challenges such as poor injection performance and capability degradation in existing knowledge injection methods through knowledge injection tests and general capability tests.…
Peer Reviews
Decision: ICLR 2026 Poster
1. The proposed MMEVOKE benchmark serves as the first evaluation dataset designed to measure the evolving knowledge injection capabilities of Large Multimodal Models. 2. This work systematically evaluates a wide range of approaches for their effectiveness in knowledge injection, including Supervised Fine-Tuning, Retrieval-Augmented Generation (RAG), Web Search Engines, and Sufficient Context Provision. The results indicate that knowledge augmentation substantially enhances model comprehension an
1. It appears that the proposed concept of evolving knowledge injection is essentially a continual learning problem. The connection between evolving knowledge injection and continual learning remains unclear. A more detailed discussion of this relationship should be included in the Introduction or Related Work sections to better position this study within the broader research context. 2. In the field of continual learning, several recent studies have explored the potential of large multimodal mo
1. In this paper, the authors propose MMEVOKE, a benchmark for multimodal evolving knowledge that serves as an evaluation dataset to measure LMMs’ evolving knowledge injection capabilities. 2. The authors conduct knowledge injection tests with Supervised Fine-Tuning, Retrieval-Augmented Generation, Web Search Engine, and Sufficient Context on MMEVOKE. Based on the experimental results, the authors find that existing methods exhibit poor knowledge adaptation performance and the performance of LMMs remains
1. In my view, the authors overlook one type of method, knowledge editing, such as ROME (Locating and Editing Factual Associations in GPT), AnyEdit (AnyEdit: Edit Any Knowledge Encoded in Language Models), and MEMIT (Mass-Editing Memory in a Transformer). 2. In benchmark construction, the authors compare offline versions of Wikipedia at different time points to identify new entries. But this approach cannot guarantee that these entities will be unfamiliar to LMMs, since LMMs are pre-traine
1. MMEVOKE leverages real-world data to benchmark LMMs’ abilities to adapt to evolving knowledge. The benchmark is comprehensive, containing 9,422 knowledge samples and covering 159 subfields. 2. The paper conducts a comprehensive evaluation (12 benchmarks) spanning training- and retrieval-based methods, e.g., SFT (Full/LoRA) to RAG, commercial agents, and sufficient context, yielding useful cross-method comparisons. 3. The analysis of knowledge-aware vs. knowledge-agnostic augmentation and Replay/MoEL
1. It is unclear why injecting images of new knowledge is necessary, since the text often suffices to answer questions once the model recognizes the person and recalls textual knowledge. In both examples of “Geoffrey Hinton + Nobel Prize” and “Donald Trump + Assassination”, the text provides all the information needed to answer the question. Or, in the case of news for “Region” or “Business” categories, the images often don’t provide closely related information to the text. 2. The work does not
Taxonomy
Topics: Semantic Web and Ontologies · Speech and dialogue systems
