Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

Jiang Liu; Ge Qiu; Hao Fei; Dongdong Xie; Jinbo Li; Fei Li; Chong Teng; Donghong Ji

arXiv:2603.20781 · cs.CL · March 24, 2026

TL;DR

This paper introduces Code-MIE, a novel framework that formalizes multimodal information extraction as code understanding and generation, leveraging structured templates, scene graphs, and entity attributes to improve performance.

Contribution

The paper presents a unified code-style approach for multimodal information extraction, incorporating entity attributes and scene graphs, and formalizing input/output as Python code templates.
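As a rough illustration of what a code-style formulation could look like, the sketch below renders a text, its scene-graph triples, and entity attributes as the body of a Python function, then parses a code-style completion back into structured records. It is not the paper's actual template: `Sample`, `build_prompt`, `parse_completion`, the field names, and the `entities.append(...)` output convention are all hypothetical choices made for this example.

```python
import ast
import textwrap
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One multimodal example (hypothetical structure, not from the paper)."""
    text: str
    # Scene graph as (subject, predicate, object) triples describing the image.
    scene_graph: list = field(default_factory=list)
    # Entity attribute knowledge: entity mention -> list of short descriptors.
    entity_attributes: dict = field(default_factory=dict)


def build_prompt(sample: Sample, entity_types: list) -> str:
    """Render the input as an unfinished Python function for an LLM to complete."""
    lines = [
        "def extract_entities(text, scene_graph, entity_attributes):",
        f'    """Extract entities of types {entity_types} from the text."""',
        f"    text = {sample.text!r}",
        f"    scene_graph = {sample.scene_graph!r}",
        f"    entity_attributes = {sample.entity_attributes!r}",
        "    entities = []",
    ]
    return "\n".join(lines)


def parse_completion(completion: str) -> list:
    """Collect the literal argument of every `.append(...)` call in the completion."""
    tree = ast.parse(textwrap.dedent(completion))
    results = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "append"
            and node.args
        ):
            # literal_eval only accepts plain literals, so no model output is executed.
            results.append(ast.literal_eval(node.args[0]))
    return results
```

Because both the prompt and the expected completion are Python source, the output can be recovered with the standard `ast` module rather than ad hoc string matching, which is one practical appeal of structured templates over natural language ones.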

Findings

01 Achieves state-of-the-art results on multiple datasets.

02 Effectively integrates visual and textual information.

03 Outperforms six baseline models in experiments.

Abstract

With the rapid development of large language models (LLMs), a growing number of researchers have turned to LLM-based information extraction (IE). However, existing methods still leave room for improvement. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which is a poor match for IE tasks, whose outputs are mostly structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they have been explored only on text-only IE rather than multimodal IE. Moreover, these methods are more complex in design, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining