TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining
Qing Zong, Zhaowei Wang, Baixuan Xu, Tianshi Zheng, Haochen Shi, Weiqi Wang, Yangqiu Song, Ginny Y. Wong, Simon See

TL;DR
TILFA is a unified framework that combines text, image, and layout information to improve argument mining on datasets whose images contain both visual elements and optical characters.
Contribution
The paper introduces TILFA, a framework that fuses text, image, and layout information for argument mining. On the text-and-image dataset released for the shared task at the 10th Workshop on Argument Mining, it significantly outperforms existing baselines.
Findings
TILFA outperforms existing baselines in argumentative stance classification.
The framework successfully detects optical characters and recognizes layout details.
The authors' team, KnowComp, placed 1st on the leaderboard of the shared task's Argumentative Stance Classification subtask.
Abstract
A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets, which focus only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset that includes both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels not only at understanding text but also at detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, 1st place on the leaderboard of the Argumentative Stance Classification subtask in this shared task.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
Methods: Attention Model
