VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades

TL;DR
VLCDoC is a novel vision-language contrastive pre-training model that learns cross-modal representations for document classification by focusing on high-level interactions and semantic alignment across modalities.
Contribution
It introduces a new approach that leverages intra- and inter-modality attention and contrastive learning without merging features into a joint space, enhancing document classification.
Findings
Effective on both low-scale and large-scale datasets
Outperforms existing methods in document classification tasks
Demonstrates generality across diverse datasets
Abstract
Multimodal learning from document data has achieved great success lately, as it allows semantically meaningful features to be pre-trained as a prior for a learnable downstream task. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the joint representation space. Extensive…
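The alignment objective described in the abstract (pulling positive pairs together while pushing negatives apart) can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, the use of cosine similarity, and the choice to treat the vision and language embeddings of the same document (same row index) as the positive pair are all hypothetical. The abstract does not specify how positives are selected or how the per-task losses are combined.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def contrastive_loss(anchors, candidates, temperature=0.1):
    """InfoNCE-style loss (assumed form): row i of `anchors` is positive with
    row i of `candidates`; every other row in the batch acts as a negative."""
    a = l2_normalize(anchors)
    c = l2_normalize(candidates)
    logits = a @ c.T / temperature                      # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # positives lie on the diagonal

def cross_modal_loss(vision, language, temperature=0.1):
    """Symmetric inter-modality term: vision-to-language plus language-to-vision."""
    return 0.5 * (contrastive_loss(vision, language, temperature)
                  + contrastive_loss(language, vision, temperature))

# Toy data standing in for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
vision_emb = rng.normal(size=(8, 16))
language_emb = vision_emb + 0.01 * rng.normal(size=(8, 16))  # well-aligned pairs
aligned = cross_modal_loss(vision_emb, language_emb)
mismatched = cross_modal_loss(vision_emb, language_emb[::-1])  # pairs scrambled
```

In this toy setup the aligned batch yields a lower loss than the scrambled one. An intra-modality term could reuse `contrastive_loss` with both arguments drawn from the same modality, though how the paper defines those pairs is not stated here.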
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Text and Document Classification Technologies
