TL;DR
This paper introduces a novel approach for document-based zero-shot learning that extracts and aligns multi-view semantic concepts from documents and images, improving performance by focusing on partial rather than full concept alignment.
Contribution
The work proposes a semantic decomposition network with specialized loss functions to enable partial alignment of visual and textual semantic concepts, addressing redundancy and diversity issues.
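To make the idea of partial alignment concrete, here is a minimal sketch that contrasts it with full alignment. It is an illustration under stated assumptions, not the paper's implementation: the function names, the cosine-similarity scoring, and the `top_k` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def similarity_matrix(visual, textual):
    # Cosine similarity between every visual and every textual concept.
    # visual: (Kv, d), textual: (Kt, d) -> result: (Kv, Kt)
    return l2_normalize(visual) @ l2_normalize(textual).T

def full_alignment_score(visual, textual):
    # Baseline: treat every visual-textual pair as if it should match.
    return float(similarity_matrix(visual, textual).mean())

def partial_alignment_score(visual, textual, top_k=2):
    # Assumed scoring rule: keep only the best textual match for each
    # visual concept, then average the top_k strongest of those matches.
    best_per_visual = similarity_matrix(visual, textual).max(axis=1)
    strongest = np.sort(best_per_visual)[-top_k:]
    return float(strongest.mean())

# Toy data: two concepts are shared, the third has no counterpart.
visual = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
textual = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]])

partial = partial_alignment_score(visual, textual, top_k=2)
full = full_alignment_score(visual, textual)
```

In this toy case the partial score stays high because the unmatched concept is ignored, while the full score is pulled down by pairs that were never meant to correspond.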
Findings
Outperforms state-of-the-art methods on three benchmarks
Learned partial associations are interpretable
Effective semantic concept extraction from documents and images
Abstract
Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among…
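The abstract mentions a diversity loss that enforces orthogonality among embeddings. The sketch below shows one common way to penalize non-orthogonality, namely the mean squared off-diagonal cosine similarity. It is an assumption for illustration only; the paper's exact loss is not given in the excerpt above, and the function name `orthogonality_penalty` is hypothetical.

```python
import numpy as np

def orthogonality_penalty(embeddings, eps=1e-12):
    # embeddings: (K, d) array holding K semantic embeddings of dimension d.
    # Returns 0 when all embeddings are mutually orthogonal and approaches 1
    # when they all point in the same (or opposite) direction.
    k = embeddings.shape[0]
    if k < 2:
        return 0.0  # a single embedding has nothing to be redundant with
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / (norms + eps)
    gram = unit @ unit.T                    # pairwise cosine similarities
    off_diagonal = gram[~np.eye(k, dtype=bool)]
    return float(np.mean(off_diagonal ** 2))

diverse = orthogonality_penalty(np.eye(3))          # mutually orthogonal
redundant = orthogonality_penalty(np.ones((3, 4)))  # identical directions
```

Minimizing a term like this pushes the embeddings toward distinct directions, which is the stated purpose of reducing information redundancy.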
Taxonomy
Methods: ALIGN
