TL;DR
DIVE is a visual commonsense generation framework that improves the descriptiveness and diversity of generated inferences, outperforming existing models and reaching human-level performance on Visual Commonsense Graphs.
Contribution
DIVE introduces two methods, generic inference filtering and contrastive retrieval learning, to improve the diversity and descriptiveness of visual commonsense generation.
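To make the idea of generic inference filtering concrete, here is a minimal sketch. This summary does not describe the paper's actual filtering criterion, so the rule used here is an assumption: an inference counts as "generic" if the same text is attached to a large share of distinct images. The function name `filter_generic_inferences`, the `max_image_share` threshold, and the sample records are all hypothetical.

```python
from collections import Counter


def filter_generic_inferences(records, max_image_share=0.5):
    """Drop inferences that are attached to too many distinct images.

    records: list of (image_id, inference_text) pairs.
    max_image_share: hypothetical threshold; an inference seen on more than
    this fraction of all images is treated as generic and removed.
    This is an illustrative rule, not the paper's stated criterion.
    """
    all_images = {image_id for image_id, _ in records}

    # Count, for each normalized inference, how many distinct images carry it.
    seen_pairs = set()
    images_per_inference = Counter()
    for image_id, text in records:
        key = (image_id, text.strip().lower())
        if key not in seen_pairs:
            seen_pairs.add(key)
            images_per_inference[key[1]] += 1

    limit = max_image_share * len(all_images)
    return [
        (image_id, text)
        for image_id, text in records
        if images_per_inference[text.strip().lower()] <= limit
    ]


# Hypothetical toy data: one inference repeats across every image.
records = [
    ("img1", "talk to the other person"),
    ("img2", "talk to the other person"),
    ("img3", "Talk to the other person"),
    ("img1", "order a drink from the bartender"),
    ("img2", "catch the frisbee before it lands"),
    ("img3", "hand the violin back to the teacher"),
]
kept = filter_generic_inferences(records)
```

With this toy data, the repeated inference is removed and the three image-specific inferences remain. A real pipeline would likely need a softer notion of "same inference" than exact string matching after lowercasing.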
Findings
Outperforms state-of-the-art models in descriptiveness and diversity.
Achieves human-level performance on Visual Commonsense Graphs.
Human evaluations show close alignment with human judgments.
Abstract
Towards human-level visual understanding, visual commonsense generation has been introduced to generate commonsense inferences beyond images. However, current research on visual commonsense generation has overlooked an important human cognitive ability: generating descriptive and diverse inferences. In this work, we propose a novel visual commonsense generation framework, called DIVE, which aims to improve the descriptiveness and diversity of generated inferences. DIVE involves two methods, generic inference filtering and contrastive retrieval learning, which address the limitations of existing visual commonsense resources and training objectives. Experimental results verify that DIVE outperforms state-of-the-art models for visual commonsense generation in terms of both descriptiveness and diversity, while showing a superior quality in generating unique and novel inferences. Notably,…
