Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Jae Sung Park; Jack Hessel; Khyathi Raghavi Chandu; Paul Pu Liang; Ximing Lu; Peter West; Youngjae Yu; Qiuyuan Huang; Jianfeng Gao; Ali Farhadi; Yejin Choi

arXiv:2312.04837 · cs.AI · December 13, 2023 · 1 cites

TL;DR

This paper introduces a method to enhance vision-language models with localized reasoning capabilities by distilling knowledge from large language models, enabling precise referencing of image regions for improved commonsense understanding.

Contribution

It presents a novel approach to train localized visual commonsense models that support reference-based inputs, improving zero-shot reasoning in vision-language tasks.

Findings

01. Localized models outperform the baseline in zero-shot reasoning

02. Training with localized commonsense knowledge improves model precision

03. Human evaluations favor the proposed distillation approach

Abstract

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects…
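The data-generation loop described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `Region`, `build_prompt`, `distill`, the prompt wording, and the stub `fake_llm` and `fake_critic` are all hypothetical names introduced for illustration. The abstract is cut off before it says what the critic selects, so the score threshold used here is purely an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    region_id: int     # index a user could "point to" (hypothetical field)
    description: str   # literal region description produced by a VL model


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Combine the global and local literal descriptions into one text prompt."""
    lines = [f"Image: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    # The instruction wording below is an assumption, not taken from the paper.
    lines.append("Write commonsense statements that refer to regions by [id].")
    return "\n".join(lines)


def distill(
    global_description: str,
    regions: List[Region],
    llm: Callable[[str], List[str]],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff; the source does not specify one
) -> List[str]:
    """Sample candidate statements from the LLM and keep those the critic accepts."""
    prompt = build_prompt(global_description, regions)
    candidates = llm(prompt)
    return [text for text in candidates if critic(text) >= threshold]


# Stand-ins for a real LLM and a trained critic, so the sketch runs offline.
def fake_llm(prompt: str) -> List[str]:
    return [
        "Why is [0] holding an umbrella? Because it is raining.",
        "Something vague.",
    ]


def fake_critic(text: str) -> float:
    # Toy rule: favor statements that actually reference a region id.
    return 0.9 if "[" in text else 0.1


kept = distill(
    "a rainy street",
    [Region(0, "a person holding an umbrella")],
    fake_llm,
    fake_critic,
)
print(kept)
```

In a real pipeline the two stubs would be replaced by calls to an actual LLM and to the separately trained critic; the filtered statements would then serve as training data for the localized model.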

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training