Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

TL;DR
This paper introduces a method to enhance vision-language models with localized reasoning capabilities by distilling knowledge from large language models, enabling precise referencing of image regions for improved commonsense understanding.
Contribution
It presents a novel approach to train localized visual commonsense models that support reference-based inputs, improving zero-shot reasoning in vision-language tasks.
Findings
Localized models outperform the baseline in zero-shot reasoning
Training with localized commonsense improves model precision
Human evaluations favor the proposed distillation approach
Abstract
Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects…
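The data-generation step described in the abstract can be pictured with a minimal sketch. Everything here is a hypothetical illustration, not the authors' code: the function names build_prompt and filter_with_critic, the prompt wording, the [id] region notation, the score threshold, and the toy critic are all assumptions. The sketch only shows the general shape: combine a global description with per-region descriptions into one LLM prompt, then keep generated candidates that a critic scores highly.

```python
from typing import Callable, Dict, List


def build_prompt(global_description: str, region_descriptions: Dict[int, str]) -> str:
    """Combine a global image description with per-region descriptions (hypothetical format)."""
    lines = [f"Image: {global_description}"]
    # Sort region ids so the prompt is deterministic.
    for region_id in sorted(region_descriptions):
        lines.append(f"Region [{region_id}]: {region_descriptions[region_id]}")
    lines.append("Write a commonsense question and answer that refers to regions by their [id].")
    return "\n".join(lines)


def filter_with_critic(
    candidates: List[str],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only candidates whose critic score reaches the threshold."""
    return [c for c in candidates if critic(c) >= threshold]


def toy_critic(text: str) -> float:
    """Stand-in for a trained critic: accepts text that references a region id."""
    return 1.0 if "[" in text and "]" in text else 0.0


prompt = build_prompt(
    "Two people at a kitchen table.",
    {1: "a man holding a mug", 0: "a woman reading"},
)
candidates = [
    "Why is [1] holding a mug? He is likely drinking coffee.",
    "The weather is nice.",
]
kept = filter_with_critic(candidates, toy_critic)
```

In a real pipeline the candidates would come from sampling an LLM with the prompt, and the critic would be a trained model rather than a string check; both are replaced by stand-ins here so the sketch runs on its own.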
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
