Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

Haoxuan You; Rui Sun; Zhecan Wang; Kai-Wei Chang; Shih-Fu Chang

arXiv:2212.06971 · cs.CV · December 15, 2022

TL;DR

This paper introduces a new task called Human-centric Commonsense Grounding, along with a large dataset, HumanCog, to evaluate models' ability to ground individuals based on context and commonsense reasoning in images.

Contribution

The paper proposes a novel human-centric grounding task, creates the HumanCog dataset with 130k annotations, and develops a baseline method that surpasses previous models.

Findings

01. Rich visual commonsense is crucial for accurate grounding.

02. Multi-modal integration significantly improves performance.

03. The baseline outperforms existing pre-trained models.

Abstract

Given a visual scene containing multiple people, humans are able to distinguish each individual from context descriptions about what happened before, their mental/physical states, their intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues, and finally ground the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests models' ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types…
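As a rough illustration of the task setup described in the abstract, the sketch below models one grounding example as a set of candidate person boxes, a commonsense description, and the index of the person it refers to. Every name here (`GroundingExample`, `predict`, `accuracy`, `largest_box_scorer`) is hypothetical, the single-target assumption and the accuracy metric are assumptions rather than details taken from the paper, and the scorer is a deliberately naive stand-in that uses no commonsense at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingExample:
    """One example: pick the person that the description refers to."""
    image_id: str
    person_boxes: List[Box]   # candidate people in the image
    description: str          # e.g. "person who needs healing"
    target_index: int         # assumed: exactly one correct person


# A scorer returns one score per candidate person (higher = better match).
Scorer = Callable[[GroundingExample], Sequence[float]]


def predict(example: GroundingExample, scorer: Scorer) -> int:
    """Return the index of the highest-scoring candidate person."""
    scores = scorer(example)
    if len(scores) != len(example.person_boxes):
        raise ValueError("scorer must return one score per person box")
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(examples: Sequence[GroundingExample], scorer: Scorer) -> float:
    """Fraction of examples where the predicted person is the target."""
    if not examples:
        return 0.0
    correct = sum(predict(ex, scorer) == ex.target_index for ex in examples)
    return correct / len(examples)


def largest_box_scorer(example: GroundingExample) -> List[float]:
    """Naive stand-in: score each person by box area, ignoring the text."""
    return [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in example.person_boxes]


examples = [
    GroundingExample("img1", [(0, 0, 10, 10), (0, 0, 20, 20)],
                     "person who needs healing", target_index=1),
    GroundingExample("img2", [(0, 0, 30, 30), (0, 0, 5, 5)],
                     "person who needs healing", target_index=1),
]
print(accuracy(examples, largest_box_scorer))
```

A real model would replace `largest_box_scorer` with something that reads both the image regions and the description; the text-blind stand-in is only there to make the input/output shape of the task concrete.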

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications