Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding
Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang

TL;DR
This paper introduces a new task called Human-centric Commonsense Grounding, along with a large dataset, HumanCog, to evaluate models' ability to ground individuals based on context and commonsense reasoning in images.
Contribution
The paper proposes a novel human-centric grounding task, creates the HumanCog dataset with 130k annotations, and develops a baseline method that surpasses previous models.
Findings
Rich visual commonsense is crucial for accurate grounding.
Multi-modal integration significantly improves performance.
The baseline outperforms existing pre-trained models.
Abstract
From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states or intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests models' ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types…
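As a rough illustration of the task's input and output, the sketch below assumes each example pairs one description with a list of candidate person boxes, exactly one of which is the target. The abstract does not specify the data format or the metric, so the names `GroundingExample`, `ground`, and `accuracy`, the box layout, and the scoring function are all hypothetical and are not taken from the HumanCog release or the paper's baseline.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingExample:
    """One description paired with candidate people (assumed format)."""
    description: str          # e.g. "person who needs healing"
    person_boxes: List[Box]   # one box per person in the image
    target_index: int         # index of the person the description refers to


def ground(example: GroundingExample,
           score_fn: Callable[[str, Box], float]) -> int:
    """Return the index of the highest-scoring candidate box."""
    scores = [score_fn(example.description, box) for box in example.person_boxes]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[GroundingExample],
             score_fn: Callable[[str, Box], float]) -> float:
    """Fraction of examples whose predicted person matches the target."""
    if not examples:
        return 0.0
    correct = sum(ground(ex, score_fn) == ex.target_index for ex in examples)
    return correct / len(examples)


def box_area(description: str, box: Box) -> float:
    """Toy stand-in scorer that ignores the text and prefers larger boxes."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)
```

A real model would replace `box_area` with a function that scores how well a person region matches the description. The toy scorer ignores the text entirely and is there only to make the sketch runnable.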
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications
