Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

Arun Mallya; Svetlana Lazebnik

arXiv:1604.04808 · cs.CV · July 29, 2016 · 2 cites

TL;DR

This paper introduces deep convolutional models that use context to recognize human actions and person-object interactions in still images, and shows that features from these models improve accuracy on multiple-choice Visual Madlibs questions about person activities and person-object relationships.

Contribution

The work presents novel deep models for recognizing human activities and person-object interactions, and shows that their features transfer to VQA, improving over generic ImageNet features on the related question types.

Findings

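01. Achieved state-of-the-art results on two recent human activity datasets, each with hundreds of labels.
02. Improved VQA accuracy on Visual Madlibs person-activity and person-object questions by using specialized features instead of generic ImageNet features.
03. Handled unbalanced training data with a weighted loss.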
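The weighted loss for unbalanced data is not specified in detail here. The following is a minimal sketch, assuming a per-label binary cross-entropy in which the terms for positive labels are scaled by a factor `pos_weight`. The function names and the negatives-to-positives heuristic for picking the weight are hypothetical illustrations, not the authors' settings.

```python
import math
from typing import Sequence


def balanced_pos_weight(labels: Sequence[int]) -> float:
    """Hypothetical heuristic: weight positives by the negative-to-positive ratio."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    if num_pos == 0:
        return 1.0  # no positives seen; fall back to an unweighted loss
    return num_neg / num_pos


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with positive-label terms scaled by pos_weight.

    probs[i] is the predicted probability that label i is present;
    labels[i] is 1 if it is present and 0 otherwise.
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        if y == 1:
            total -= pos_weight * math.log(p)  # a missed positive costs pos_weight times more
        else:
            total -= math.log(1.0 - p)
    return total / len(probs)


# Usage: one positive among four labels.
labels = [1, 0, 0, 0]
probs = [0.3, 0.1, 0.2, 0.1]
w = balanced_pos_weight(labels)  # 3.0
loss = weighted_bce(probs, labels, pos_weight=w)
```

Setting `pos_weight` to 1 recovers ordinary binary cross-entropy. Larger values keep the model from minimizing the loss by predicting "absent" for labels that are rarely positive, which is the risk when a dataset has hundreds of labels.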

Abstract

This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
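Multiple instance learning is needed here because activity labels are given for the image, not for each person in it. The sketch below is a minimal illustration. It assumes each detected person instance receives a score for every label and that the image-level score is the maximum over instances, which is a standard MIL aggregation. The function name and the list-of-lists layout are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def mil_max_pool(instance_scores: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Aggregate per-person label scores into image-level scores by max pooling.

    instance_scores[i][k] is the score of person instance i for label k.
    Returns (image_scores, winners): image_scores[k] is the max over instances
    and winners[k] is the index of the instance that achieved it.
    """
    if not instance_scores:
        raise ValueError("need at least one person instance")
    num_labels = len(instance_scores[0])
    if any(len(row) != num_labels for row in instance_scores):
        raise ValueError("every instance must score the same set of labels")
    image_scores: List[float] = []
    winners: List[int] = []
    for k in range(num_labels):
        # The image is positive for label k if at least one person is; max encodes that.
        best = max(range(len(instance_scores)), key=lambda i: instance_scores[i][k])
        winners.append(best)
        image_scores.append(instance_scores[best][k])
    return image_scores, winners


# Usage: two people in one image, three labels.
scores, winners = mil_max_pool([[0.9, 0.1, 0.3], [0.2, 0.7, 0.4]])
# scores  -> [0.9, 0.7, 0.4]
# winners -> [0, 1, 1]
```

With max pooling, only the highest-scoring person determines the image score for a given label. During training, only that instance therefore receives gradient for that label, and `winners` records which instance it was.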

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition