OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models

Lala Shakti Swarup Ray; Bo Zhou; Sungho Suh; Paul Lukowicz

arXiv:2501.00432 · cs.CV · January 3, 2025

TL;DR

This paper introduces OV-HHIR, an open vocabulary human interaction recognition framework that uses large language models to describe interactions in open-world scenarios, overcoming fixed-vocabulary limitations.

Contribution

It presents a novel open vocabulary recognition method leveraging large language models and creates a comprehensive dataset for human interaction understanding.

Findings

01. Outperforms traditional fixed-vocabulary systems
02. Effective in recognizing unseen interactions
03. Sets new benchmarks in open-world interaction recognition

Abstract

Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and…

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems