COLLAGE: Collaborative Human-Agent Interaction Generation using Hierarchical Latent Diffusion and Language Models

Divyanshu Daiya; Damon Conover; Aniket Bera

arXiv:2409.20502·cs.LG·October 1, 2024

TL;DR

COLLAGE is a framework that combines large language models with a hierarchical VQ-VAE and a latent diffusion model to generate realistic, diverse, and controllable collaborative human-object-human interactions. It relies on the knowledge and reasoning abilities of LLMs to compensate for the lack of rich datasets in this domain.

Contribution

The paper introduces a hierarchical VQ-VAE architecture paired with a latent diffusion model. LLM-generated motion planning cues guide the denoising process, enabling multi-resolution, prompt-specific interaction synthesis.
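To make the idea of a multi-level discrete representation concrete, the following is a minimal NumPy sketch of two-level vector quantization. It is not the paper's architecture: the residual scheme, the temporal pooling factor, the codebook sizes, and the names `quantize` and `pool_time` are all assumptions chosen for illustration, and the codebooks are random rather than learned.

```python
import numpy as np

def quantize(x, codebook):
    """Map each row of x to its nearest codebook entry (squared L2 distance)."""
    # Pairwise squared distances, shape (num_vectors, num_codes)
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def pool_time(x, factor):
    """Average consecutive frames to obtain a coarser temporal resolution."""
    t = (x.shape[0] // factor) * factor
    return x[:t].reshape(-1, factor, x.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))           # hypothetical: 16 latent frames, 8 dims
coarse_codebook = rng.normal(size=(32, 8))  # hypothetical codebook sizes
fine_codebook = rng.normal(size=(64, 8))

# Coarse level: quantize a temporally downsampled version of the sequence.
coarse_idx, coarse_q = quantize(pool_time(frames, 4), coarse_codebook)

# Fine level (assumed residual design): quantize only what the upsampled
# coarse level does not already explain, so the levels do not duplicate content.
coarse_up = np.repeat(coarse_q, 4, axis=0)
fine_idx, fine_q = quantize(frames - coarse_up, fine_codebook)

reconstruction = coarse_up + fine_q
```

In a trained VQ-VAE the codebooks and the encoder producing `frames` would be learned jointly; here they are random so the block stays self-contained.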

Findings

01

Outperforms state-of-the-art methods on CORE-4D and InterHuman datasets.

02

Generates realistic and diverse collaborative interactions.

03

Provides greater control and diversity in motion generation.

Abstract

We propose a novel framework, COLLAGE, for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets…
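As a rough illustration of what operating a diffusion model in latent space involves, the sketch below implements the standard DDPM forward-noising step and its inversion on a latent vector. It is a generic sketch, not the authors' model: the linear noise schedule, the step count, and the function names are assumptions. In a conditioned model such as the one described above, the noise estimate would come from a learned network that also receives the conditioning cues; here the true noise stands in for that estimate.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    return betas, alphas_cumprod

def add_noise(z0, t, alphas_cumprod, rng):
    """Sample z_t from q(z_t | z_0) in closed form; also return the noise used."""
    eps = rng.normal(size=z0.shape)
    a = alphas_cumprod[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps

def predict_z0(zt, eps_hat, t, alphas_cumprod):
    """Recover the clean-latent estimate implied by a noise estimate eps_hat."""
    a = alphas_cumprod[t]
    return (zt - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)

rng = np.random.default_rng(0)
_, alphas_cumprod = make_schedule(num_steps=100)

z0 = rng.normal(size=(16, 8))   # hypothetical latent motion sequence
zt, eps = add_noise(z0, t=50, alphas_cumprod=alphas_cumprod, rng=rng)

# With a perfect noise estimate the clean latent is recovered exactly;
# a trained, cue-conditioned denoiser would supply eps_hat instead.
z0_hat = predict_z0(zt, eps, t=50, alphas_cumprod=alphas_cumprod)
```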


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Diffusion · VQ-VAE