Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos

Benjamin Reichman; Constantin Patsch; Jack Truxal; Atishay Jain; Larry Heck

arXiv:2506.09953·cs.CV·June 12, 2025

TL;DR

This paper introduces the OKCV dataset, a collection of 2,017 videos and 5,986 human-annotated dialogues that challenges models to recognize visual details over time and incorporate external knowledge for conversational video understanding.

Contribution

The paper presents a new dataset for visual dialogue over videos with external knowledge, enabling research on temporally grounded visual understanding and knowledge integration.

Findings

01 Baseline models show limited performance, indicating the task's difficulty.

02 The dataset highlights the need for models to combine visual recognition with external knowledge.

03 Future research directions include developing models that better integrate dialogue context and external knowledge.

Abstract

In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only…
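To put the counts in the abstract in proportion, the following is a minimal sketch that derives per-video and per-dialogue averages from them. The three counts come from the abstract; the function name `average_ratios` and the dictionary keys are illustrative assumptions, and the resulting averages are derived here rather than reported by the authors.

```python
# Counts quoted in the abstract.
NUM_VIDEOS = 2_017
NUM_DIALOGUES = 5_986
NUM_TURNS = 40_954


def average_ratios(videos: int, dialogues: int, turns: int) -> dict[str, float]:
    """Return simple per-unit averages derived from the raw counts.

    Hypothetical helper for illustration; not part of the OKCV release.
    """
    return {
        "dialogues_per_video": dialogues / videos,
        "turns_per_dialogue": turns / dialogues,
        "turns_per_video": turns / videos,
    }


stats = average_ratios(NUM_VIDEOS, NUM_DIALOGUES, NUM_TURNS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these counts, each video carries roughly three dialogues and each dialogue roughly seven turns on average. How a "turn" is counted (for example, whether a question and its answer are separate turns) is not specified in the text above, so the per-dialogue figure should be read with that caveat.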

Code & Models

Repositories

c-patsch/okcv (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Visual Attention and Saliency Detection