RoadTones: Tone Controllable Text Generation from Road Event Videos

Chirag Parikh; Siddhi Pravin Lipare; Ravi Kiran Sarvadevabhatla

arXiv:2605.21411·cs.CV·May 21, 2026

TL;DR

This paper introduces RoadTones, a dataset, model, and evaluation suite for tone-controllable captioning of road event videos, so that captions can be controlled for tone, urgency, and style in addition to being factually accurate.

Contribution

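It presents the RoadTones-51K dataset, built with a human-validated data generation pipeline; RoadTones-VL-CoT, a controllable video-to-text model that generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability; and RoadTones-Eval, an evaluation suite that jointly measures factual consistency and tone adherence.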

Findings

01 RoadTones-51K dataset with diverse tonal annotations

02 RoadTones-VL-CoT model achieves tone control and interpretability

03 User study confirms improved caption quality and tone adherence

Abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In…
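The abstract says RoadTones-Eval jointly measures factual consistency and tone adherence, but this excerpt does not describe the actual metrics. As a rough illustration of what a joint measure could look like, the sketch below combines two per-caption scores with a harmonic mean. The `CaptionScores` class, the `joint_score` function, the [0, 1] score range, and the choice of harmonic mean are all assumptions made for illustration, not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionScores:
    """Hypothetical per-caption scores, both assumed to lie in [0, 1]."""
    factual_consistency: float
    tone_adherence: float


def joint_score(scores: CaptionScores) -> float:
    """Harmonic mean of the two scores.

    This is an illustrative choice, not the metric used by RoadTones-Eval.
    """
    f, t = scores.factual_consistency, scores.tone_adherence
    if not (0.0 <= f <= 1.0 and 0.0 <= t <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if f + t == 0.0:
        # Both scores are zero; avoid dividing by zero.
        return 0.0
    return 2.0 * f * t / (f + t)
```

A harmonic mean is one way to ensure that a caption cannot score well by being on-tone but factually wrong, or accurate but off-tone: if either score is zero, the joint score is zero.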

Code & Models

Datasets

siddhi-lipare/RoadTones
dataset · 2.0k downloads
