Video In Sentences Out
Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin

TL;DR
This paper introduces a system that generates detailed sentential descriptions of videos, capturing actions, participants, spatial relations, and event characteristics through an integrated event recognition approach.
Contribution
It presents a novel method combining object tracking, role assignment, and posture analysis to produce comprehensive natural language descriptions of video content.
Findings
Recovers object tracks, track-to-role assignments, and changing body posture from video.
Renders the action class as a verb, participants as noun phrases, and spatial relations as prepositional phrases.
Integrates these recognition components to produce detailed sentential descriptions of video.
Abstract
We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
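The mapping from recognized event structure to linguistic constituents described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Participant` and `Event` classes, their field names, the `render` function, and the example sentence are all hypothetical, and the sketch assumes that event recognition has already produced the verb, the role assignments, and the modifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Participant:
    """A tracked object, rendered as a noun phrase (hypothetical structure)."""
    noun: str
    adjectives: List[str] = field(default_factory=list)  # object properties

    def noun_phrase(self) -> str:
        # Properties become adjectival modifiers inside the noun phrase.
        return " ".join(["the", *self.adjectives, self.noun])


@dataclass
class Event:
    """Assumed output of event recognition; all field names are illustrative."""
    verb: str                                   # action class, already inflected
    agent: Participant                          # "who"
    patient: Optional[Participant] = None       # "to whom"
    adverb: Optional[str] = None                # "how" (adverbial modifier)
    # Spatial relation as (preposition, reference object) -> prepositional phrase.
    spatial: Optional[Tuple[str, Participant]] = None
    adjunct_pp: Optional[str] = None            # event characteristic as a PP adjunct


def render(event: Event) -> str:
    """Assemble a sentence from the constituents in a fixed template order."""
    parts = [event.agent.noun_phrase()]
    if event.adverb:
        parts.append(event.adverb)
    parts.append(event.verb)
    if event.patient:
        parts.append(event.patient.noun_phrase())
    if event.spatial:
        preposition, reference = event.spatial
        parts.append(f"{preposition} {reference.noun_phrase()}")
    if event.adjunct_pp:
        parts.append(event.adjunct_pp)
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."


example = Event(
    verb="picked up",
    agent=Participant("person", ["tall"]),
    patient=Participant("box", ["red"]),
    adverb="slowly",
    spatial=("to the left of", Participant("chair")),
    adjunct_pp="from the right",
)
print(render(example))
```

A fixed template keeps the sketch short; a real system would also need verb inflection, determiner choice, and rules for which optional constituents to include.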
