TL;DR
This paper introduces the novel task of Visual Text Correction, aiming to identify and replace inaccurate words in video descriptions by leveraging semantic relationships between videos and text.
Contribution
It proposes a deep network model for detecting and correcting inaccuracies in textual video descriptions, and introduces a method to automatically create a large dataset for this task.
Findings
The proposed model effectively detects and corrects inaccurate words.
The dataset construction approach enables large-scale training and evaluation.
Results demonstrate the model's strong performance on the VTC task.
Abstract
Videos, images, and sentences are mediums that can express the same semantics. One can imagine a picture by reading a sentence or can describe a scene with some words. However, even small changes in a sentence can cause a significant semantic inconsistency with the corresponding video/image. For example, by changing the verb of a sentence, the meaning may drastically change. There have been many efforts to encode a video/sentence and decode it as a sentence/video. In this research, we study a new scenario in which both the sentence and the video are given, but the sentence is inaccurate. A semantic inconsistency between the sentence and the video or between the words of a sentence can result in an inaccurate description. This paper introduces a new problem, called Visual Text Correction (VTC), i.e., finding and replacing an inaccurate word in the textual description of a video. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
