# A Vision-Based Subtitle Generator: Text Reconstruction via Subtle Vibrations from Videos

**Authors:** Yan Wang, Yingchong Wang, Xiuqi Zhang, Xiaoyu Ding

PMC · DOI: 10.3390/s26051407 · Sensors (Basel, Switzerland) · 2026-02-24

## TL;DR

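This paper introduces the Vision-based Subtitle Generator (VSG), a system that reconstructs text from high-speed video of the subtle vibrations that sound induces in everyday objects. It combines phase-based motion estimation with a Transformer built on a pretrained HuBERT encoder.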

## Contribution

Presented by the authors as the first generative approach to recover text directly from high-speed video of sound-induced object vibrations.

## Key findings

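- The Base and Large VSG-Transformer variants achieved character error rates of 13.7% and 12.5%, respectively, on text generated from vibrations induced in a bag of chips by AISHELL-1 audio samples.
- With signal upsampling, the VSG-Transformer maintains effective performance on videos with limited temporal sampling, indicating robustness to lower sampling rates.
- Phase-based motion estimation, which extracts thousands of pseudo-acoustic signals from a single video, and a pretrained HuBERT encoder reduce reliance on large volumes of video data.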
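
The character error rate (CER) quoted above is conventionally the character-level edit distance between the generated text and the reference transcript, divided by the reference length. The sketch below shows that standard definition only. The function names are hypothetical, and this summary does not describe the paper's own scoring procedure (for example, any text normalization applied before scoring).

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance counted in characters (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference character
                current[j - 1] + 1,                   # insert a hypothesis character
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of eight gives a CER of 0.125 (12.5%).
example_cer = character_error_rate("abcdefgh", "abxdefgh")
```

Under this definition, a CER of 12.5% corresponds to roughly one character-level error for every eight reference characters.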

## Abstract

Subtle vibrations induced in everyday objects by ambient sound, especially speech, carry rich acoustic cues that could be transformed into meaningful text, with potential implications for monitoring and security-related scenarios. This paper presents a Vision-based Subtitle Generator (VSG), the first attempt to recover text directly from high-speed videos of sound-induced object vibrations using a generative approach. To this end, VSG introduces a phase-based motion estimation (PME) technique that treats each pixel as an “independent microphone” and extracts thousands of pseudo-acoustic signals from a single video. Meanwhile, the pretrained Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) model serves as the encoder of the proposed VSG-Transformer architecture, effectively transferring large-scale acoustic representation knowledge to the vibration-to-text task. These strategies significantly reduce reliance on large volumes of video data. Experimentally, text was generated from vibrations induced in a bag of chips by AISHELL-1 audio samples. Two VSG-Transformer variants with different parameter scales (Base and Large) achieved character error rates of 13.7% and 12.5%, respectively, demonstrating the effectiveness of the proposed generative approach. Furthermore, experiments using signal upsampling techniques show that the VSG-Transformer maintains effective performance when operating on videos with limited temporal sampling, indicating robustness to lower sampling rates.
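
To make the "each pixel as an independent microphone" idea concrete, the following is a minimal sketch of phase-based motion estimation on a synthetic video. It is not the authors' implementation: the details of their PME are not given in this summary. As an assumption, the sketch uses a single complex Gabor filter along the horizontal axis and converts the change in local phase at every pixel into a sub-pixel displacement signal over time. The names `synthetic_video` and `per_pixel_phase_signals` and all parameter values are hypothetical.

```python
import numpy as np


def synthetic_video(n_frames=200, height=4, width=64, fps=2000.0,
                    tone_hz=100.0, amp_px=0.05):
    """Striped texture that shifts horizontally by a tiny sinusoidal amount."""
    k0 = 2.0 * np.pi * 8 / width             # spatial frequency: 8 cycles per row
    t = np.arange(n_frames) / fps
    shift = amp_px * np.sin(2.0 * np.pi * tone_hz * t)   # displacement in pixels
    x = np.arange(width)
    frames = np.cos(k0 * (x[None, None, :] - shift[:, None, None]))
    return np.repeat(frames, height, axis=1), shift, k0


def per_pixel_phase_signals(video, k0, sigma=4.0):
    """Return one displacement time series per pixel, shape (T, H, W).

    A complex Gabor filter tuned to spatial frequency k0 is applied along x
    (in the frequency domain, so the convolution is circular). A shift d of
    the texture changes the local phase by -k0 * d, so the phase difference
    to the first frame, divided by -k0, estimates the displacement.
    """
    width = video.shape[-1]
    omega = 2.0 * np.pi * np.fft.fftfreq(width)            # radians per pixel
    gabor = np.exp(-0.5 * (sigma * (omega - k0)) ** 2)      # one-sided band-pass
    response = np.fft.ifft(np.fft.fft(video, axis=-1) * gabor, axis=-1)
    phase_change = np.angle(response * np.conj(response[0]))
    return -phase_change / k0


video, true_shift, k0 = synthetic_video()
signals = per_pixel_phase_signals(video, k0)
max_error = np.max(np.abs(signals - true_shift[:, None, None]))
print(signals.shape, max_error)
```

Each of the `height * width` columns of `signals` plays the role of one pseudo-acoustic signal. On real footage the texture is not a clean sinusoid, so practical systems typically use multiple scales and orientations and weight pixels by local filter amplitude; those refinements are omitted here.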

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12986701/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12986701/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/PMC12986701/full.md

---
Source: https://tomesphere.com/paper/PMC12986701