# Sign2Story: A Multimodal Framework for Near-Real-Time Hand Gestures via Smartphone Sensors to AI-Generated Audio-Comics

**Authors:** Gul Faraz, Lei Jing, Xiang Li

PMC · DOI: 10.3390/s26020596 · Sensors (Basel, Switzerland) · 2026-01-15

## TL;DR

This paper introduces a system that turns live news headlines into AI-generated audio comics. Generation is triggered by hand-wave gestures detected with smartphone motion sensors, offering an alternative to touch or voice input.

## Contribution

The novel contribution is a gesture-based multimodal framework that integrates smartphone motion sensors and generative AI for near-real-time comic creation and audio narration.
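To make the gesture trigger concrete, here is a minimal sketch of how a hand wave might be recognized from a stream of accelerometer readings. This summary does not state the paper's detection method, so everything here is an assumption for illustration: the function name `detect_wave`, the single-axis input, the threshold of 3.0 (in m/s²), and the rule of counting alternating swings are all hypothetical.

```python
from typing import Sequence


def detect_wave(
    ax: Sequence[float],
    threshold: float = 3.0,   # hypothetical magnitude cutoff, m/s^2
    min_swings: int = 3,      # hypothetical number of direction changes
) -> bool:
    """Return True if the signal crosses +/-threshold in alternating
    directions at least `min_swings` times.

    This is an illustrative heuristic, not the paper's algorithm.
    """
    swings = 0
    last_sign = 0  # 0 means no strong sample seen yet
    for a in ax:
        if abs(a) < threshold:
            continue  # ignore small movements and sensor noise
        sign = 1 if a > 0 else -1
        if sign != last_sign:
            # A strong sample in a new direction counts as one swing.
            swings += 1
            last_sign = sign
    return swings >= min_swings
```

A steady push in one direction counts as a single swing and is rejected, while a left-right-left motion passes. A real deployment would likely also need a time window and per-device calibration.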

## Key findings

- LLaVA outperformed Qwen3-VL at generating panel-aligned stories, in both inference speed and output quality relative to the source image.
- The system is extensible: other tasks built on large language models and multimodal AI can be added by mapping them to different hand gestures.
- An AI-in-the-loop mechanism iteratively improves output quality without human intervention.
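The extensibility finding (different gestures mapped to different AI tasks) can be pictured as a small dispatch table. The sketch below is an assumption about how such a mapping could be organized, not the authors' implementation; `GestureRouter`, the gesture labels, and the handler return values are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

# A handler takes no arguments and returns a short status string.
GestureHandler = Callable[[], str]


class GestureRouter:
    """Maps recognized gesture labels to AI tasks (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, GestureHandler] = {}

    def register(self, gesture: str, handler: GestureHandler) -> None:
        # Registering the same gesture again replaces the earlier handler.
        self._handlers[gesture] = handler

    def dispatch(self, gesture: str) -> str:
        handler = self._handlers.get(gesture)
        if handler is None:
            # Unmapped gestures are ignored rather than raising an error.
            return "ignored"
        return handler()
```

With this shape, adding a new capability means registering one more handler, for example `router.register("wave", make_comic)`, where `make_comic` stands in for whatever pipeline produces the comic.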

## Abstract

This study presents a multimodal framework that uses smartphone motion sensors and generative AI to create audio comics from live news headlines. The system operates without direct touch or voice input, instead responding to simple hand-wave gestures. It demonstrates potential as an alternative input method, which may benefit users who find traditional touch or voice interaction challenging. In the experiments, we investigated the generation of comics based on the latest tech-related news headlines, retrieved via Really Simple Syndication (RSS) and triggered by a simple hand-wave gesture. The proposed framework is extensible beyond comic generation, as various other tasks utilizing large language models and multimodal AI could be integrated by mapping them to different hand gestures. Our experiments with open-source models such as LLaMA, LLaVA, Gemma, and Qwen revealed that LLaVA delivers superior results to Qwen3-VL in generating panel-aligned stories, in terms of both inference speed and output quality relative to the source image. These large language models (LLMs) collectively contribute imaginative and conversational narrative elements that enhance diversity in storytelling within the comic format. Additionally, we implement an AI-in-the-loop mechanism to iteratively improve output quality without human intervention. Finally, AI-generated audio narration is incorporated into the comics to create an immersive, multimodal reading experience.
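The AI-in-the-loop mechanism mentioned in the abstract can be read as a generate, critique, and retry cycle. The following is a minimal sketch under that reading; the abstract does not give the stopping rule or the way feedback is used, so the `refine` function, the `min_score` and `max_rounds` parameters, and the practice of appending feedback to the prompt are all assumptions. The generator and critic are passed in as plain callables so that no particular model API is implied.

```python
from typing import Callable, Tuple


def refine(
    generate: Callable[[str], str],
    critique: Callable[[str], Tuple[float, str]],
    prompt: str,
    min_score: float = 0.8,  # hypothetical acceptance threshold
    max_rounds: int = 3,     # hypothetical cap on retries
) -> Tuple[str, float]:
    """Regenerate until the critic's score reaches `min_score` or the
    round limit is hit; return the best draft seen and its score."""
    best_text, best_score = "", float("-inf")
    for _ in range(max_rounds):
        text = generate(prompt)
        score, feedback = critique(text)
        if score > best_score:
            best_text, best_score = text, score
        if score >= min_score:
            break  # good enough, stop early
        # Feed the critic's comments into the next attempt.
        prompt = f"{prompt}\n\nReviewer feedback: {feedback}"
    return best_text, best_score
```

Keeping the best draft, and not simply the last one, guards against a later round scoring lower than an earlier one.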

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12845707/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12845707/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12845707/full.md

---
Source: https://tomesphere.com/paper/PMC12845707