Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Yifei Huang; Jilan Xu; Baoqi Pei; Yuping He; Guo Chen; Lijin Yang; Xinyuan Chen; Yaohui Wang; Zheng Nie; Jinyao Liu; Guoshun Fan; Dechen Lin; Fang Fang; Kunpeng Li; Chang Yuan; Yali Wang; Yu Qiao; Limin Wang

arXiv:2412.21080·cs.CV·December 31, 2024

TL;DR

Vinci is a real-time, portable egocentric vision-language assistant that enables seamless, hands-free interaction, providing contextual responses, task planning, and visual demonstrations based on continuous environment observation.

Contribution

The paper introduces Vinci, a real-time embodied AI system for portable devices that combines egocentric vision, natural language understanding, and visual task demonstrations.

Findings

01 Operates in real-time on portable devices.
02 Provides contextual and historical environment understanding.
03 Generates visual step-by-step task demonstrations.

Abstract

We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust…
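The interaction pattern the abstract describes (continuous observation, a wake-up step, and answers that draw on both the current view and earlier context) can be illustrated with a minimal sketch. This is not Vinci's implementation: `StreamMemory`, `build_context`, the `"hey vinci"` wake phrase, and the use of text captions as stand-ins for video clips are all hypothetical names and simplifications chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

WAKE_WORD = "hey vinci"  # hypothetical wake phrase, not taken from the paper


@dataclass
class Observation:
    timestamp: float
    caption: str  # hypothetical text stand-in for an observed video clip


class StreamMemory:
    """Bounded rolling history of observations (illustrative only)."""

    def __init__(self, max_items: int = 100) -> None:
        # deque with maxlen silently drops the oldest entry when full
        self.history: deque = deque(maxlen=max_items)

    def observe(self, timestamp: float, caption: str) -> None:
        self.history.append(Observation(timestamp, caption))

    def current(self) -> Optional[Observation]:
        return self.history[-1] if self.history else None


def build_context(memory: StreamMemory, utterance: str) -> Optional[dict]:
    """Return the query plus current and historical context,
    or None when the utterance does not start with the wake phrase."""
    stripped = utterance.strip()
    if not stripped.lower().startswith(WAKE_WORD):
        return None  # stay in passive "always on" observation mode
    query = stripped[len(WAKE_WORD):].lstrip(" ,")
    latest = memory.current()
    return {
        "query": query,
        "current": latest.caption if latest else None,
        "history": [obs.caption for obs in memory.history],  # oldest first
    }
```

In a real system the returned context would be passed to a vision-language model and the answer spoken back to the user; the sketch stops at assembling that context.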

Code & Models

Repositories

opengvlab/vinci (PyTorch, official)

Taxonomy

Topics: Robotics and Automated Systems