Vision Encoder-Decoder Models for AI Coaching

Jyothi S Nayak; Afifah Khan Mohammed Ajmal Khan; Chirag Manjeshwar; Imadh Ajaz Banday

arXiv:2311.16161 · cs.CV · November 29, 2023 · 1 citation

Open Access · 2 Repos

TL;DR

This paper presents an integrated vision-encoder-decoder model for AI coaching that processes visual inputs directly to enable natural dialogue, simplifying architecture and improving user experience.

Contribution

The paper introduces an architecture that pairs a Vision Transformer encoder with a GPT-2 decoder to process visual input directly for AI coaching, rather than using separate models for image recognition and text-based coaching.

Findings

01. The model effectively processes visual inputs for coaching dialogues.

02. Scalability is demonstrated with different GPT-2 sizes.

03. The approach shows potential for versatile AI coaching applications.
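
One practical consequence of swapping GPT-2 sizes is that the decoder's hidden width may no longer match the encoder's, so the image features need a linear projection before the decoder can cross-attend to them. The sketch below is illustrative only: it assumes a ViT-base encoder (width 768) and the four publicly released GPT-2 checkpoints, neither of which this page confirms as the paper's exact setup, and `projection_parameters` is a hypothetical helper name.

```python
# Hidden widths of the publicly released checkpoints.
# ASSUMPTION: which encoder and which GPT-2 sizes the paper used is not stated here.
VIT_BASE_WIDTH = 768
GPT2_WIDTHS = {
    "gpt2": 768,
    "gpt2-medium": 1024,
    "gpt2-large": 1280,
    "gpt2-xl": 1600,
}


def projection_parameters(encoder_width: int, decoder_width: int) -> int:
    """Parameters of a linear layer (weights + bias) mapping encoder features
    to the decoder width; 0 when the widths already match."""
    if encoder_width == decoder_width:
        return 0
    return encoder_width * decoder_width + decoder_width


def projection_plan(encoder_width: int = VIT_BASE_WIDTH) -> dict:
    """Map each GPT-2 size to the extra projection parameters it would need."""
    return {
        name: projection_parameters(encoder_width, width)
        for name, width in GPT2_WIDTHS.items()
    }


if __name__ == "__main__":
    for name, extra in projection_plan().items():
        print(f"{name}: {extra} projection parameters")
```

Under these assumptions only the smallest GPT-2 matches the encoder width directly; the larger decoders each need the extra projection layer.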

Abstract

This research paper introduces an innovative AI coaching approach by integrating vision-encoder-decoder models. The feasibility of this method is demonstrated using a Vision Transformer as the encoder and GPT-2 as the decoder, achieving a seamless integration of visual input and textual interaction. Departing from conventional practices of employing distinct models for image recognition and text-based coaching, our integrated architecture directly processes input images, enabling natural question-and-answer dialogues with the AI coach. This unique strategy simplifies model architecture while enhancing the overall user experience in human-AI interactions. We showcase sample results to demonstrate the capability of the model. The results underscore the methodology's potential as a promising paradigm for creating efficient AI coach models in various domains involving visual inputs.…
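
To make "directly processes input images" concrete, here is a minimal sketch of the first step a Vision Transformer encoder performs: cutting the image into fixed-size patches, each of which becomes one token that the GPT-2 decoder can later attend to. This is not the authors' code. The function names are hypothetical, the image is single-channel for brevity, and the 224-pixel input with 16-pixel patches used in the usage note is a common ViT-base configuration assumed here rather than taken from the paper.

```python
from typing import List

Image = List[List[float]]  # single-channel image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles,
    each flattened row-major into one vector (one encoder token)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            tokens.append(tile)
    return tokens


def encoder_sequence_length(height: int, width: int, patch: int) -> int:
    """Number of encoder tokens: one per patch plus one [CLS] token."""
    return (height // patch) * (width // patch) + 1


if __name__ == "__main__":
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(patchify(toy, 2))
    # ASSUMPTION: 224 x 224 input, 16 x 16 patches (typical ViT-base setup).
    print(encoder_sequence_length(224, 224, 16))
```

With the assumed 224 x 224 input and 16 x 16 patches, the encoder emits 14 x 14 patch tokens plus one classification token, and the decoder generates its answer text conditioned on that sequence.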

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Linear Warmup With Cosine Annealing · Layer Normalization · Linear Layer