Vision Encoder-Decoder Models for AI Coaching
Jyothi S Nayak, Afifah Khan Mohammed Ajmal Khan, Chirag Manjeshwar, and Imadh Ajaz Banday

TL;DR
This paper presents an integrated vision-encoder-decoder model for AI coaching that processes visual inputs directly to enable natural dialogue, simplifying architecture and improving user experience.
Contribution
The paper introduces an architecture that combines a Vision Transformer encoder with a GPT-2 decoder to process visual input directly for AI coaching, in contrast to the conventional practice of using separate models for image recognition and text-based coaching.
Findings
Model effectively processes visual inputs for coaching dialogues.
Scalability demonstrated with different GPT-2 sizes.
Potential for versatile AI coaching applications.
Abstract
This research paper introduces an innovative AI coaching approach by integrating vision-encoder-decoder models. The feasibility of this method is demonstrated using a Vision Transformer as the encoder and GPT-2 as the decoder, achieving a seamless integration of visual input and textual interaction. Departing from conventional practices of employing distinct models for image recognition and text-based coaching, our integrated architecture directly processes input images, enabling natural question-and-answer dialogues with the AI coach. This unique strategy simplifies model architecture while enhancing the overall user experience in human-AI interactions. We showcase sample results to demonstrate the capability of the model. The results underscore the methodology's potential as a promising paradigm for creating efficient AI coach models in various domains involving visual inputs.…
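The data flow described in the abstract (image in, encoder states out, decoder conditioned on those states) can be illustrated with a dependency-free toy. This is only a sketch under stated assumptions, not the authors' implementation: the function names `patchify` and `cross_attention`, the 4x4 image, and the 2x2 patch size are hypothetical, and the learned projections, position embeddings, and multi-head structure used by a real Vision Transformer and GPT-2 are deliberately omitted. It also assumes the decoder reads the image through cross-attention, which the abstract does not state explicitly.

```python
import math


def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches.

    This mirrors the first step of a Vision Transformer; a real encoder would
    then apply a learned linear projection and add position embeddings.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Scaled dot-product attention without learned Q/K/V projections.

    Each decoder state acts as a query and receives a softmax-weighted
    average of the encoder states (used here as both keys and values).
    """
    dim = len(encoder_states[0])
    attended = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in encoder_states]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, encoder_states))
                         for i in range(dim)])
    return attended


# Hypothetical toy input: a 4x4 "image" split into four 2x2 patches.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)

# Two hypothetical decoder token states with the same width as a patch.
decoder_states = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
attended = cross_attention(decoder_states, patches)
```

Each row of `attended` is a mixture of patch vectors, which is how a text decoder can draw on image content while generating an answer, without a separate image-recognition model in between.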
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Linear Warmup With Cosine Annealing · Layer Normalization · Linear Layer
