# Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary

**Authors:** Tahoshin Alam Ishat, Mohammad Abdul Qayum

arXiv: 2509.00033 · 2025-09-19

## TL;DR

This paper presents a multimodal system that combines computer vision, hand-motion analysis, and speech recognition so that a language model can predict the recipe being prepared and generate step-by-step cooking instructions in complex kitchen environments.

## Contribution

It integrates a YOLOv8 segmentation model, an LSTM trained on hand-point motion sequences, Whisper speech recognition, and the TinyLLaMa language model into a single pipeline for culinary task analysis and instruction generation.

## Key findings

- Effective object detection in kitchen scenes
- Accurate movement analysis of hand gestures
- Successful recipe prediction and step-by-step guidance

## Abstract

This research explores existing models and fine-tunes them, combining a YOLOv8 segmentation model, an LSTM model trained on hand-point motion sequences, and an ASR model (whisper-base) to extract enough data for an LLM (TinyLLaMa) to predict the recipe and generate a step-by-step text guide for the cooking procedure. All data were gathered by the author to build a robust, task-specific system that performs well in complex and challenging environments, demonstrating how far computer vision can extend into daily activities such as kitchen work. This work opens the field to many more crucial tasks of day-to-day life.
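The abstract does not say how the three model outputs are handed to the LLM. As an illustration only, the sketch below assumes they are reduced to plain text and concatenated into a prompt. Every name here (`Detection`, `Observation`, `build_prompt`, the 0.5 confidence threshold, the prompt wording) is hypothetical and not taken from the paper, and no real model is called.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # class name a segmentation model might emit (assumed)
    confidence: float  # score in [0, 1]


@dataclass
class Observation:
    detections: list[Detection]  # hypothetical detector output for one clip
    motion_labels: list[str]     # hypothetical per-window motion classes
    transcript: str              # hypothetical ASR text for the same clip


def summarize_objects(detections: list[Detection], min_conf: float = 0.5) -> list[str]:
    """Return the sorted, de-duplicated labels of confident detections."""
    return sorted({d.label for d in detections if d.confidence >= min_conf})


def dominant_motions(motion_labels: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k most frequent motion labels, most frequent first."""
    return [label for label, _ in Counter(motion_labels).most_common(top_k)]


def build_prompt(obs: Observation, min_conf: float = 0.5, top_k: int = 2) -> str:
    """Fuse the three modalities into one text prompt for a language model."""
    objects = summarize_objects(obs.detections, min_conf)
    motions = dominant_motions(obs.motion_labels, top_k)
    lines = [
        "Objects seen: " + (", ".join(objects) or "none"),
        "Hand motions: " + (", ".join(motions) or "none"),
        "Speech: " + (obs.transcript.strip() or "none"),
        "Task: name the likely recipe and list the cooking steps.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = Observation(
        detections=[Detection("knife", 0.9), Detection("onion", 0.8)],
        motion_labels=["chop", "chop", "stir"],
        transcript="now dice the onion",
    )
    print(build_prompt(example))
```

Text fusion of this kind keeps each model independent and easy to swap, at the cost of discarding spatial and temporal detail. The system described in the paper may pass richer information to the LLM.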

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00033/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00033/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/2509.00033/full.md

---
Source: https://tomesphere.com/paper/2509.00033