# Multi-modal dialog for browsing large visual catalogs using exploration-exploitation paradigm in a joint embedding space

**Authors:** Indrani Bhattacharya, Arkabandhu Chowdhury, Vikas Raykar

arXiv: 1901.09854 · 2019-01-31

## TL;DR

This paper introduces a multi-modal dialog system that helps online shoppers visually browse large product catalogs. It accepts both text and image queries and responds with relevant images, using a joint embedding space and an exploration-exploitation strategy.

## Contribution

The paper proposes a novel multi-modal dialog framework that models visual browsing as sampling from a Gaussian Mixture Model in a joint embedding space, incorporating an exploration-exploitation paradigm.

## Key findings

- The displayed results reach an average cosine similarity of 0.85 to the ground truth.
- The agent learns to display relevant images based on the dialog context so far.
- A preliminary human evaluation suggests the system is well-received and capable of engaging human users.
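
To make the 0.85 figure concrete, here is a minimal sketch of how an average cosine similarity between displayed and ground-truth embeddings could be computed. The function names and the pairing of one displayed embedding with one ground-truth embedding per turn are assumptions for illustration; the summary does not describe the paper's exact evaluation protocol.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(displayed, ground_truth):
    """Average pairwise cosine similarity (hypothetical evaluation helper).

    Assumes displayed[i] is scored against ground_truth[i].
    """
    if not displayed or len(displayed) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(d, g) for d, g in zip(displayed, ground_truth)]
    return sum(scores) / len(scores)
```

A score of 1.0 means the displayed embedding points in exactly the same direction as the ground truth, and 0.0 means the two are orthogonal.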

## Abstract

We present a multi-modal dialog system to assist online shoppers in visually browsing through large catalogs. Visual browsing is different from visual search in that it allows the user to explore the wide range of products in a catalog, beyond the exact search matches. We focus on a slightly asymmetric version of the complete multi-modal dialog where the system can understand both text and image queries but responds only in images. We formulate our problem of "showing $k$ best images to a user" based on the dialog context so far, as sampling from a Gaussian Mixture Model in a high-dimensional joint multi-modal embedding space that embeds both the text and the image queries. Our system remembers the context of the dialog and uses an exploration-exploitation paradigm to assist in visual browsing. We train and evaluate the system on a multi-modal dialog dataset that we generate from large catalog data. Our experiments are promising and show that the agent is capable of learning and can display relevant results with an average cosine similarity of 0.85 to the ground truth. Our preliminary human evaluation also corroborates the fact that such a multi-modal dialog system for visual browsing is well-received and is capable of engaging human users.
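
The formulation in the abstract can be illustrated with a rough sketch: treat the embeddings of the queries seen so far as the means of a Gaussian mixture, sample points from it, and show the catalog items nearest to each sample. Everything specific here is an assumption made for illustration and is not taken from the paper: the isotropic covariance `sigma` (used as the exploration knob), the recency weighting `decay`, the nearest-neighbor lookup by cosine similarity, and all function names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def sample_gmm(means, weights, sigma, rng):
    """Draw one point from an isotropic Gaussian mixture."""
    mean = rng.choices(means, weights=weights, k=1)[0]  # pick a component
    return [rng.gauss(m, sigma) for m in mean]          # perturb its mean


def show_k_images(query_embeddings, catalog, k, sigma=0.1, decay=0.5, seed=0):
    """Pick up to k distinct catalog items for the current dialog context.

    query_embeddings: joint-space embeddings of the queries so far, oldest first.
    catalog: dict mapping item id -> joint-space embedding.
    sigma: larger values explore further from the queries; 0.0 is pure exploitation.
    decay: assumed recency weighting; older turns get geometrically less weight.
    """
    rng = random.Random(seed)
    n = len(query_embeddings)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # latest turn has weight 1
    remaining = dict(catalog)
    shown = []
    for _ in range(min(k, len(remaining))):
        point = sample_gmm(query_embeddings, weights, sigma, rng)
        best = max(remaining, key=lambda item: cosine(point, remaining[item]))
        shown.append(best)
        del remaining[best]  # never show the same item twice
    return shown
```

With `sigma=0.0` the samples collapse onto the query embeddings, so the result is a plain nearest-neighbor search; raising `sigma` spreads the samples out and surfaces items beyond the exact matches, which is the browsing behavior the abstract contrasts with visual search.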

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.09854/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/1901.09854/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/1901.09854/full.md

---
Source: https://tomesphere.com/paper/1901.09854