Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

Liqi He; Zuchao Li; Xiantao Cai; Ping Wang

arXiv:2312.08762 · cs.AI · December 15, 2023 · 1 citation

TL;DR

This paper introduces a novel multi-modal chain-of-thought reasoning method using latent space learning via diffusion processes, significantly improving complex reasoning in language models involving text and images.

Contribution

It proposes a new approach that aligns image features with language thoughts through latent space diffusion, surpassing previous fixed feature extraction methods.

Findings

01 Achieves state-of-the-art results on the ScienceQA benchmark.

02 Enhances multi-modal reasoning capabilities.

03 Demonstrates robustness across tasks.

Abstract

Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning…
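
The two ingredients named in the abstract, a diffusion process over a latent image representation and attention-based fusion with text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `forward_diffuse` and `cross_attention`, the tensor sizes, and the linear noise schedule are all assumptions chosen for illustration. The sketch covers only the standard forward (noising) step of a DDPM-style process and a single-head attention without learned projections; the learned denoiser and the language model are omitted.

```python
import numpy as np


def forward_diffuse(z0, t, betas, rng):
    """Sample z_t from q(z_t | z_0) in a standard DDPM-style forward process.

    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * noise


def cross_attention(text, image):
    """Single-head scaled dot-product attention: text tokens query image tokens.

    Learned query/key/value projections are left out to keep the sketch short.
    """
    d = text.shape[-1]
    scores = text @ image.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ image, weights


rng = np.random.default_rng(0)

# Hypothetical sizes: 49 image patches and 12 text tokens, both 64-dimensional.
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear noise schedule
image_latent = rng.standard_normal((49, 64))   # stand-in for an encoded image
text_hidden = rng.standard_normal((12, 64))    # stand-in for text hidden states

noisy_latent = forward_diffuse(image_latent, t=500, betas=betas, rng=rng)
attended, weights = cross_attention(text_hidden, noisy_latent)

# Residual fusion: each text token is enriched with the image content it attends to.
fused = text_hidden + attended
```

In a trained system the noisy latent would be passed through a learned denoising network before fusion, so that the image features are shaped by the training objective instead of being fixed outputs of an off-the-shelf vision model.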

Videos

Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks

Methods: Diffusion · ALIGN