PixelBytes: Catching Unified Representation for Multimodal Generation

Fabien Furfaro

arXiv:2410.01820 · cs.CV · October 22, 2024



TL;DR

PixelBytes is a unified multimodal representation learning approach that integrates text, audio, action-state data, and pixelated images. It reports that autoregressive models outperform predictive ones and explores diffusion techniques for complex data generation tasks.

Contribution

The paper proposes PixelBytes, a framework for unified multimodal representation learning that combines text, audio, action-state, and sprite-image data, and compares RNN, SSM, and attention-based architectures alongside diffusion methods.

Findings

01. Autoregressive models outperform predictive models in multimodal tasks.

02. Diffusion models can be effectively applied to control problems.

03. Parallelized generation enhances efficiency in multimodal data synthesis.

Abstract

This report presents PixelBytes, an approach for unified multimodal representation learning. Drawing inspiration from sequence models like Image Transformers, PixelCNN, and Mamba-Bytes, we explore integrating text, audio, action-state, and pixelated images (sprites) into a cohesive representation. We conducted experiments on a PixelBytes Pokemon dataset and an Optimal-Control dataset. Our investigation covered various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, with a focus on bidirectional processing and our PxBy embedding technique. We evaluated models based on data reduction strategies and autoregressive learning, specifically examining Long Short-Term Memory (LSTM) networks in predictive and autoregressive modes. Our results indicate that autoregressive models perform better than predictive models in this…
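
To make the idea of a single sequence over several modalities concrete, the sketch below maps text bytes and sprite palette indices into one shared token space and builds next-token (autoregressive) training pairs from it. It is an illustration under stated assumptions, not the paper's PxBy embedding: the offset layout, the function names, and the toy 2x2 sprite are all hypothetical.

```python
# Hypothetical illustration only: this is NOT the paper's PxBy embedding.
# Assumed layout: token ids 0..255 are raw UTF-8 bytes,
# token ids 256 and above are sprite palette indices.
TEXT_OFFSET = 0
PIXEL_OFFSET = 256


def encode_text(text):
    """Map a string to byte-level token ids."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def encode_sprite(rows):
    """Flatten a 2D grid of palette indices in raster order into token ids."""
    return [PIXEL_OFFSET + p for row in rows for p in row]


def next_token_pairs(tokens, context):
    """Build (context window, target token) pairs for autoregressive training."""
    return [
        (tokens[i - context:i], tokens[i])
        for i in range(context, len(tokens))
    ]


# A caption followed by a toy 2x2 sprite, as one unified sequence.
sequence = encode_text("hi") + encode_sprite([[0, 1], [2, 3]])
pairs = next_token_pairs(sequence, context=3)

for window, target in pairs:
    print(window, "->", target)
```

Because both modalities share one vocabulary, a single sequence model can be trained on `pairs` without modality-specific heads; how the paper actually embeds neighbouring pixels and bytes is not specified in this summary.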


Code & Models

Repositories

fabienfrfr/pixelbytes (PyTorch, official)


Taxonomy

Topics: Speech and dialogue systems

Methods: Focus · Diffusion · PixelCNN