PixelBytes: Catching Unified Embedding for Multimodal Generation

Fabien Furfaro

arXiv:2409.15512·cs.CV·October 23, 2024

TL;DR

PixelBytes introduces a unified embedding for multimodal data that enables coherent sequence generation across text and pixelated images, and evaluates it with RNN, state space, and attention-based architectures.

Contribution

The paper presents PixelBytes Embedding, a novel unified representation method for multimodal data, integrating diverse inputs into a single cohesive embedding for sequence generation.

Findings

01 Bidirectional models with PxBy embedding generate coherent multimodal sequences.

02 PixelBytes achieves integrated understanding and generation of text and pixelated images.

03 Experimental results on the PixelBytes Pokémon dataset validate the approach.

Abstract

This report introduces PixelBytes Embedding, a novel approach for unified multimodal representation learning. Our method captures diverse inputs in a single, cohesive representation, enabling emergent properties for multimodal sequence generation, particularly for text and pixelated images. Inspired by state-of-the-art sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, PixelBytes aims to address the challenges of integrating different data types. We explore various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our innovative PxBy embedding technique. Our experiments, conducted on a specialized PixelBytes Pokémon dataset, demonstrate that bidirectional sequence models with PxBy embedding and convolutional layers can generate coherent multimodal…

Code & Models

Repositories

fabienfrfr/pixelbytes (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems

Methods: PixelCNN