PixelBytes: Catching Unified Embedding for Multimodal Generation
Fabien Furfaro

TL;DR
PixelBytes introduces a unified embedding for multimodal data that enables coherent sequence generation across text and pixelated images. It is explored with recurrent, state space, and attention-based architectures.
Contribution
The paper presents PixelBytes Embedding, a novel unified representation method for multimodal data, integrating diverse inputs into a single cohesive embedding for sequence generation.
Findings
Bidirectional models with PxBy embedding generate coherent multimodal sequences.
PixelBytes achieves integrated understanding and generation of text and pixelated images.
Experimental results on the PixelBytes Pokémon dataset validate the approach.
Abstract
This report introduces PixelBytes Embedding, a novel approach for unified multimodal representation learning. Our method captures diverse inputs in a single, cohesive representation, enabling emergent properties for multimodal sequence generation, particularly for text and pixelated images. Inspired by state-of-the-art sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, PixelBytes aims to address the challenges of integrating different data types. We explore various model architectures, including Recurrent Neural Networks (RNNs), State Space Models (SSMs), and Attention-based models, focusing on bidirectional processing and our innovative PxBy embedding technique. Our experiments, conducted on a specialized PixelBytes Pokémon dataset, demonstrate that bidirectional sequence models with PxBy embedding and convolutional layers can generate coherent multimodal…
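The abstract does not spell out how text and pixels come to share a single representation. As a rough illustration only, and not the paper's implementation, the sketch below assumes one simple scheme: text is encoded as raw UTF-8 bytes (ids 0 to 255) and each pixel is a color from a small fixed palette, given an id just after the byte range. The names `TEXT_VOCAB`, `build_palette_index`, `encode_mixed`, and the three-color palette are all hypothetical.

```python
# Hypothetical sketch: one shared token-id space for text bytes and
# palette-quantized pixels. This is an assumed scheme for illustration,
# not the PxBy embedding as defined in the paper.

TEXT_VOCAB = 256  # one id per possible byte value


def build_palette_index(palette):
    """Give each RGB tuple in the (assumed) palette an id after the byte range."""
    return {rgb: TEXT_VOCAB + i for i, rgb in enumerate(palette)}


def encode_text(text):
    """Encode text as a list of UTF-8 byte values (ids 0..255)."""
    return list(text.encode("utf-8"))


def encode_pixels(rows, palette_index):
    """Flatten an image (list of rows of RGB tuples) in row-major order."""
    return [palette_index[rgb] for row in rows for rgb in row]


def encode_mixed(text, rows, palette_index):
    """Concatenate text tokens and pixel tokens into one sequence."""
    return encode_text(text) + encode_pixels(rows, palette_index)


# Toy example: a 2x2 image drawn from a three-color palette.
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
palette_index = build_palette_index(palette)
image = [
    [(0, 0, 0), (255, 0, 0)],
    [(255, 255, 255), (0, 0, 0)],
]
tokens = encode_mixed("Hi", image, palette_index)
vocab_size = TEXT_VOCAB + len(palette)
```

Under these assumptions, a single embedding table of size `vocab_size` could serve both modalities, so a sequence model would see text and pixels as one token stream.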
Taxonomy
Topics: Speech and dialogue systems
Methods: PixelCNN
