ByteGen: A Tokenizer-Free Generative Model for Orderbook Events in Byte Space
Yang Li, Zhi Chen

TL;DR
ByteGen introduces a novel byte-level generative model for high-frequency limit order book data, eliminating tokenization and feature engineering to better capture market dynamics directly from raw byte streams.
Contribution
The authors present ByteGen as the first end-to-end byte-level framework for LOB modeling. It combines a compact data representation with a hybrid architecture to learn directly from raw market message bytes.
Findings
Successfully reproduces key market stylized facts
Generates realistic price and event distributions
Achieves competitive performance without tokenization biases
Abstract
Generative modeling of high-frequency limit order book (LOB) dynamics is a critical yet unsolved challenge in quantitative finance, essential for robust market simulation and strategy backtesting. Existing approaches are often constrained by simplifying stochastic assumptions or, in the case of modern deep learning models like Transformers, rely on tokenization schemes that affect the high-precision, numerical nature of financial data through discretization and binning. To address these limitations, we introduce ByteGen, a novel generative model that operates directly on the raw byte streams of LOB events. Our approach treats the problem as an autoregressive next-byte prediction task, for which we design a compact and efficient 32-byte packed binary format to represent market messages without information loss. The core novelty of our work is the complete elimination of feature…
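The abstract's two central ideas, a fixed 32-byte packed record per market message and autoregressive next-byte prediction, can be illustrated with a minimal sketch. The field layout below (timestamp, order id, fixed-point price, size, event type, side, two padding bytes) is an assumption made for illustration and is not the paper's actual format. The names `LobEvent`, `pack_event`, `unpack_event`, and `next_byte_pairs` are likewise hypothetical.

```python
import struct
from typing import List, NamedTuple, Tuple

# Hypothetical layout, NOT the paper's: little-endian with no implicit alignment.
# Q = uint64, q = int64, I = uint32, B = uint8, 2x = two explicit padding bytes.
_FMT = "<QQqIBB2x"
RECORD_SIZE = struct.calcsize(_FMT)  # 8 + 8 + 8 + 4 + 1 + 1 + 2 bytes


class LobEvent(NamedTuple):
    timestamp_ns: int  # event time in nanoseconds
    order_id: int      # exchange-assigned order identifier
    price_ticks: int   # price as an integer tick count, so no float rounding
    size: int          # order quantity
    event_type: int    # e.g. 1 = add, 2 = cancel (assumed encoding)
    side: int          # 0 = bid, 1 = ask (assumed encoding)


def pack_event(ev: LobEvent) -> bytes:
    """Serialize one event into a fixed-width binary record."""
    return struct.pack(_FMT, *ev)


def unpack_event(buf: bytes) -> LobEvent:
    """Recover the exact integer fields from a packed record."""
    return LobEvent(*struct.unpack(_FMT, buf))


def next_byte_pairs(stream: bytes) -> Tuple[List[int], List[int]]:
    """Build (input, target) sequences for next-byte prediction.

    Each byte is a symbol in a 256-value vocabulary, and the target
    sequence is the input sequence shifted left by one position.
    """
    ids = list(stream)
    return ids[:-1], ids[1:]
```

Because every field is stored as an integer, packing and unpacking round-trip exactly, which is the sense in which a packed binary record avoids the discretization that binning-based tokenizers introduce. A model trained on `next_byte_pairs` output would predict one of 256 byte values at each position.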
