UIPress: Bringing Optical Token Compression to UI-to-Code Generation
Dasen Dai, Shuoqi Li, Ronghao Chen, Huacan Wang, Biao Wu, Qizhen Lan

TL;DR
UIPress is a learned optical compression module for UI-to-Code generation. It compresses roughly 6,700 visual tokens to 256 and cuts time-to-first-token by 9.1×, while achieving a higher CLIP score than baseline models.
Contribution
UIPress is presented as the first encoder-side learned compression approach for UI-to-Code. It combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress visual tokens with minimal additional parameters.
Findings
Achieves a 9.1× speedup in time-to-first-token.
Outperforms baseline models with a 7.5% higher CLIP score.
Compresses approximately 6,700 visual tokens to 256 tokens.
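To put the token counts above in perspective, the short calculation below derives the compression ratio and a rough estimate of how many fewer token pairs self-attention has to consider. The helper names are hypothetical, and the quadratic-cost model is a simplifying assumption that ignores text tokens, the vision encoder, and other fixed costs; it is not the paper's latency model.

```python
# Back-of-the-envelope arithmetic for the reported token counts.
# Function names and the quadratic attention model are illustrative assumptions.

def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """How many times shorter the visual sequence becomes."""
    return tokens_before / tokens_after


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


ratio = compression_ratio(6700, 256)
pair_ratio = attention_pairs(6700) / attention_pairs(256)

print(f"{ratio:.1f}x fewer visual tokens")
print(f"{pair_ratio:.0f}x fewer visual attention pairs")
```

The visual sequence shrinks by about 26×, which is larger than the reported 9.1× time-to-first-token speedup. That gap is plausible if parts of the pipeline, such as the frozen ViT encoder that still processes the full screenshot, are unaffected by compression, though the summary here does not break the latency down.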
Abstract
UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence -- neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress…
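The abstract's distinction between zeroing out features and actually shortening the sequence can be illustrated with a minimal sketch. Plain average pooling stands in here for the learned compression module, which is an assumption made purely for illustration; the function names, toy token values, and scores are hypothetical and do not reflect the UIPress implementation.

```python
# Masking vs. shortening: only the latter reduces the sequence the decoder prefills.
# Average pooling is a stand-in for learned compression (illustrative assumption).
from typing import List

Token = List[float]


def zero_low_scores(tokens: List[Token], scores: List[float], keep: int) -> List[Token]:
    """Zero the features of all but the `keep` highest-scoring tokens.

    The output has the same length as the input, so prefill cost is unchanged.
    """
    threshold = sorted(scores, reverse=True)[keep - 1]
    return [
        tok if score >= threshold else [0.0] * len(tok)
        for tok, score in zip(tokens, scores)
    ]


def pool_to_length(tokens: List[Token], target: int) -> List[Token]:
    """Average contiguous groups of tokens so that exactly `target` tokens remain."""
    n = len(tokens)
    if not 0 < target <= n:
        raise ValueError("target must be between 1 and len(tokens)")
    dim = len(tokens[0])
    pooled = []
    for i in range(target):
        group = tokens[i * n // target:(i + 1) * n // target]
        pooled.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return pooled


# Toy input: 6,700 two-dimensional "visual tokens" with made-up importance scores.
tokens = [[float(i), 1.0] for i in range(6700)]
scores = [float(i) for i in range(6700)]

masked = zero_low_scores(tokens, scores, keep=256)
pooled = pool_to_length(tokens, target=256)

print(len(masked), len(pooled))
```

Masking leaves all 6,700 positions in the sequence, so the decoder still attends over every one of them, while pooling hands the decoder only 256 tokens.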
