TL;DR
AceTone introduces a unified, multimodal framework for color grading that uses generative models conditioned on text or images, achieving state-of-the-art results and aesthetic alignment.
Contribution
AceTone is the first approach to support multimodal conditioned color grading within a single framework. It uses a VQ-VAE tokenizer to compress each LUT into discrete tokens and reinforcement learning to align outputs with perceptual quality and aesthetics.
Findings
Achieves up to 50% improvement in LPIPS over existing methods.
State-of-the-art performance on text-guided and reference-guided grading tasks.
Human evaluations confirm visually pleasing and stylistically coherent results.
Abstract
Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a LUT vector to 64 discrete tokens with fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves…
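To make the idea of grading via a 3D-LUT concrete, here is a minimal sketch of how a LUT, once produced, could be applied to an image. It is not AceTone's code: the function names, the 17-point grid size, the [r, g, b] index order, and the use of trilinear interpolation are all assumptions chosen for illustration.

```python
import numpy as np


def identity_lut(size: int = 17) -> np.ndarray:
    """Return an identity 3D-LUT of shape (size, size, size, 3), indexed [r, g, b].

    The grid size of 17 is an illustrative assumption, not a value from the paper.
    """
    axis = np.linspace(0.0, 1.0, size)
    r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([r, g, b], axis=-1)


def apply_lut(image: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Apply a 3D-LUT to an RGB image with values in [0, 1] (trilinear interpolation)."""
    size = lut.shape[0]
    # Map each channel value to a continuous position on the LUT grid.
    pos = np.clip(image, 0.0, 1.0) * (size - 1)
    # Lower grid corner per channel, capped so that corner + 1 stays in range.
    lo = np.minimum(np.floor(pos).astype(int), size - 2)
    frac = pos - lo
    out = np.zeros_like(image, dtype=float)
    # Blend the 8 surrounding grid entries, weighted by distance along each axis.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                weight = (
                    (frac[..., 0] if dr else 1.0 - frac[..., 0])
                    * (frac[..., 1] if dg else 1.0 - frac[..., 1])
                    * (frac[..., 2] if db else 1.0 - frac[..., 2])
                )
                corner = lut[lo[..., 0] + dr, lo[..., 1] + dg, lo[..., 2] + db]
                out += weight[..., None] * corner
    return out


# Hypothetical "warm" grade: scale green and blue down in every LUT entry.
warm_lut = identity_lut() * np.array([1.0, 0.9, 0.8])
```

Because the whole grade lives in one small table, a model that outputs a LUT describes a global color transform that can be applied to an image of any resolution. That is the property this sketch is meant to show; how AceTone decodes its 64 tokens into a LUT is not covered here.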
