Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces
Zakaria Elabid, Jan Andrzejewski, Bartosz Brzoza, Attila Cangi

TL;DR
This paper investigates how chemically meaningful information can be extracted from the latent space of a Transformer-VAE trained on molecules, and argues that confound-aware evaluation is needed to confirm that the recovered signal is chemical rather than an artifact of the SELFIES encoding.
Contribution
The paper introduces a confound-aware evaluation method for molecular latent spaces and, using that evaluation, demonstrates robust property steering in a frozen Transformer-VAE.
Findings
Robust monotonic steering for cLogP, FractionCSP3, HeavyAtomCount, TPSA, BertzCT, and HBA under confound-aware evaluation.
Some properties have stable global directions, others are better described by local gradients.
Chemically meaningful steering can emerge in entangled latent spaces when validated properly.
Abstract
Molecular generative models often assume meaningful latent geometry, but apparent property predictability can reflect sequence-level shortcuts rather than chemical organization. We study this issue in an unsupervised autoregressive Transformer-VAE trained on SELFIES. After training, we freeze the model, fit linear probes to RDKit descriptors, and use the probe weights as candidate global steering directions. To separate chemical signal from SELFIES artifacts, we introduce a confound-aware evaluation based on residualization, confound-direction alignment analysis, and decoded-molecule traversal. This is necessary because SELFIES length, branch tokens, ring tokens, and token entropy are strongly encoded in the latent space. Under this confound-aware evaluation, we find robust monotonic steering for cLogP, FractionCSP3, HeavyAtomCount, TPSA, BertzCT, and HBA. Nonlinear probes further show…
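The probe-and-residualize step described in the abstract can be sketched as follows. This is only an illustrative sketch on synthetic data, not the authors' implementation: the latent matrix `Z`, the single `length` confound, the ridge penalty `alpha`, and the helper names `residualize`, `ridge_probe`, `unit`, and `traverse` are all assumptions introduced here. It fits a linear probe before and after regressing the confound out of both the latents and the property, then compares how strongly each probe direction aligns with the confound direction.

```python
import numpy as np

def residualize(target, confounds):
    """Remove the part of `target` linearly explained by `confounds` plus an intercept."""
    X = np.column_stack([np.ones(len(confounds)), confounds])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

def ridge_probe(Z, y, alpha=1.0):
    """Closed-form ridge regression weights on centered inputs (alpha is an assumed penalty)."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ yc)

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def traverse(z, direction, steps):
    """Latent points z + t * direction for each step size t (decoding is left to the model)."""
    return z[None, :] + np.asarray(steps)[:, None] * direction[None, :]

# Synthetic stand-in for frozen-encoder latents (hypothetical data).
rng = np.random.default_rng(0)
n, d = 2000, 16
Z = rng.normal(size=(n, d))
# Hypothetical confound (e.g. SELFIES length), strongly encoded along latent axis 0.
length = Z[:, 0] + 0.1 * rng.normal(size=n)
# Property = signal along axis 1 + leakage from the confound + noise.
prop = 2.0 * Z[:, 1] + 1.5 * length + 0.1 * rng.normal(size=n)

C = length[:, None]
w_conf = unit(ridge_probe(Z, length))    # confound direction
w_naive = unit(ridge_probe(Z, prop))     # probe without confound control
w_resid = unit(ridge_probe(residualize(Z, C), residualize(prop, C)))

# Confound-direction alignment: absolute cosine between probe and confound directions.
align_naive = abs(w_naive @ w_conf)
align_resid = abs(w_resid @ w_conf)
print(f"alignment with confound: naive={align_naive:.3f}, residualized={align_resid:.3f}")

# Candidate steering path from one latent point along the residualized direction.
path = traverse(Z[0], w_resid, steps=[-2.0, -1.0, 0.0, 1.0, 2.0])
```

In this toy setup the naive probe direction mixes the property axis with the confound axis, while the residualized probe concentrates on the property axis. With a real model, each row of `path` would be decoded and the property recomputed on the decoded molecules (for example with RDKit) to check for monotonic change, which is the decoded-molecule traversal the abstract refers to.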
