BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

James Baker

arXiv:2408.04785 · cs.CV · August 12, 2024

TL;DR

This paper introduces BRAT, a novel approach for textual inversion that employs bonus tokens and a vision transformer to improve personalization of diffusion models across different architectures.

Contribution

It proposes a new method using bonus tokens and orthogonality constraints, enabling architecture-agnostic textual inversion without relying on the UNet.

Findings

01 Bonus tokens improve adherence to source images

02 Vision transformer enhances adherence to prompts

03 Method is architecture-agnostic and improves personalization

Abstract

Textual Inversion remains a popular method for personalizing diffusion models, in order to teach models new subjects and styles. We note that textual inversion has been underexplored using alternatives to the UNet, and experiment with textual inversion with a vision transformer. We also seek to optimize textual inversion using a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find the use of the bonus token improves adherence to the source images and the use of the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.
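
The abstract does not spell out how orthogonality is enforced, so the following is only a minimal sketch of one common way to do it, not the paper's implementation. It assumes two learned embedding vectors (a placeholder token and a bonus token) and penalizes their squared cosine similarity, which is 0 when the vectors are orthogonal and 1 when they are parallel. The names orthogonality_penalty, total_loss, and the weight parameter are hypothetical, and plain Python lists stand in for framework tensors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(placeholder, bonus):
    """Squared cosine similarity: 0 if orthogonal, 1 if parallel.

    Assumption: this particular penalty is illustrative; the source only
    says that orthogonality is enforced.
    """
    return cosine_similarity(placeholder, bonus) ** 2


def total_loss(reconstruction_loss, placeholder, bonus, weight=0.1):
    """Hypothetical combined objective: base loss plus weighted penalty."""
    return reconstruction_loss + weight * orthogonality_penalty(placeholder, bonus)


if __name__ == "__main__":
    placeholder = [1.0, 0.0]
    bonus = [1.0, 1.0]
    print(orthogonality_penalty(placeholder, bonus))
    print(total_loss(0.5, placeholder, bonus))
```

Because the penalty depends only on the token embeddings themselves, a term of this kind can be computed without touching any denoiser layers, which is in the spirit of the architecture-agnostic goal described in the abstract.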

Code & Models

Repositories

jamesbaker361/tex_inv_plus (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Attention Is All You Need · Softmax · Dense Connections · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Vision Transformer · Diffusion