Revisiting Kernel Attention with Correlated Gaussian Process Representation
Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, Trong Nghia Hoang

TL;DR
This paper introduces the Correlated Gaussian Process Transformer (CGPT), a novel model that enhances the representation capacity of GP-based transformers by allowing asymmetric attention through correlated GPs, and demonstrates improved performance on benchmarks.
Contribution
The paper proposes CGPT, which enables asymmetric attention in GP-based transformers, and provides a scalable sparse approximation, improving on prior GP-based attention models that were restricted to symmetric attention.
Findings
CGPT outperforms previous GP-based transformers on benchmarks.
Sparse approximation improves scalability without sacrificing accuracy.
CGPT allows asymmetric attention, increasing model expressiveness.
Abstract
Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to their widespread use, being able to estimate and calibrate their modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to satisfy the symmetry requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This…
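The symmetry restriction mentioned in the abstract can be illustrated with a small numerical sketch. This is not the CGPT construction from the paper; it only shows, under assumed toy shapes and an assumed exponential dot-product score, that tying the query and key projections yields a symmetric score matrix (as a GP kernel requires), while separate projections generally do not. The names kernel_scores, W_q, and W_k are hypothetical and chosen for illustration.

```python
import numpy as np


def kernel_scores(X, W_q, W_k):
    """Unnormalized exponential attention scores exp(Q K^T / sqrt(d)).

    Illustrative only: the scoring function and shapes are assumptions,
    not the formulation used in the paper.
    """
    Q = X @ W_q
    K = X @ W_k
    d = Q.shape[1]
    return np.exp(Q @ K.T / np.sqrt(d))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, 8 features (toy sizes)
W_q = rng.normal(size=(8, 4)) / np.sqrt(8)   # hypothetical query projection
W_k = rng.normal(size=(8, 4)) / np.sqrt(8)   # hypothetical key projection

# Tied projections (W_k = W_q): scores[i, j] == scores[j, i].
sym = kernel_scores(X, W_q, W_q)
# Separate projections: the score matrix is generally not symmetric.
asym = kernel_scores(X, W_q, W_k)

sym_is_symmetric = np.allclose(sym, sym.T)
asym_is_symmetric = np.allclose(asym, asym.T)
print("tied projections symmetric:", sym_is_symmetric)
print("separate projections symmetric:", asym_is_symmetric)
```

Standard self-attention uses separate query and key projections, which is why forcing a symmetric kernel narrows the family of attention matrices the model can represent.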
