Multi-head or Single-head? An Empirical Comparison for Transformer Training