Distilling to Hybrid Attention Models via KL-Guided Layer Selection