Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation