Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

Luzhou Peng; Zhengxin Yang; Honglu Ji; Yikang Yang; Fanda Fan; Wanling Gao; Jiayuan Ge; Yilin Han; Jianfeng Zhan

arXiv:2603.04408·cs.CL·March 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces the Probing Memes paradigm, a new evaluation framework for LLMs that captures complex model-item interactions and reveals hidden capabilities and behaviors not visible in traditional metrics.

Contribution

It conceptualizes LLM evaluation as an entangled interaction between models and data using memes, Perception Matrices, and Meme Scores, providing a more nuanced understanding of model behaviors.

Findings

01

Reveals hidden capability structures in LLMs.

02

Quantifies phenomena invisible to traditional evaluation.

03

Supports more informative and extensible benchmarks.

Abstract

Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden…
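As an illustration of the model-item view described in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes the Perception Matrix can be approximated by a binary correctness matrix (rows are models, columns are items). The function names, the `top_k` and `easy_threshold` parameters, and the toy data are all hypothetical; the paper's actual Probe Properties and Meme Scores are not defined in this excerpt.

```python
from typing import Dict, List

# Assumption: entry P[m][j] is 1 if model m answers item j correctly, else 0.
Matrix = List[List[int]]


def item_pass_rates(P: Matrix) -> List[float]:
    """Item-side view: fraction of models that solve each item."""
    n_models = len(P)
    n_items = len(P[0])
    return [sum(row[j] for row in P) / n_models for j in range(n_items)]


def model_accuracies(P: Matrix) -> List[float]:
    """Model-side view: the usual overall accuracy per model."""
    return [sum(row) / len(row) for row in P]


def elite_failures(P: Matrix, top_k: int = 2,
                   easy_threshold: float = 0.6) -> Dict[int, int]:
    """For each of the top_k most accurate models, count the items it
    fails even though at least easy_threshold of all models solve them.
    Both parameters are illustrative choices, not values from the paper."""
    acc = model_accuracies(P)
    # sorted() is stable, so ties keep the original model order.
    elite = sorted(range(len(P)), key=lambda m: -acc[m])[:top_k]
    rates = item_pass_rates(P)
    easy_items = [j for j, r in enumerate(rates) if r >= easy_threshold]
    return {m: sum(1 for j in easy_items if P[m][j] == 0) for m in elite}


# Hypothetical toy matrix: 4 models x 5 items.
P = [
    [1, 1, 1, 0, 1],  # model 0
    [1, 1, 0, 1, 1],  # model 1
    [1, 0, 1, 0, 0],  # model 2
    [1, 1, 1, 0, 0],  # model 3
]

print(item_pass_rates(P))   # [1.0, 0.75, 0.75, 0.25, 0.5]
print(model_accuracies(P))  # [0.8, 0.8, 0.4, 0.6]
print(elite_failures(P))    # {0: 0, 1: 1}
```

In this toy example, models 0 and 1 have identical accuracy, yet model 1 fails an item that three of the four models solve. Overall accuracy hides that difference, while reading the matrix per item exposes it.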

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The problem of better analyzing model evaluations is important
- Some of the proposed metrics are interesting

Weaknesses

1. **Unclear motivation for the meme framing**: The conceptual link to Dawkins and memetics feels forced and adds unnecessary complexity without clear benefit. The core contributions (probe properties and phemotypes) could stand without this metaphor. The paper states that memes are "latent units of model capability that can be revealed through probing", but this is more of a renaming than a substantive theoretical contribution. Why is this memetics lens necessary or illuminating?
2. **Limited

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Capability probing is an important aspect of foundation model evaluation. It helps us make evaluations more granular and extract more information from datapoints. This work makes a positive contribution in that direction.
2. The different probes and phemotypes are well-defined, in theory.
3. Testing 4507 models from OpenLLM Leaderboard is a substantial empirical contribution.

Weaknesses

1. The meme framework seems unnecessary and, without a more grounded theoretical framework and justification in the context of LLM evaluations, could be removed. It leads to unnecessary terminology such as perception matrix, meme probes and phemotypes. The core contributions would not be affected if the metaphor were removed. It also makes some sections quite confusing to read ("latent units of model capability that can be revealed through probing").
2. Several relevant papers have

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1/ In this work, the authors reconceptualize evaluation as an entangled world of models and data, formalizing a perception matrix that supports probe-level properties and interpretable phemotypes; this exposes phenomena hidden by traditional benchmarks (e.g., elite models failing items most models solve) and scales to thousands of models.
2/ The authors validate the framework on a large number of LLMs, showing clear probe/property distributions, family-level structure in phemotype space, and pra

Weaknesses

1/ The authors could consider broadening the tasks (coding, RAG, agents) and adding head-to-head baselines (e.g., IRT-based compact sets, adversarial stress tests) to verify that phemotypes add incremental value beyond existing item- and ability-modeling approaches.
2/ An ablation on property definitions, thresholds, and clustering (e.g., Leiden parameters) would clarify robustness and generality.
3/ For the evaluation results, the authors could add multi-judge adjudication, uncertaint

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Language and cultural evolution