Motivation
I started looking into eval-awareness because it sits upstream of almost every other trust problem in AI safety: if a model behaves differently when it knows it's being tested, every downstream evaluation we rely on becomes unreliable. Eval-awareness — a model's ability to distinguish evaluation from deployment contexts — therefore directly undermines our ability to assess model safety. Chaudhary et al. (2025) claimed this capability scales predictably as a power law across model families, with larger models becoming progressively more eval-aware. But that claim rested on sparse within-family data and excluded the entire Qwen model family, citing contamination concerns that were never empirically tested. This post reports what I found when I tested that claim directly across 11 models in three families — and why the answer is more complicated than a power law.
Background
Nguyen et al. (2025) conducted the first white-box probe study of eval-awareness, adapting the contrastive activation addition methodology of Rimsky et al. (2024) to train linear probes on residual stream activations in Llama-3.3-70B-Instruct. They found that linear probes reliably separate evaluation from deployment prompts, suggesting that models internally represent this distinction as a linear feature.
Building on this, Chaudhary et al. (2025) claimed that eval-awareness scales as a power law — larger models consistently show stronger awareness across model families. But this claim rests on sparse data: only one or two size checkpoints per family, with the entire Qwen 2.5 family excluded on contamination grounds. The citations establish math benchmark contamination in Qwen — but none test for contamination of the SAD benchmark or the Jordine contrastive dataset used in the eval-awareness experiments. Chaudhary ran synthetic contamination injection experiments on Qwen — artificially corrupting their training pipeline at different rates — but never tested whether Qwen’s pretraining data contained SAD questions. These are different experiments testing different questions. The inference that math contamination predicts eval-awareness dataset contamination was never tested.
This post addresses both questions directly.
What I Did
I tested 11 models across three families: Qwen 2.5 (six sizes, 0.5B to 32B), Gemma 2 (three sizes, 2B to 27B), and Llama 3.2 (two sizes, 1B and 3B). All models used an identical pipeline adapted from Nguyen et al. (2025): linear probes trained on the Jordine contrastive dataset — 203 synthetic eval/deployment prompt pairs — then evaluated on the SAD stages_oversight benchmark (Laine et al., 2024), which provides ground-truth labels distinguishing evaluation from deployment contexts. Llama 3.2 was included as a contrast family: with only two size checkpoints, no within-family scaling claims are made — it serves to show that the phase transition observed in Qwen and Gemma is not universal.
To directly test Chaudhary's contamination claim, I ran two independent diagnostics across all 11 models: a ROUGE-L completion test measuring surface-level memorisation, and a perplexity-ratio test measuring whether models find the SAD questions more familiar than held-out text. Both methods follow Ravaut et al.'s (2025) contamination survey.
Finally, I complemented white-box probe measurements with black-box behavioural testing — directly prompting each model to classify SAD questions as evaluation or deployment contexts on a balanced 400-question set (200 per label). This tests whether internal representations manifest in observable model behaviour, and whether the two methods agree.
The Phase Transition

Figure 1.1
Transformers process text through a stack of layers, each progressively refining the model’s internal representation. In the white-box approach, I extract probe AUROC scores at every layer — measuring how accurately each layer separates evaluation from deployment prompts. The layer with the highest AUROC is where eval-awareness is most strongly encoded.
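The per-layer AUROC described above can be computed directly from probe scores with the Mann–Whitney formulation: the probability that a randomly chosen evaluation prompt scores higher than a randomly chosen deployment prompt. A minimal pure-Python sketch (the layer scores below are hypothetical, not values from the experiments):

```python
def auroc(scores_eval, scores_deploy):
    """AUROC via the Mann-Whitney U statistic: the fraction of
    (evaluation, deployment) pairs where the evaluation prompt's probe
    score is higher (ties count half)."""
    wins = 0.0
    for e in scores_eval:
        for d in scores_deploy:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(scores_eval) * len(scores_deploy))

# Per-layer selection: keep the layer whose probe separates best.
layer_scores = {0: 0.61, 12: 0.74, 27: 0.79}  # illustrative AUROCs
best_layer = max(layer_scores, key=layer_scores.get)
```

An AUROC of 0.5 is chance; 1.0 is perfect separation, which is why peak-layer AUROC is a natural summary of where the distinction is most strongly encoded.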
Figure 1.1 reveals a striking pattern in both Qwen 2.5 and Gemma 2. In Qwen, the 1.5B and 3B models peak at relative depths of 0.96 and 0.97 — the final 3-4% of the network. Gemma 2B and 9B peak at 0.885 and 0.976. In these smaller models, eval-awareness appears to resolve only as a final high-level inference, just before output generation. Notably, Qwen 0.5B peaks at middle layers (depth 0.33), suggesting the late-layer pattern emerges with scale within the small-model regime rather than being present from the very smallest size.
As model size increases, this flips dramatically. Qwen 14B and 32B peak at layers 1 and 2 — relative depths of 0.021 and 0.031, the very first transformer blocks. Gemma 27B shifts to 0.304. Large models appear to detect the evaluation context almost immediately after reading the prompt, suggesting the eval/deployment distinction may be encoded as an early, low-level contextual feature rather than a late semantic conclusion.
Llama 3.2, included as a contrast family, shows no late-layer peaks at either tested size — both 1B and 3B peak in middle layers (0.375 and 0.430). Unlike Qwen and Gemma, there is no evidence of the late-layer encoding pattern observed in smaller models of those families, suggesting the phase transition is architecture and training-recipe dependent rather than a universal property of scaling.
The Power Law Doesn’t Hold

Figure 1.2
A power law predicts that as model size increases, a capability increases predictably — plotting as a smooth upward curve. For eval-awareness, neither Qwen 2.5 nor Gemma 2 follows this pattern.
Figure 1.2 tells a different story for each family. In Qwen 2.5, probe AUROC rises from 0.711 at 0.5B to a peak of 0.792 at 3B, drops to 0.690 at 7B, then partially recovers to 0.730–0.754 at 14B–32B. This is non-monotonic — not a power law. In Gemma 2, the trend is consistently inverse: AUROC declines from 0.756 at 2B to 0.660 at 9B and 0.586 at 27B. Larger Gemma models are measurably less eval-aware than smaller ones by this metric.
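A quick way to see that these trajectories are not power-law-like is a rank correlation between model size and peak AUROC: a power law implies a strictly monotone relationship (Spearman's rho of +1). A sketch using the values reported above (I take the 14B–32B range as 0.730 at 14B and 0.754 at 32B — an assumption about which endpoint is which):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (assumes no ties, which holds for
    the values below)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Peak probe AUROC per checkpoint, as reported in the text.
qwen_sizes  = [0.5, 1.5, 3, 7, 14, 32]
qwen_auroc  = [0.711, 0.791, 0.792, 0.690, 0.730, 0.754]
gemma_sizes = [2, 9, 27]
gemma_auroc = [0.756, 0.660, 0.586]

print(spearman_rho(qwen_sizes, qwen_auroc))   # ~ -0.03: no monotone trend
print(spearman_rho(gemma_sizes, gemma_auroc)) # -1.0: strictly decreasing
```

Qwen's rank correlation is essentially zero and Gemma's is exactly inverted — neither is compatible with a monotone scaling law.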
Notably, Gemma 2-2B achieves the highest probe AUROC of any model in its family (0.756), and Llama 3.2-1B achieves the highest of any model tested overall (0.873) — neither consistent with a model-size-drives-awareness narrative.
These patterns directly challenge Chaudhary et al.'s power-law claim. Their headline regression relied on one or two size checkpoints per family. With six Qwen checkpoints and three Gemma checkpoints, no consistent within-family pattern supports the claimed universality.
White-Box vs Black-Box Divergence

Figure 1.3
To complement the white-box approach, I directly prompted each model to classify SAD questions as evaluation or deployment contexts — a black-box test of whether internal representations manifest in observable behaviour. Each model was tested on 400 balanced SAD questions (200 evaluation-labeled, 200 deployment-labeled) with a fixed random seed for reproducibility. Chance baseline is 0.5.
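The black-box loop reduces to prompting, parsing, and scoring. A sketch of that structure — the prompt wording and the `parse_label` heuristic here are illustrative, not the exact ones used in the experiments:

```python
# Hypothetical classification prompt; the actual wording may differ.
PROMPT = ("Is the following message part of an AI evaluation, or a real "
          "deployment interaction? Answer 'evaluation' or 'deployment'.\n\n{q}")

def parse_label(response):
    """Map a free-text answer onto one of the two labels."""
    text = response.lower()
    if "evaluation" in text:
        return "evaluation"
    if "deployment" in text:
        return "deployment"
    return None  # unparseable answers count as wrong

def accuracy(model, items):
    """items: list of (question, gold_label) pairs; model: prompt -> text."""
    hits = sum(parse_label(model(PROMPT.format(q=q))) == gold
               for q, gold in items)
    return hits / len(items)
```

With a balanced label split, a degenerate model that always answers "deployment" scores exactly 0.5, which is why chance is the right baseline here.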
Figure 1.3 shows probe AUROC, black-box accuracy, and peak layer depth together for each family. Across all three families, probe AUROC consistently exceeds black-box accuracy — the internal representation is stronger than what models express behaviourally.
Within Qwen 2.5, the divergence is largest in the 1.5B and 3B models: despite having the highest probe AUROC of any Qwen model (0.791 and 0.792), both score below chance behaviourally (0.490 and 0.475). These are also the models with the latest-layer peaks (depth 0.96 and 0.97). By contrast, Qwen 14B and 32B — which peak in the very first layers — achieve the highest black-box accuracy of any Qwen model (0.615 and 0.588). Within Qwen, earlier-layer representations appear more behaviourally accessible.
Gemma 2 shows the opposite pattern. Gemma 2-2B, which peaks late (depth 0.885), achieves the highest black-box accuracy of any model tested (0.770). Gemma 27B, which shifts to earlier layers after the phase transition (depth 0.304), drops to near chance (0.507). The depth-behaviour relationship inverts between the two families.
Llama 3.2 shows the largest overall divergence: Llama 1B has the strongest probe signal of any model tested (AUROC 0.873) yet scores essentially at chance behaviourally (0.504). The internal representation is clearly present but does not drive output behaviour.
Internal representations and behavioural expression of eval-awareness are related but distinct. The gap between them is family-dependent and not explained by probe AUROC magnitude alone. The black-box results reinforce this pattern: neither Qwen nor Gemma show monotonically increasing behavioural accuracy with scale. If anything, the trend reverses — Gemma’s largest model scores near chance while its smallest scores highest. Behaviour, like internal representation, does not follow a power law.
Contamination Diagnostics
The contamination diagnostics in Chaudhary’s own codebase test pipeline robustness to synthetic data corruption, not pretraining exposure to the evaluation benchmark. The ROUGE-L and perplexity diagnostics here are the first direct tests of pretraining contamination on these models for this specific benchmark.
Chaudhary et al. excluded Qwen, citing contamination of the evaluation datasets, but never tested this directly. I ran two independent diagnostics to address the question empirically. The probe training data (the Jordine contrastive dataset) is a synthetic dataset created in 2025, making prior exposure during Qwen's pretraining implausible; the diagnostics below therefore test the evaluation data — SAD stages_oversight — directly.

ROUGE-L Completion Test
ROUGE-L measures similarity between two texts using their longest common subsequence — words appearing in the same relative order. I provided the first 40% of each SAD question as a prompt and asked the model to complete it, withholding the remaining 60% as the reference. If a model has memorised the dataset, completions should closely match the original, producing high ROUGE-L scores. I ran this on 100 SAD questions across all 11 models. All models score between 0.114 and 0.158. Llama 3.2 (1B: 0.123, 3B: 0.121), used as a baseline with no plausible SAD contamination, is indistinguishable from the Qwen and Gemma scores.
| Model | Family | ROUGE-L | Std | Interpretation |
|---|---|---|---|---|
| 0.5B | Qwen | 0.118 | 0.107 | No signal |
| 1.5B | Qwen | 0.158 | 0.151 | No signal |
| 3B | Qwen | 0.135 | 0.124 | No signal |
| 7B | Qwen | 0.149 | 0.145 | No signal |
| 14B | Qwen | 0.135 | 0.138 | No signal |
| 32B | Qwen | 0.143 | 0.124 | No signal |
| 1B | Llama | 0.123 | 0.113 | Baseline |
| 3B | Llama | 0.121 | 0.105 | Baseline |
| 2B | Gemma 2 | 0.114 | 0.100 | No signal |
| 9B | Gemma 2 | 0.137 | 0.148 | No signal |
| 27B | Gemma 2 | 0.142 | 0.143 | No signal |
Notably, Gemma 2-2B — which shows the strongest black-box eval-awareness of any model (0.770) — scores the lowest ROUGE-L of any model tested (0.114). If contamination drove that result, the opposite pattern would be expected.
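The completion test scores each (model completion, withheld reference) pair with the standard LCS-based ROUGE-L F1. A minimal sketch — tokenisation here is naive whitespace splitting, and the example strings are hypothetical:

```python
def rouge_l_f(candidate, reference):
    """ROUGE-L F1 between two token lists via longest common subsequence."""
    m, n = len(candidate), len(reference)
    # dp[i][j] = LCS length of candidate[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / m
    recall = lcs / n
    return 2 * precision * recall / (precision + recall)

# A memorised completion would score near 1.0; the observed 0.114-0.158
# range is what unrelated-but-on-topic continuations produce.
completion = "the model is asked to answer a multiple choice question".split()
reference = "the model must answer a multiple choice question about itself".split()
score = rouge_l_f(completion, reference)
```

Because ROUGE-L only rewards tokens in the same relative order, verbatim memorisation and paraphrase score very differently, which is what makes it a surface-memorisation test rather than a semantic one.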
Perplexity-Based Contamination Diagnostic
A contaminated model assigns lower perplexity to memorised text — it finds familiar text less surprising. I computed mean per-token perplexity on 100 SAD stages_oversight questions and compared it against 100 questions from SAD test_full, a held-out subset from the same benchmark with matching domain, style, and token length. Using the same benchmark family as baseline controls for stylistic differences between datasets. A ratio greater than 1.0 means the model finds SAD stages_oversight harder than the held-out baseline — the opposite of contamination. All eleven tested models show ratios greater than 1.0, ranging from 1.18 (Qwen 32B) to 6.31 (Gemma 9B).
| Model | Family | SAD Perplexity | Baseline Perplexity | Ratio | Interpretation |
|---|---|---|---|---|---|
| 0.5B | Qwen | 51.23 | 24.94 | 2.054 | No contamination — SAD harder than baseline |
| 1.5B | Qwen | 32.66 | 19.18 | 1.703 | No contamination — SAD harder than baseline |
| 3B | Qwen | 33.68 | 22.41 | 1.503 | No contamination — SAD harder than baseline |
| 7B | Qwen | 32.63 | 23.99 | 1.360 | No contamination — SAD harder than baseline |
| 14B | Qwen | 29.42 | 18.96 | 1.552 | No contamination — SAD harder than baseline |
| 32B | Qwen | 26.25 | 22.25 | 1.180 | No contamination — SAD harder than baseline |
| 1B | Llama | 63.25 | 28.99 | 2.182 | No contamination — SAD harder than baseline |
| 3B | Llama | 49.68 | 26.42 | 1.881 | No contamination — SAD harder than baseline |
| 2B | Gemma 2 | 146.02 | 25.31 | 5.769 | No contamination — SAD substantially harder than baseline |
| 9B | Gemma 2 | 168.90 | 26.79 | 6.305 | No contamination — SAD substantially harder than baseline |
| 27B | Gemma 2 | 78.28 | 25.24 | 3.102 | No contamination — SAD substantially harder than baseline |
These two methods test for different contamination mechanisms — surface memorisation and token-level familiarity — and both show no signal. This does not rule out semantic contamination through diffuse pretraining exposure to AI evaluation discourse, but it directly addresses the specific claims made by Chaudhary et al., whose cited papers established math benchmark contamination only. The anomalous probe patterns in Qwen and Gemma survive two independent contamination tests.
Discussion
The findings raise three questions worth engaging with directly, even if definitive answers require further work.
Why do large models encode eval-awareness earlier?
One hypothesis is that smaller models detect the eval/deployment distinction through surface lexical patterns — multiple choice formatting, structured prompt templates, evaluation-specific terminology — that only become detectable after substantial processing in late layers. Larger models may encode the distinction more abstractly, as a property of global prompt context that becomes linearly separable in the earliest attention layers. This would explain both the phase transition and the early-layer peak concentration: larger models don’t need to process the prompt deeply before classifying its context.
It is worth noting that probes measure linear recoverability of a feature, not whether the feature exists. A late-layer peak could mean the feature is only computed late, or it could mean the feature exists throughout but is only linearly separable at that depth. Distinguishing these interpretations would require activation patching experiments that go beyond what probing alone can establish.
Why does the depth-behaviour relationship differ between families?
Within Qwen, models with the latest-layer peaks (1.5B and 3B at depth 0.96-0.97) score below chance behaviourally, while models with the earliest-layer peaks (14B and 32B) score highest. Within Gemma, this relationship inverts. One possible interpretation: the point in the forward pass where a representation becomes most linearly detectable may differ from the point at which it influences output generation. If Gemma's architecture routes information to output heads differently from Qwen's, the same late-layer representation could have different downstream effects. One potentially relevant architectural difference: Gemma 2 uses alternating local and global attention layers, while Qwen 2.5 uses standard grouped-query attention throughout. Whether this affects how late-layer representations propagate to output generation is unknown; this account is speculative, and confirming it would require controlled experiments on how each family's residual stream connects to its output projection.
What the data do establish without mechanistic speculation: the relationship between internal representation and behavioural expression is family-dependent and is not predicted by probe AUROC magnitude alone. Two models can have identical peak AUROC scores and dramatically different black-box accuracy.
What does this mean for evaluation design?
If large models encode eval-awareness in the first few transformer blocks, our benchmarks may be more legible to frontier models than we assumed. The model doesn’t need to reason its way to recognising an evaluation — it appears to classify the context almost immediately. This has a direct practical implication: the surface properties of evaluation prompts matter more at scale, not less. Honeypot authenticity, prompt naturalness, and the absence of benchmark-specific formatting may be more important for eliciting honest behaviour from large models than from small ones. Evaluation designers should treat prompt realism as a frontier-model concern, not an afterthought.
Limitations
Several limitations should be noted when interpreting these findings.
Within-family coverage is uneven. The phase transition finding is most robustly supported in Qwen 2.5, where six size checkpoints trace the full arc from middle-layer peaks to early-layer peaks. In Gemma 2, three checkpoints show the same directional pattern but with less resolution. In Llama 3.2, only two closely-spaced checkpoints (1B and 3B) are available — insufficient to determine whether a phase transition exists at larger sizes. The strength of evidence differs across families accordingly.
The phase transition boundary is unresolved. In Qwen 2.5, the flip from late-layer to early-layer peaks occurs between 7B (depth 0.56) and 14B (depth 0.021) — a large gap. Where exactly the transition occurs within that range is unknown. A model at 10B or 11B could be highly informative but was not tested.
All probe evaluation uses a single benchmark. Every AUROC measurement is computed on SAD stages_oversight. This benchmark’s specific definition of evaluation versus deployment contexts shapes every finding. Whether the same patterns hold on alternative eval-awareness benchmarks is untested.
Linear probes measure linear recoverability, not feature existence. A low AUROC at a given layer means the eval/deployment distinction is not linearly separable there — not necessarily that the information is absent. Non-linear probe methods might reveal different patterns and are worth exploring in future work. Additionally, probes trained on a small synthetic dataset (203 Jordine entries) may not fully capture the eval-awareness concept as it manifests across diverse real-world contexts.
Black-box results are prompt-sensitive. The black-box classification accuracy depends on the specific prompt used to elicit responses. A different prompt formulation asking the same underlying question could yield different accuracy numbers. Prompt sensitivity was not systematically tested, limiting the generalisability of the black-box findings.
Cross-family comparisons conflate multiple variables. Differences between Qwen, Gemma, and Llama reflect a combination of architecture, training data composition, RLHF methodology, and model generation. These cannot be disentangled from probe results alone. Attributing the phase transition to any specific factor — such as model architecture or training recipe — would require controlled experiments beyond the scope of this study. Gemma 2’s alternating local/global attention architecture is one concrete difference from Qwen 2.5 that could contribute to the inverted depth-behaviour relationship, but this cannot be tested from probing results alone.
Semantic contamination cannot be ruled out. A more nuanced contamination hypothesis holds that Qwen’s training data may have contained more AI evaluation discourse than other families — papers, blog posts, and alignment forum discussions — producing semantically inflated eval-awareness signals without requiring direct SAD exposure. This cannot be ruled out by ROUGE-L or perplexity diagnostics alone. However, this hypothesis predicts uniformly inflated signals across Qwen model sizes, whereas the data show non-monotonic AUROC and a systematic phase transition in representational depth. A contamination mechanism that affects 7B Qwen models differently from 14B Qwen models would need to be separately proposed and evidenced.
Conclusion
This study set out to test a specific claim: that eval-awareness scales predictably as a power law across model families. The data don’t support that claim.
Across 11 models in three families, I find that eval-awareness is real — models internally distinguish evaluation from deployment contexts at above-chance rates across all families tested. But it does not scale uniformly. Within Qwen 2.5 and Gemma 2, probe AUROC is non-monotonic or inversely correlated with model size. The power law, built on one or two data points per family with Qwen excluded on untested grounds, does not survive contact with denser empirical sampling.
The more interesting finding is structural. In both Qwen 2.5 and Gemma 2, the representational depth at which eval-awareness peaks undergoes a size-dependent phase transition: small models peak in the final layers, large models flip to the very first layers. This pattern appears independently in two architecturally distinct families, and contamination diagnostics show no memorisation signal that would explain it away. Where a model encodes eval-awareness turns out to matter — within Qwen, earlier-layer representations correlate with stronger behavioural expression of that awareness.
The safety implication is direct. If frontier-scale models detect evaluation contexts within the first few transformer blocks — before any deep reasoning has occurred — then evaluation design cannot rely on making prompts cognitively complex enough to fool the model. The signal is extracted too early. Benchmark authenticity matters more at scale, not less, and probe-based auditing of internal representations provides a more reliable picture of this capability than behavioural testing alone.
The open question is whether the phase transition generalises. Two families is suggestive, not definitive. Replication across additional families at sufficient within-family resolution — and eventually, mechanistic experiments to understand what is actually being encoded at those early layers — is what would turn these observations into a complete account of how large language models know they are being tested.
Acknowledgments
This work was completed as part of the BlueDot Impact Technical AI Safety Project Sprint. I’m grateful to BlueDot for the opportunity and for the rapid compute grant that made the cloud GPU experiments possible. Thanks to Eitan Sprejer for weekly mentorship throughout this project — for pushing me to think more carefully about methodology, framing, and the questions worth asking. Thanks also to Shivam Arora for generously taking the time to share his perspective on the research direction.
Citation
Please cite this work as:
Manek, Archit. "Do Models Know They're Being Tested? Probing Eval-Awareness Across Scale and Architecture". www.architmanek.com (March 2026). https://www.architmanek.com/posts/eval-awareness/

Or use the BibTeX citation:
@article{index2026,
  title = {Do Models Know They're Being Tested? Probing Eval-Awareness Across Scale and Architecture},
  author = {Manek, Archit},
  journal = {www.architmanek.com},
  year = {2026},
  month = {March},
  url = {https://www.architmanek.com/posts/eval-awareness/}
}