TLDR
Across 11 models in three families, I find a size-dependent phase transition in how models internally encode eval-awareness: small models peak in the final layers, large models flip to the very first layers — confirmed independently in Qwen 2.5 and Gemma 2. Probe AUROC does not scale monotonically with model size in Qwen 2.5, and scales inversely in Gemma 2, challenging prior claims of a universal power law. White-box probing consistently reveals stronger signals than black-box testing, and the gap between the two methods is family-dependent.

Background
If a model behaves differently when it knows it’s being tested, every safety evaluation we run on it becomes unreliable. Nguyen et al. (2025) conducted the first white-box probe study of eval-awareness, adapting the contrastive activation addition methodology of Panickssery et al. (2024) to train linear probes on residual stream activations in Llama-3.3-70B-Instruct. They found that linear probes reliably separate evaluation from deployment prompts, suggesting that models internally represent this distinction as a linear feature.
Building on this, Chaudhary et al. (2025) claimed that eval-awareness scales as a power law — larger models consistently show stronger awareness across model families. But this claim rests on sparse data: only one or two size checkpoints per family, with the entire Qwen 2.5 family excluded on contamination grounds. The cited papers establish math benchmark contamination in Qwen — but none test for contamination of the SAD benchmark or the Jordine contrastive dataset used in the eval-awareness experiments. Chaudhary ran synthetic contamination injection experiments — artificially corrupting a training pipeline at different rates — but never tested whether Qwen’s pretraining data contained SAD questions. These are different experiments testing different questions. This post tests both claims directly across 11 models and 3 families.
What I Did
I tested 11 models across three families: Qwen 2.5 (six sizes, 0.5B to 32B), Gemma 2 (three sizes, 2B to 27B), and Llama 3.2 (two sizes, 1B and 3B). All models were evaluated using the probing methodology of Nguyen et al. (2025). For each model and each transformer layer, I computed a contrastive probe direction from the Jordine contrastive dataset — 203 synthetic eval/deployment prompt pairs. The probe direction at each layer is the difference between mean residual stream activations across evaluation-labelled prompts and deployment-labelled prompts, L2-normalised. No gradient-based training is performed; the direction is derived analytically. Each model is then evaluated on the SAD stages_oversight benchmark (Laine et al., 2024), which provides ground-truth labels distinguishing evaluation from deployment contexts. Llama 3.2 was included as a contrast family: with only two size checkpoints, no within-family scaling claims are made — it serves to show that the phase transition observed in Qwen and Gemma is not universal.
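The probe construction described above can be sketched in a few lines. This is a minimal sketch, assuming residual-stream activations have already been extracted into arrays of shape (n_prompts, d_model) for one layer; the function name is mine, not from the replication code:

```python
import numpy as np

def probe_direction(eval_acts: np.ndarray, deploy_acts: np.ndarray) -> np.ndarray:
    """Contrastive probe direction at one layer: the difference between
    mean activations over evaluation-labelled and deployment-labelled
    prompts, L2-normalised. No gradient-based training is involved.

    eval_acts:   (n_eval, d_model) activations for evaluation prompts
    deploy_acts: (n_deploy, d_model) activations for deployment prompts
    """
    diff = eval_acts.mean(axis=0) - deploy_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)
```

Because the direction is derived analytically, computing it for all layers of all 11 models is cheap once activations are cached.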
To directly test Chaudhary’s contamination claim, I ran two independent diagnostics across eleven models. The first is a ROUGE-L completion test following Golchin & Surdeanu (2023), measuring surface-level memorisation by comparing model completions against withheld portions of SAD questions. The second is a perplexity-ratio test, measuring whether models find SAD questions more familiar than held-out text — a method surveyed in Ravaut et al. (2024).
Finally, I complemented white-box probe measurements with black-box behavioural testing — directly prompting each model to classify SAD questions as evaluation or deployment contexts on a balanced 400-question dataset (200 evaluation-labeled, 200 deployment-labeled). This tests whether internal representations manifest in observable model behaviour, and whether the two methods agree.
All code and results are available on GitHub.
The Phase Transition

Figure 1.1: Relative peak layer depth vs model size. Each point is the layer (normalised 0–1) where probe AUROC peaks for that model. Orange shading marks late layers (>0.75), blue shading marks early layers (<0.12). Llama 3.2 shown as a contrast family (dashed line).
Transformers process text through a stack of layers, each progressively refining the model’s internal representation. In the white-box approach, I extract probe AUROC scores at every layer — measuring how accurately each layer separates evaluation from deployment prompts. The layer with the highest AUROC is where the eval/deployment distinction is most linearly recoverable from the residual stream.
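The per-layer scoring reduces to projecting each layer's activations onto that layer's probe direction and computing AUROC on the resulting scalar scores. A sketch, using a rank-based AUROC to stay self-contained; the helper names are mine, and the code assumes no tied scores:

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Mann-Whitney formulation of AUROC (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def peak_layer(acts_by_layer, directions, labels):
    """Score every layer by projecting its activations onto its probe
    direction; return (peak layer index, peak AUROC, relative depth 0-1)."""
    aurocs = [auroc(acts @ d, labels) for acts, d in zip(acts_by_layer, directions)]
    best = int(np.argmax(aurocs))
    return best, aurocs[best], best / (len(acts_by_layer) - 1)
```

The relative depth returned here is the quantity plotted in Figure 1.1.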
Figure 1.1 reveals a striking pattern in both Qwen 2.5 and Gemma 2. In Qwen, the 1.5B and 3B models peak at relative depths of 0.96 and 0.97 — the final 3–4% of the network. Gemma 2B and 9B peak at 0.885 and 0.976. In these smaller models, eval-awareness appears to resolve only as a final high-level inference, just before output generation. Notably, Qwen 0.5B peaks at middle layers (depth 0.33), suggesting the late-layer pattern emerges with scale within the small-model regime rather than being present from the very smallest size.
As model size increases, this flips dramatically. Qwen 14B and 32B peak at layers 1 and 2 — relative depths of 0.021 and 0.031, the very first transformer blocks. Gemma 27B shifts to 0.304. Large models appear to detect the evaluation context almost immediately after reading the prompt, suggesting the eval/deployment distinction may be encoded as an early, low-level contextual feature rather than a late semantic conclusion.
Llama 3.2, included as a contrast family, shows no late-layer peaks at either tested size — both 1B and 3B peak in middle layers (0.375 and 0.43). Unlike Qwen and Gemma, there is no evidence of the late-layer encoding pattern observed in smaller models of those families, suggesting the phase transition is architecture and training-recipe dependent rather than a universal property of scaling.
The Power Law Doesn’t Hold

Figure 1.2: Peak probe AUROC vs model size (log scale). Solid lines connect same-family checkpoints. Dotted line marks chance (0.5). Llama 3.2 shown as a contrast family (dashed).
A power law predicts that as model size increases, a capability increases at a consistent, predictable rate — plotting as a smooth upward curve on a log scale. For eval-awareness, neither Qwen 2.5 nor Gemma 2 follows this pattern.
Figure 1.2 tells a different story for each family. In Qwen 2.5, probe AUROC rises from 0.711 at 0.5B to a peak of 0.792 at 3B, drops to 0.690 at 7B, then partially recovers to 0.730–0.754 at 14B–32B. This is non-monotonic — not a power law. In Gemma 2, the trend is consistently inverse: AUROC declines from 0.756 at 2B to 0.660 at 9B and 0.586 at 27B. Larger Gemma models are measurably less eval-aware than smaller ones by this metric.
Notably, Gemma 2-2B achieves the highest probe AUROC of any model in its family (0.756), and Llama 3.2-1B achieves the highest of any model tested overall (0.873) — neither consistent with a model-size-drives-awareness narrative.
These patterns directly challenge Chaudhary et al. (2025)’s power-law claim. Their headline regression relied on one or two size checkpoints per family. With six Qwen data points and three Gemma data points, neither family shows a consistent within-family scaling pattern, and the claimed universality does not hold.
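The power-law test can be made concrete: AUROC ≈ a·N^b is a straight line in log-log space, so an ordinary least-squares fit there recovers the exponent and an R² measuring how well the claim holds. A sketch (the helper name is mine); a non-monotonic series like the Qwen one cannot yield an R² near 1, which is the quantitative content of the claim breaking down:

```python
import numpy as np

def power_law_fit(sizes_b, aurocs):
    """Least-squares fit of AUROC = a * N^b on log-log axes.
    Returns the exponent b and R-squared; a clean power law gives R^2 near 1."""
    x, y = np.log(sizes_b), np.log(aurocs)
    b, log_a = np.polyfit(x, y, 1)
    pred = b * x + log_a
    ss_res = float(np.sum((y - pred) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return float(b), 1.0 - ss_res / ss_tot
```

Running this on synthetic data that truly follows a power law recovers the exponent almost exactly; running it on the reported Qwen or Gemma AUROCs is left as the diagnostic.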
White-Box vs Black-Box Divergence

Figure 1.3: Each panel shows probe AUROC (solid), black-box classification accuracy (dashed), and relative peak layer depth (orange, right axis) across model sizes. Shaded region marks the gap between internal representation strength and behavioural expression. Chance baseline at 0.5 (dotted).
To complement the white-box approach, I directly prompted each model to classify SAD questions as evaluation or deployment contexts — a black-box test of whether internal representations manifest in observable behaviour. Each model was tested on 400 balanced SAD benchmark questions (200 evaluation-labeled, 200 deployment-labeled) with a fixed random seed (42) for reproducibility.
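The balanced test-set construction and scoring are straightforward. A sketch, assuming a hypothetical "label" field on each question record (the real SAD records may name it differently):

```python
import random

def balanced_sample(questions, n_per_class=200, seed=42):
    """Draw a balanced eval/deploy test set with a fixed seed."""
    rng = random.Random(seed)
    evals = [q for q in questions if q["label"] == "evaluation"]
    deploys = [q for q in questions if q["label"] == "deployment"]
    return rng.sample(evals, n_per_class) + rng.sample(deploys, n_per_class)

def accuracy(predictions, sample):
    """Fraction of model classifications matching ground-truth labels."""
    return sum(p == q["label"] for p, q in zip(predictions, sample)) / len(sample)
```

With 200 items per class, chance accuracy is exactly 0.5, which is the dotted baseline in Figure 1.3.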
Figure 1.3 shows probe AUROC, black-box accuracy, and peak layer depth together for each family. Across all three families, probe AUROC consistently exceeds black-box accuracy — the internal representation is stronger than what models express behaviourally.
Within Qwen 2.5, the divergence is largest in the 1.5B and 3B models: despite having the highest probe AUROC of any Qwen model (0.791 and 0.792), both score below chance behaviourally (0.490 and 0.475). These are also the models with the latest-layer peaks (depth 0.96 and 0.97). By contrast, Qwen 14B and 32B — which peak in the very first layers — achieve the highest black-box accuracy of any Qwen model (0.615 and 0.588). Within Qwen, earlier-layer representations appear more behaviourally accessible.
Gemma 2 shows the opposite pattern. Gemma 2-2B, which peaks late (depth 0.885), achieves the highest black-box accuracy of any model tested (0.770). Gemma 27B, which shifts to earlier layers after the phase transition (depth 0.304), drops to near chance (0.507). The depth-behaviour relationship inverts between the two families. Gemma 2 uses alternating local and global attention layers, which may affect how representations at different depths connect to output generation. Whether this explains the inverted depth-behaviour relationship is an open question.
Llama 3.2 shows the largest overall divergence: Llama 1B has the strongest probe signal of any model tested (AUROC 0.873) yet scores essentially at chance behaviourally (0.508). The representation is clearly linearly decodable from the residual stream, yet does not manifest in output behaviour — the largest probe/behaviour gap of any model tested.
Internal representations and behavioural expression of eval-awareness are related but distinct. The gap between them is family-dependent and not explained by probe AUROC magnitude alone. The black-box results reinforce this pattern: neither Qwen nor Gemma shows monotonically increasing behavioural accuracy with scale. If anything, the trend reverses — Gemma’s largest model scores near chance while its smallest scores highest. Behaviour, like internal representation, does not follow a power law.
Contamination Diagnostics
The contamination diagnostics in Chaudhary’s own codebase test pipeline robustness to synthetic data corruption, not pretraining exposure to the evaluation benchmark. The ROUGE-L and perplexity diagnostics here are the first direct tests of pretraining contamination on these models for this specific benchmark.
ROUGE-L
Chaudhary et al. (2025) excluded Qwen, citing contamination of the evaluation datasets — but never tested this directly. The probe training data (the Jordine contrastive dataset) is a synthetic dataset created in 2025, making prior exposure during Qwen’s pretraining implausible. The contamination diagnostics below test the evaluation data — SAD stages_oversight — directly.
ROUGE-L completion test. I provided the first 60% of each SAD question as a prompt and asked the model to complete it, withholding the remaining 40% as the reference. If a model has memorised the dataset, completions should closely match the original — producing high ROUGE-L scores. All eleven models score between 0.114 and 0.158, with no meaningful separation between Qwen, Gemma, and Llama. Llama 3.2, used as a baseline with no plausible SAD contamination, is indistinguishable from Qwen and Gemma scores. Notably, Gemma 2-2B — which shows the strongest black-box eval-awareness of any model tested (0.770) — scores the lowest ROUGE-L of any model (0.114). If contamination drove that result, the opposite pattern would be expected.
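The completion test needs only the 60/40 split and an LCS-based ROUGE-L score. A self-contained sketch (function names are mine; the actual pipeline may use a library implementation):

```python
def split_question(text: str, frac: float = 0.6):
    """Split a question into a prompt (first 60% of whitespace tokens)
    and the withheld reference completion (remaining 40%)."""
    toks = text.split()
    cut = int(len(toks) * frac)
    return " ".join(toks[:cut]), " ".join(toks[cut:])

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)
```

A memorised completion reproduces long subsequences of the withheld 40% and pushes the F1 toward 1; the observed 0.114–0.158 range is far from that.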
Perplexity-Based Contamination Diagnostic
A contaminated model assigns lower perplexity to memorised text — it finds familiar text less surprising. I computed mean per-token perplexity on 100 SAD stages_oversight questions and compared it against 100 questions from SAD test_full, a held-out subset from the same benchmark with matching domain, style, and token length. A ratio greater than 1.0 means the model finds stages_oversight harder than the held-out baseline — the opposite of contamination. All tested models show ratios greater than 1.0, ranging from 1.18 (Qwen 32B) to 6.31 (Gemma 9B).
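The ratio itself is simple to compute from per-token negative log-likelihoods. A sketch, under the assumption that per-token NLLs (natural log) have already been collected from the model for both question sets:

```python
import math

def mean_perplexity(per_token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(per_token_nlls) / len(per_token_nlls))

def contamination_ratio(target_nlls, baseline_nlls):
    """Perplexity(target) / perplexity(baseline). Memorisation of the
    target set would push this ratio *below* 1.0; every model tested
    here sits above it."""
    return mean_perplexity(target_nlls) / mean_perplexity(baseline_nlls)
```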
These two diagnostics test for different contamination mechanisms — surface memorisation and token-level familiarity — and both show no signal. This does not rule out semantic contamination through diffuse pretraining exposure to AI evaluation discourse, but it directly addresses the specific claims made by Chaudhary et al., whose cited papers established math benchmark contamination only. The anomalous probe patterns in Qwen and Gemma survive both tests.
Discussion
The findings raise three questions worth engaging with directly, even if definitive answers require further work.
Why do large models encode eval-awareness earlier?
One hypothesis is that smaller models detect the eval/deployment distinction through surface lexical patterns — multiple choice formatting, structured prompt templates, evaluation-specific terminology — that only become detectable after substantial processing in late layers. Larger models may encode the distinction more abstractly, as a property of global prompt context that becomes linearly separable in the earliest attention layers. This would explain both the phase transition and the early-layer peak concentration: larger models don’t need to process the prompt deeply before classifying its context.
It is worth noting that probes measure linear recoverability of a feature, not whether the feature exists. A late-layer peak could mean the feature is only computed late, or it could mean the feature exists throughout but is only linearly separable at that depth. Distinguishing these interpretations would require activation patching experiments that go beyond what probing alone can establish.
Why does the depth-behaviour relationship differ between families?
Within Qwen, models with the latest-layer peaks (1.5B and 3B at depth 0.96–0.97) score below chance behaviourally, while models with the earliest-layer peaks (14B and 32B) score highest. Within Gemma, this relationship inverts. One possible interpretation: the point in the forward pass where a representation becomes most linearly detectable may differ from the point at which it influences output generation. If Gemma’s architecture routes information to output heads differently from Qwen’s, the same late-layer representation could have different downstream effects. Gemma 2 uses alternating local and global attention layers, while Qwen 2.5 uses standard grouped-query attention throughout. Whether this affects how late-layer representations propagate to output generation is unknown; confirming any architectural account would require activation patching or controlled ablation experiments beyond what probing alone can establish.
What the data do establish without mechanistic speculation: the relationship between internal representation and behavioural expression is family-dependent and is not predicted by probe AUROC magnitude alone. Two models can have identical peak AUROC scores and dramatically different black-box accuracy.
What does this mean for evaluation design?
If large models encode eval-awareness in the first few transformer blocks, our benchmarks may be more legible to frontier models than we assumed. The model doesn’t need to reason its way to recognising an evaluation — it appears to classify the context almost immediately. This has a direct practical implication: the surface properties of evaluation prompts matter more at scale, not less. Honeypot authenticity, prompt naturalness, and the absence of benchmark-specific formatting may be more important for eliciting honest behaviour from large models than from small ones. Evaluation designers should treat prompt realism as a frontier-model concern, not an afterthought. This is particularly relevant for sandbagging evaluations — if a model detects it’s being tested within the first few transformer blocks, prompt-level interventions designed to elicit honest capability demonstrations may be insufficient.
Conclusion
This study set out to test a specific claim: that eval-awareness scales predictably as a power law across model families. The data don’t support that claim.
Across 11 models in three families, I find that eval-awareness is real — models internally distinguish evaluation from deployment contexts at above-chance rates across all families tested. But it does not scale uniformly. Within Qwen 2.5 and Gemma 2, probe AUROC is non-monotonic or inversely correlated with model size. The power law, built on one or two data points per family with Qwen excluded on untested grounds, does not survive contact with denser empirical sampling.
The more interesting finding is structural. In both Qwen 2.5 and Gemma 2, the representational depth at which eval-awareness peaks undergoes a size-dependent phase transition: small models peak in the final layers, large models flip to the very first layers. This pattern appears independently in two architecturally distinct families, and contamination diagnostics show no memorisation signal that would explain it away. Where a model encodes eval-awareness turns out to matter — within Qwen, earlier-layer representations correlate with stronger behavioural expression of that awareness.
The safety implication is direct. If frontier-scale models detect evaluation contexts within the first few transformer blocks — before any deep reasoning has occurred — then evaluation design cannot rely on making prompts cognitively complex enough to fool the model. The signal is extracted too early. Benchmark authenticity matters more at scale, not less, and probe-based auditing of internal representations provides a more reliable picture of this capability than behavioural testing alone.
A few caveats worth noting. The phase transition finding is most robustly supported in Qwen 2.5 with six checkpoints; Gemma 2 shows the same pattern with only three, and the precise crossover point between 7B and 14B is unresolved. All probe measurements use a single benchmark split, linear probes measure recoverability rather than feature existence, and semantic contamination through diffuse pretraining exposure to AI evaluation discourse cannot be fully ruled out. Whether the phase transition continues, stabilises, or reverses at frontier scale — the most safety-relevant question — remains untested.
The open question is whether the phase transition generalises. Two families is suggestive, not definitive. Replication across additional families at sufficient within-family resolution is the immediate next step. Mechanistic experiments to understand what is actually being encoded at those early layers would eventually turn these observations into a complete account.
Acknowledgments
This work was completed as part of the BlueDot Impact Technical AI Safety Project Sprint. I’m grateful to BlueDot for the opportunity and for the rapid compute grant that made the cloud GPU experiments possible. Thanks to Eitan Sprejer for weekly mentorship throughout this project — for pushing me to think more carefully about methodology, framing, and the questions worth asking. Thanks also to Shivam Arora for generously taking the time to share his perspective on the research direction.
References
[1] Chaudhary, M., Su, I., Hooda, N., Shankar, N., Tan, J., Zhu, K., Lagasse, R., Sharma, V., & Panda, A. (2025). Evaluation Awareness Scales Predictably in Open-Weights Large Language Models. arXiv preprint arXiv:2509.13333. https://arxiv.org/abs/2509.13333
[2] Golchin, S., & Surdeanu, M. (2023). Time Travel in LLMs: Tracing Data Contamination in Large Language Models. arXiv preprint arXiv:2308.08493. https://arxiv.org/abs/2308.08493 (Published at ICLR 2024 as Spotlight)
[3] Laine, R., Chughtai, B., Betley, J., Hariharan, K., Scheurer, J., Balesni, M., Hobbhahn, M., Meinke, A., & Evans, O. (2024). Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs. arXiv preprint arXiv:2407.04694. https://arxiv.org/abs/2407.04694 (Project site: https://situational-awareness-dataset.org/)
[4] Nguyen, J., Hoang, K., Attubato, C. L., & Hofstätter, F. (2025). Probing and Steering Evaluation Awareness of Language Models. arXiv preprint arXiv:2507.01786. https://arxiv.org/abs/2507.01786
[5] Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv:2312.06681.
[6] Ravaut, M., Ding, B., Jiao, F., Chen, H., Li, X., Zhao, R., Qin, C., Xiong, C., & Joty, S. (2024). A Comprehensive Survey of Contamination Detection Methods in Large Language Models. arXiv preprint arXiv:2404.00699. https://arxiv.org/abs/2404.00699 (Accepted to TMLR)
Datasets & Code Repositories
[7] Jordine contrastive dataset (2025). https://github.com/Jordine/pivotal-test-phase-steering
[8] Archit Manek (2026). evaluation-awareness-probing: Replication code and results for the eval-awareness probing study. https://github.com/archit-manek/evaluation-awareness-probing
[9] Chaudhary et al. (2025). Anonymous codebase for evaluation-awareness-scaling-laws experiments. https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md