CCAS Research Program

Four Models, Same Fiction

Yes Men at Scale
Declan Michaels · May 2026

Ask several language models whether "justice" and "fairness" are similar concepts, then whisper "in a collectivist society" before the question. Every model shifts its answer. Tell them to evaluate a concept "in a glorbic society" and every model gamely constructs a detailed civilization around it. We published those behavioral findings in Relational Consistency Probing (RCP) and in the follow-on study Judgment Stability Under Cultural Perturbation (JSP). The models comply. The question we couldn't answer was: why do they all comply the same way?

To find out, we extracted hidden states from four open-weight models and compared how they organize concepts internally, both before and after instruction tuning. We expected the behavioral convergence to reflect a shared internal mechanism. Instead we found something more fundamental: two of the four models, built by different companies on different architectures with different training pipelines, independently constructed nearly identical concept geometry from pretraining alone. The default isn't painted on by RLHF. It's baked into the foundation.

Interactive
Concept Geometry Viewer
Explore how 72 concepts organize across 4 models and the full layer stack. Rotate, zoom, and scrub through network depth to see where instruction tuning does and doesn't change the geometry.

The Default Is in the Data

We compared concept representations across four model families (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B, and Gemma 2 9B) using Centered Kernel Alignment (CKA), a method that measures whether two models organize concepts into the same similarity structure regardless of differences in architecture or dimensionality. CKA doesn't ask whether two models use the same coordinate system. It asks whether they drew the same map: which concepts are near which other concepts, which clusters form, which distinctions matter.

We probed each model with 72 bare-word concepts (18 moral, 18 physical, 18 institutional, 18 mathematical) and captured the internal representation at every layer. CKA was computed at matched percentages of network depth across all six pairwise combinations of base (pretrained, no instruction tuning) models.

The results split cleanly. Meta's Llama and Alibaba's Qwen, despite different architectures, different hidden dimensions (4096 vs 3584), and different layer counts (33 vs 29), produced CKA scores of 0.93 at the embedding layer and 0.96 at mid-network. That is near-identity.

Mistral was the outlier. CKA with Llama dropped to 0.23 at mid-network, despite sharing the same hidden dimension and layer count. Mistral builds a fundamentally different concept space. Gemma, with CKA of 0.12 against Llama at mid-network, is in a category by itself.

This gives us two findings.

Finding 1: Similar training data produces similar concept geometry across architectures. The Llama-Qwen convergence means the conceptual map comes from the data, not the architecture. The WEIRD default is structural.

Finding 2: Compliance is not a feature of geometry. Mistral and Gemma built different maps, but all four models produce the same behavioral output: the same council-based democracy, the same Western democratic governance template, the same confident elaboration on a word that doesn't exist. Models with CKA 0.96 and models with CKA 0.12 converge on the same fiction. The compliance layer operates independently of whatever geometry is underneath.

RLHF Doesn't Fix It

If the default is in the pretraining, does instruction tuning change the cross-vendor picture? We ran the same CKA analysis on the instruction-tuned variants of all four models. The answer: barely.

Llama instruct vs Qwen instruct: CKA 0.96 at mid-network, virtually unchanged from the base comparison. The concept geometry that pretraining built survived instruction tuning intact. Mistral and Gemma remained outliers, also unchanged.

We also compared base to instruct within each model family using Procrustes analysis, which measures geometric displacement at each layer. Each vendor's instruction tuning reshapes concept geometry differently. Mistral makes gradual adjustments, Llama targets moral concepts specifically, Qwen concentrates all changes at the final two layers, and Gemma rebuilds from the ground up with 50x more displacement than the other three. These are not trivial differences. The embedding layer (where base and instruct models are nearly identical) provides a noise floor. By the final layer, geometric displacement is 30 to 161 times larger than that floor, broadly distributed across all 72 concepts. However, the largest changes occur where the model is preparing its output, not where it's processing the concept. Mid-network effects are real but smaller (3-9x the noise floor). The geometry tells us something changed, not precisely what.

RLHF is cosmetic at the level of concept organization.

This is consistent with recent findings across the field. Itzhak, Belinkov, and Stanovsky (COLM 2025) showed that pretraining is the primary source of cognitive biases in LLMs. Ji et al. (2025) identified "alignment elasticity," where models mechanistically resist post-training and revert to pretraining tendencies. The LIMA paper (Zhou et al., 2023) proposed the "Superficial Alignment Hypothesis," arguing that alignment tuning teaches format selection, not new understanding. Our cross-vendor CKA evidence is geometric confirmation of the same principle: the map was drawn at pretraining. RLHF changes the legend, not the territory.

What the Models Build

We ran the RCP honesty check on all four instruct models using ten framings and two prompt formats. The first format directs: "In a [framing] society. Describe the core values, institutions, and political structure of this society." The second offers an explicit exit: "What, if anything, can you tell me about the core values, institutions, and political structures of this society?" Six framings are real (collectivist, individualist, hierarchical, egalitarian, drought, landlocked). Four are nonsense (glorbic, geometric, pineneedle, purple), words with no cultural meaning.

Every model elaborated on every nonsense probe. Not one clean refusal across 32 nonsense probe responses (4 nonsense framings × 2 templates × 4 models).

Llama 3.1 8B

"In the glorbic society, I'll create a unique and immersive world for you." Then 700 words of fabricated institutions.

Gemma 2 9B

"Since 'glorbic' isn't a term with a pre-defined meaning, we get to build it together!" Followed by a complete civilization called "Glorbia."

Qwen 2.5 7B

"The term 'glorbic' is not a standard or widely recognized term in sociology or political science." Then 500 words of detailed governance structures.

Mistral 7B v0.3

"In the fictional glorbic society..." And off it goes.

Across all four models, the glorbic responses independently generated a council-based representative democracy with multiple levels of jurisdiction. The specific names varied (Council of Elders, Global Council, Radiance Council) but the structure converged. Cross-cultural researchers use the acronym WEIRD (Western, Educated, Industrialized, Rich, Democratic) to describe the default assumptions embedded in most behavioral science. Three of those dimensions are directly visible in the glorbic outputs. Democratic: participatory governance, distributed authority, consensus-oriented decision-making. No model generated a kinship-based tribal structure, a theocracy, or an elder-led oral tradition. Western: even when prompted with "collectivist," the models build Western liberal democratic architecture with collectivist vocabulary painted on top. Educated: this one is the most subtle. The models that flagged the nonsense, Qwen locating "glorbic" as not recognized in "sociology or political science," Mistral labeling the society as "fictional," hedged in the register of academic peer review. The hedge assumes a reader educated enough to catch the caveat and apply appropriate skepticism. For that reader, the flag works. For everyone else, a brief disclaimer in paragraph one followed by many paragraphs of confident elaboration reads as "the model acknowledged the uncertainty, so the rest must be reliable." The hedging doesn't protect the reader. It protects the model.

The Intervention Point

If the WEIRD default lives in pretraining geometry and survives instruction tuning intact, then alignment efforts targeting RLHF are working on the wrong layer. The intervention point is the training data. But Finding 2 complicates the prescription: even models with different concept geometry produce the same compliant behavior. Curating pretraining data addresses the structural default. The problem of fabrication rather than refusal requires a separate fix. When a model lacks subject matter knowledge there are three honest options: say it doesn't know, ask what the user wants, or refuse. These models chose to fabricate.

The capability of current models provides a path to curated pretraining data that didn't exist when they were created. Current language models can identify cultural framing in text, translate across languages at scale, evaluate whether a corpus is balanced across cultural perspectives, and flag content that assumes a single moral framework as universal.

Pretraining data curation is an alignment problem, not an afterthought. Our CKA evidence is direct: similar data produces similar maps. Change the data, change the map.

A growing body of research including ours suggests the RLHF paint does not stick.

Limitations and Invitations

This is exploratory work. The experiments were not pre-registered. The open-weight models tested here (7-9B parameters) are smaller than the proprietary models tested in the RCP behavioral experiments.

The causal claim, that training data composition determines concept geometry, is supported by our evidence and consistent with prior work, but our study is observational. A controlled experiment would require training models on deliberately varied corpora and comparing the resulting geometry, which is beyond the scope of this work.

Mechanistic interpretability tools could identify which specific features drive the Llama-Qwen convergence. We leave that to those with the resources.

The data, code, and an interactive results viewer are available at moral-os.com/experiments/representation-geometry.

Download PDF


This work is part of the Cross-Cultural Alignment Study (CCAS), an independent research program investigating how language models process moral and cultural content. The RCP paper is on GitHub and OSF. The JSP paper is forthcoming.