We Should Consider Educating Models Before Training Them

Representational geometry across four training conditions

Setup: GPT (12-layer, 768d, ~92M params) trained from scratch on a 113-text classical curriculum corpus.

Conditions: Educated (sequenced, 5K steps), Trained (shuffled, 5K steps), Continuation (5K educated + 5K shuffled), Extended (sequenced, 10K steps).

Probing: 54 concepts (18 physical, 18 institutional, 18 moral), hidden state activations at each layer.

Seeds: 5 random seeds for educated/trained; single run for continuation/extended. Error bands are shown where seed data exists.
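The four conditions differ only in how the same texts are ordered and for how many steps. A minimal sketch of that design follows. It assumes shuffling is applied to whole texts (the source does not say whether texts or smaller chunks are shuffled), and `CONDITIONS`, `total_steps`, and `text_order` are hypothetical names used only for illustration.

```python
import random

# Each condition is a list of (ordering, steps) phases, taken from the setup above.
CONDITIONS = {
    "educated": [("sequenced", 5000)],
    "trained": [("shuffled", 5000)],
    "continuation": [("sequenced", 5000), ("shuffled", 5000)],
    "extended": [("sequenced", 10000)],
}

def total_steps(condition):
    """Total optimizer steps across all phases of a condition."""
    return sum(steps for _, steps in CONDITIONS[condition])

def text_order(n_texts, ordering, seed=0):
    """Return text indices in curriculum order or in a seeded shuffle.

    Assumption: index order 0..n_texts-1 stands in for the authored curriculum.
    """
    order = list(range(n_texts))
    if ordering == "shuffled":
        random.Random(seed).shuffle(order)
    return order
```

Both orderings visit exactly the same 113 texts; only the sequence differs, which is the single variable the comparison isolates.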

Educated gap: 0.026 (train-val gap on curriculum)

Trained gap: 1.47 (same data, shuffled order)

Continuation gap: 0.46 (education, then training)

Sequenced vs shuffled CKA: 0.595 (5-seed mean, std 0.10)

What training order changes

Same data, same architecture, same hyperparameters. Only the sequence changes. The educated model develops progressive dimensional expansion (14 to 25 effective dimensions), clean domain clustering, and steadily increasing SAE feature diversity (1,305 to 7,445 alive features). The shuffled model collapses to 10 dimensions mid-network, loses domain structure, and shows erratic SAE behavior with a panic spike at the final layer.

The gap is the train-validation loss divergence on the curriculum corpus itself; it is not a measure of downstream task performance. A low gap indicates that the model has learned patterns that generalize within the corpus rather than memorizing specific sequences. The educated model has a higher absolute loss but a lower gap, which is consistent with it maintaining richer internal distinctions that make next-token prediction harder for principled reasons.

Education persists through training

The continuation model (5K educated, then 5K shuffled) achieves a gap of 0.46 vs the shuffled model's 1.47. Its CKA with the educated parent is 0.756; with the shuffled model, 0.870. The geometry moves toward the shuffled model's basin, but the structural organization from education persists: domain clustering beats shuffled-only at 13 of 13 layers.

The continuation exceeds both parents

The continuation model achieves 19 to 28 effective dimensions, higher than the educated model's 14 to 25 or the shuffled model's 10 to 16. Education followed by training produces richer representations than either alone. The extended model (10K educated) reaches 20 to 33 dimensions, the richest of all four conditions.

Effective dimensionality across layers (90% variance threshold)

Series: Educated, Trained, Continuation, Extended.

Effective dimensionality is the number of SVD components needed to explain 90% of the variance in the concept activation matrix at each layer. Educated: steady expansion from 14 to 25. Shuffled: collapse to 10-12 mid-network, with a participation ratio of 1.4 (near-degenerate). Continuation: 19 to 28, exceeding both parents. Extended: 20 to 33, the richest of all four.

Dimensionality and silhouette data for educated/trained conditions shown here are from the standard-corpus (113-text) single run, consistent with the published paper results. Five-seed stability is confirmed separately (gap variance across seeds: 0.10 vs 1.56).
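The two dimensionality measures named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the concept activation matrix has shape (concepts, hidden size) and is mean-centered before the SVD, and it uses the standard participation ratio, (sum of eigenvalues)^2 / (sum of squared eigenvalues). The function names are hypothetical.

```python
import numpy as np

def _variances(acts):
    """Per-component variance: squared singular values of the centered matrix."""
    centered = acts - acts.mean(axis=0, keepdims=True)  # centering is an assumption
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values ** 2

def effective_dim(acts, threshold=0.90):
    """Smallest number of SVD components whose cumulative variance reaches threshold."""
    var = _variances(acts)
    cumulative = np.cumsum(var) / var.sum()
    # searchsorted gives the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

def participation_ratio(acts):
    """(sum of variances)^2 / (sum of squared variances); 1.0 means one dominant direction."""
    var = _variances(acts)
    return float(var.sum() ** 2 / (var ** 2).sum())
```

With 54 concepts the centered matrix has rank at most 53, so both measures are bounded well below the 768 hidden dimensions. A participation ratio near 1 means almost all variance lies along a single direction.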

SAE alive features per layer (of 12,288 total)

Series: Educated, Trained, Continuation, Extended.

Sparse autoencoders (L1 sparsity coefficient 0.001, 16x expansion) were trained on 200K tokens from each model. The number of features that activate above threshold shows how the model distributes information across the dictionary. Each condition produces a different profile from the same architecture.

Educated: smooth progressive expansion (1,305 to 7,445).

Trained: erratic, with a mid-network collapse to 1,083 followed by full-dictionary panic at L12 (12,288).

Continuation: compressed early (466-725), then smooth expansion to 10,165.

Extended: rapid mid-network expansion, with full dictionary saturation (12,288) by L10.

The extended model saturates the 16x dictionary from L7 onward. A 32x expansion (24,576 features) resolves this: alive features drop from 12,288/12,288 to 1,014/24,576 at L7 and 4,284/24,576 at L12, revealing the same progressive expansion pattern as the educated model. The 16x data is shown here for visual comparability across conditions; the saturation is a capacity artifact, not a distinct signature.
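The alive-feature counts and the 16x versus 32x comparison reduce to a simple count and a ratio. The sketch below assumes a feature is "alive" if it exceeds the threshold on at least one token; the source does not state the threshold or the exact criterion, so both are placeholders. All function names are hypothetical.

```python
import numpy as np

def dictionary_size(d_model, expansion):
    """Number of SAE features for a given expansion factor."""
    return d_model * expansion

def count_alive(feature_acts, threshold=0.0):
    """Count features that exceed threshold on at least one token.

    feature_acts: array of shape (n_tokens, n_features) of SAE activations.
    The threshold value is an assumption, not taken from the source.
    """
    return int((feature_acts.max(axis=0) > threshold).sum())

def utilization(alive, dict_size):
    """Fraction of the dictionary that is alive."""
    return alive / dict_size

# Reported counts for the extended model at 32x expansion
size_32x = dictionary_size(768, 32)        # 24,576 features
util_l7 = utilization(1014, size_32x)      # about 4% at L7
util_l12 = utilization(4284, size_32x)     # about 17% at L12
```

At 16x the same layers report 12,288 of 12,288 features alive, a utilization of 1.0. That ceiling is why the count stops being informative once the dictionary saturates.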

Domain clustering (silhouette score) across layers

Series: Educated (5-seed range), Trained (5-seed range).

Silhouette score measures whether concepts cluster by domain. The educated model maintains stronger clustering than the trained model at 12 of 13 layers (13 of 13 for the standard-corpus run).

Absolute silhouette scores are weak for both conditions (peak ~0.06 for educated, ~0.05 for trained). This is expected: 54 concepts in 768 dimensions is geometrically sparse, and any three-cluster structure will show low absolute silhouette. The meaningful signal is the consistent relative difference and the 12/13 or 13/13 layer count, not the absolute values. Permutation test p < 0.001 confirms the educated model's clustering exceeds chance levels.
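A silhouette score with a label-permutation test can be sketched as below. The source does not specify the distance metric or the permutation scheme, so this assumes Euclidean distance and random shuffling of the domain labels. The function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X (n_points, n_dims)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def silhouette_from_distances(D, labels):
    """Mean silhouette: (b - a) / max(a, b) averaged over all points."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same <= 1:
            continue  # singleton cluster scores 0 by convention
        a = D[i, same].sum() / (n_same - 1)  # mean distance to own cluster (D[i, i] is 0)
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

def permutation_p_value(X, labels, n_perm=1000, seed=0):
    """Share of label shuffles whose silhouette is at least the observed one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    D = pairwise_distances(X)  # distances do not depend on labels, so compute once
    observed = silhouette_from_distances(D, labels)
    exceed = sum(
        silhouette_from_distances(D, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

With the add-one correction, at least 1,000 permutations are needed before a value below 0.001 can be reported. Because the test compares against shuffled labels on the same points, a small absolute silhouette can still be reliably above chance.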

Pairwise CKA matrix (mean across 13 layers)

The continuation model is geometrically closer to the trained model (0.870) than to the educated model (0.756). This is expected: 5K steps of shuffled training pull the geometry toward the shuffled basin. But the educated structure survives within that shifted geometry, as shown by the continuation model's superior domain clustering and dimensionality relative to the trained model. CKA measures overall geometric similarity; domain clustering measures structural organization. The two can diverge.
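For reference, a layer-averaged CKA like the values above can be computed as follows. The source does not say which CKA variant was used; this sketch assumes linear CKA on mean-centered activations, with one (concepts, hidden size) matrix per layer. `linear_cka` and `mean_layer_cka` are hypothetical names.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows.

    X: (n_concepts, d_x), Y: (n_concepts, d_y). Rows must refer to the same concepts.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

def mean_layer_cka(acts_a, acts_b):
    """Average linear CKA over layers.

    acts_a, acts_b: arrays of shape (n_layers, n_concepts, d_model).
    """
    return float(np.mean([linear_cka(a, b) for a, b in zip(acts_a, acts_b)]))
```

Linear CKA is invariant to rotations and uniform rescaling of the hidden space. It therefore compares the overall geometry of the 54 concept vectors rather than individual neurons, which is why it can disagree with a label-based measure such as domain clustering.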

Educated vs trained CKA across layers (5-seed mean +/- 1 std)

The embedding layer shares 0.983 CKA (near-identical tokenization geometry). By L3 the geometries diverge to 0.500. The gradual rise toward L12 (0.605) reflects convergence in the final prediction layer. Seed 1337 is a consistent outlier (0.794 mean CKA) while the other four seeds cluster between 0.449 and 0.568.

The cross-experiment comparison to Llama vs Qwen (CKA 0.96) is suggestive but not directly comparable: different models, different hidden dimensions, different concept inventories. The within-experiment baseline is the 5-seed spread, which confirms the seq-vs-shuf divergence exceeds seed variance at every layer.

54-concept inventory (18 per domain)

Domains: Physical (18), Institutional (18), Moral (18).

About the inventory

Each concept is probed via its hidden state activation when the model processes text containing that concept. Three domains spanning physical sciences, institutional structures, and moral reasoning. The inventory tests whether training order affects how models organize conceptual knowledge across fundamentally different domains.
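One way to turn per-concept hidden states into the per-layer concept activation matrix is sketched below. The source does not describe the pooling step, so mean-pooling over the tokens of the concept mention is an assumption. The array shapes and the names `concept_vector` and `concept_matrix` are hypothetical.

```python
import numpy as np

def concept_vector(hidden_states, span):
    """Mean-pool hidden states over a token span, separately for each layer.

    hidden_states: (n_layers, n_tokens, d_model) for one probe text.
    span: (start, end) token indices of the concept mention, end exclusive.
    Mean-pooling is an assumption; the source does not specify the pooling.
    """
    start, end = span
    return hidden_states[:, start:end, :].mean(axis=1)  # (n_layers, d_model)

def concept_matrix(per_concept_states, spans):
    """Stack concept vectors into shape (n_layers, n_concepts, d_model)."""
    vectors = [concept_vector(h, s) for h, s in zip(per_concept_states, spans)]
    return np.stack(vectors, axis=1)
```

Indexing the result by layer gives one (54, 768) matrix per layer for this setup. That matrix is the input shape expected by dimensionality, clustering, and CKA measures.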

Scope and limitations

Single architecture and scale. All results are from a 12-layer, 768-dimension GPT (~92M parameters). The curriculum ordering effect has not yet been tested on different architectures (Mamba, RWKV, AttnRes) or at larger scale (1B+ parameters). The four-architecture 1B scaling experiment is in preparation.

Single corpus. The 113-text classical curriculum is Western-philosophical in content. Whether the effect generalizes to other curricula (scientific, cross-cultural, mathematical) is untested. The cross-cultural concept inventory (MCAB, 200+ concepts) is in development.

No downstream evaluation. Gap, dimensionality, and clustering measure representational structure, not task performance. The educated model has not been evaluated on standard NLP benchmarks. At 92M parameters, benchmark performance would be dominated by scale rather than training order.

SAE dictionary saturation. The extended model saturates the 16x SAE dictionary from L7 onward. A 32x expansion confirms that this is a capacity artifact: utilization drops to 12-17% at deeper layers, showing the same progressive expansion as the educated model. The other three conditions do not saturate at 16x.

Weak absolute clustering. Silhouette scores are low in absolute terms (<0.06) for all conditions. The relative comparison (educated > trained at 12-13/13 layers) is robust, but the absolute magnitude means the domain structure is subtle, not dramatic.