Architecture: Mamba 100M — 24 layers, d_model=768, selective state space (SSM), no attention  ·  Sequenced: Classical curriculum order (physical → fables → ancients → logic → rhetoric → poetry)  ·  Shuffled: Standard practice — same tokens, block-level random order  ·  Probes: 54 concepts × 3 domains (physical, institutional, moral), pre-registered  ·  Compare with: GPT 1B results

Seq eff. dim: 37–41 (final layer range)
Shuf eff. dim: 20–40 (collapses deep)
Seq silhouette: −0.002 (mean, no clustering)
Shuf silhouette: −0.033 (anti-clustered)
Shuf val gap: 5.35 (memorizing)
Cond. CKA: 0.569 (mean seq vs shuf)
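
For readers unfamiliar with CKA, here is a minimal sketch of how a sequenced-vs-shuffled similarity score could be computed. It assumes the metric is plain linear CKA between the two models' activations on the same 54 probes; the page does not say which CKA variant "Cond. CKA" refers to, so treat this as one plausible reading. The function name `linear_cka` and the random stand-in data are hypothetical.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices whose rows are the same probes."""
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Synthetic stand-ins for one layer's activations: 54 probes x 768 dims.
rng = np.random.default_rng(0)
seq_layer = rng.normal(size=(54, 768))
shuf_layer = rng.normal(size=(54, 768))

score = linear_cka(seq_layer, shuf_layer)
self_score = linear_cka(seq_layer, seq_layer)  # identical inputs give 1
```

A per-layer score like this would then be averaged over layers to get a single "mean seq vs shuf" number, if that is indeed how the 0.569 was produced. Note that with far fewer probes than dimensions, CKA on unrelated random data can still come out high, so the synthetic value here carries no meaning.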

How does Mamba organize knowledge?

These scatter plots show where each of the 54 concept probes lands in the model's representation space at the final layer, measured at training step 5000. Each dot is one concept; colors indicate domain.

Does curriculum ordering produce different geometric structure in an SSM?

For both models, PCA projects the 768-dimensional activations down to 2 dimensions. The shuffled model's activations are projected into the sequenced model's eigenvector space and then Procrustes-rotated, so the two panels are directly comparable. Tight clusters of same-color dots indicate domain organization; scattered mixing indicates none.
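
The projection and alignment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the plots: the function names are hypothetical, the activations are random stand-ins, each model is centered on its own mean, and the Procrustes step is the orthogonal variant (rotation or reflection, no scaling).

```python
import numpy as np

def pca_basis(X, k=2):
    """Return the top-k principal axes of X as columns of a (d, k) matrix."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F."""
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt

def aligned_projection(seq_acts, shuf_acts, k=2):
    # Fit the 2-D basis on the sequenced model only.
    axes = pca_basis(seq_acts, k)
    seq_2d = (seq_acts - seq_acts.mean(axis=0)) @ axes
    # Project the shuffled model into that same basis (assumed: own-mean centering).
    shuf_2d = (shuf_acts - shuf_acts.mean(axis=0)) @ axes
    # Rotate the shuffled points to best match the sequenced ones.
    R = procrustes_rotation(shuf_2d, seq_2d)
    return seq_2d, shuf_2d @ R

# Synthetic stand-ins: 54 probes x 768 dims per model.
rng = np.random.default_rng(0)
seq_acts = rng.normal(size=(54, 768))
shuf_acts = rng.normal(size=(54, 768))
seq_2d, shuf_2d = aligned_projection(seq_acts, shuf_acts)
```

Fitting the basis on one model and reusing it for the other keeps the axes identical across panels; the Procrustes step then removes any remaining arbitrary rotation so that differences between panels reflect structure rather than orientation.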

[Figure: PCA scatter plots of the 54 concept probes at L24. Panels: Mamba Sequenced, Mamba Shuffled. Legend: Physical, Institutional, Moral.]

Each dot is one word — a probe concept the model has seen during training. We feed that word to the trained model and record its internal activation at each layer.

Colors are domains. Teal = physical world (mass, velocity, friction). Yellow = institutional (contract, authority, property). Red = moral (fairness, harm, duty).

Clusters indicate organization. If same-color dots group together, the model represents concepts from the same domain as related. If colors are mixed randomly, its representation does not separate concepts by domain.

Analogy: imagine plotting people by where they live, using GPS. A model with good geographic structure would cluster neighbors together. A model without it would scatter them randomly across the map.

GPT (attention) builds clusters. Mamba (SSM) does not, regardless of training order. Both conditions show negative silhouette throughout training; the means are −0.002 for the sequenced model (effectively no clustering) and −0.033 for the shuffled model (mildly anti-clustered).
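
To make the silhouette numbers concrete, here is a minimal sketch of the standard silhouette score computed over domain labels. It assumes Euclidean distance and 18 concepts per domain (54 split evenly across 3); the page specifies neither, and the data below is synthetic. The function name `silhouette_scores` is hypothetical.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton domain: silhouette is defined as 0
        a = dists[i, same].mean()  # mean distance to its own domain
        # Smallest mean distance to any other domain.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

rng = np.random.default_rng(0)
labels = np.repeat(["physical", "institutional", "moral"], 18)

# Case 1: each domain sits around its own well-separated center.
centers = np.repeat(np.eye(3) * 10.0, 18, axis=0)
clustered = centers + rng.normal(size=(54, 3))
# Case 2: positions carry no domain information at all.
mixed = rng.normal(size=(54, 3))

clustered_mean = silhouette_scores(clustered, labels).mean()  # well above 0
mixed_mean = silhouette_scores(mixed, labels).mean()          # close to 0
```

A mean near +1 means each concept sits much closer to its own domain than to any other; a mean near 0 means domain labels say little about position; a negative mean means concepts are, on average, closer to another domain than to their own.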