Mechanistic interpretability · post‑training · alignment

Do LLMs grow internal “modules” during post‑training?

A readable guide to the evidence for emergent structures inside language models: features, circuits, task vectors, skill neurons, truthfulness/refusal directions, and why “module” is useful but misleading.

The short answer

Yes, it is reasonable to talk about internal structures that behave a bit like modules. But current evidence points less to clean software-like packages and more to semi-local, reusable latent structures: sparse features, circuits, attention heads, MLP subspaces, task vectors, and activation-space directions.

Pretraining grows a large ecology of latent features and circuits. Post-training usually reshapes access, routing, readout, and behavioral gating rather than installing clean new organs named “math”, “coding”, or “ethics”.

A more precise vocabulary

The intuitive labels — logic module, math module, ethics module — are useful for conversation. Mechanistically, the evidence suggests several different kinds of internal structure.

Features: Directions or patterns in activation space corresponding to concepts, entities, behaviors, or contexts.
Circuits: Interacting components, such as attention heads and MLPs, that implement a specific computation.
Task / function vectors: Compact internal representations of “what task is being demonstrated here?”
Skill neurons: Neurons whose activations are predictive or causal for task performance, though often embedded in larger distributed systems.
Behavior directions: Steerable population-level directions for truthfulness, refusal, harmlessness, confidence, or style.
Superposition: Many features packed into shared neurons/dimensions, making clean modular boundaries hard to recover.
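
To make "a direction in activation space" concrete, here is a minimal sketch of one common way such a direction is estimated: take the difference between the mean activation on examples that contain a concept and the mean on examples that do not, then score new activations by projecting onto it. The 3-dimensional activations and every function name below are hypothetical and chosen for illustration only; real activations come from a model's hidden states and have thousands of dimensions.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def difference_of_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'without' activation to the mean 'with' one."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

# Hypothetical toy activations (assumption: not taken from any real model).
with_concept = [[2.0, 0.1, 1.0], [1.8, -0.1, 1.2], [2.2, 0.0, 0.8]]
without_concept = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.2], [-0.2, 0.0, 0.8]]

direction = difference_of_means_direction(with_concept, without_concept)

# Projecting onto the direction gives a scalar "how much of this feature" score.
score_pos = dot(with_concept[0], direction)
score_neg = dot(without_concept[0], direction)
```

In this toy data the two groups differ only along the first coordinate, so the recovered direction lies almost entirely on that axis and the projection separates the groups cleanly. Real features are rarely this tidy.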

Human phrase → mechanistic-ish counterpart

Human phrase | More careful internal interpretation
“Math module” | Reasoning circuits, scratchpad-like token trajectories, algorithmic features, attention/MLP subnetworks, and learned selection policies.
“Coding module” | Code-token syntax features, API/library knowledge, program-structure representations, and reasoning patterns learned from code corpora.
“Ethics / safety module” | Refusal and harmlessness directions, policy-conditioned behavioral gates, and post-training-induced response preferences.
“Truthfulness module” | Latent truth/factuality directions plus readout/calibration mechanisms that determine whether the model says what it internally represents.
“Personality / style module” | Highly steerable activation-space directions and learned response policies for tone, verbosity, deference, etc.

Curated paper map

Grouped by what each line of work tells us about “modules” growing or becoming accessible inside LLMs.

1. Post‑training reshapes internal representations

2. High‑level behavior can appear as activation directions

Function Vectors: shows that in-context input-output tasks can be represented as compact vectors transported by a small number of attention heads. This is one of the clearest “module-like” findings: a task can become a reusable internal object.

3. Skill neurons and factual-memory localization

MEMIT extends direct model editing to many factual memories at once. It is helpful for thinking about where “knowledge structures” live and how localized they are.
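
The intuition behind direct weight editing can be sketched with a rank-one update to a single linear map treated as a key-value memory. This is a deliberately simplified illustration, not the MEMIT procedure: it ignores the key statistics that the published editing methods use to weight the update, edits one toy matrix rather than a model's MLP layers, and uses hypothetical names and values throughout.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W' = W + (new_value - W key) key^T / (key . key).

    W' maps `key` exactly to `new_value`, while any input orthogonal
    to `key` is mapped exactly as before.
    """
    residual = [v - o for v, o in zip(new_value, matvec(W, key))]
    kk = sum(k * k for k in key)
    return [
        [w + r * k / kk for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

# Hypothetical 2x3 "memory" matrix and vectors (assumption: toy values).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
key = [1.0, 0.0, 0.0]        # stands in for a "subject" representation
new_value = [5.0, -1.0]      # stands in for the edited "fact"
other = [0.0, 1.0, 0.0]      # an unrelated input, orthogonal to key

W_edited = rank_one_edit(W, key, new_value)
edited_output = matvec(W_edited, key)      # equals new_value
untouched_output = matvec(W_edited, other) # same as matvec(W, other)
```

The point of the sketch is the locality question: the edit changes the output for one key while leaving orthogonal inputs alone, but inputs that overlap with the key are also shifted, which is one reason "where a fact lives" is hard to pin down.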

4. Features, circuits, and why modules are messy

Toy Models of Superposition: explains why many concepts can be packed into shared neurons or dimensions. This is the central reason “modules” are hard to find cleanly: internal representations are often superposed and polysemantic.
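
A small numerical sketch can show what packing features into shared dimensions looks like. Five hypothetical features are stored in only two dimensions by giving each one a direction on a circle. The layout, the dot-product-plus-ReLU readout, and all names are assumptions made for illustration, not a reproduction of any paper's experiment.

```python
import math

N_FEATURES, N_DIMS = 5, 2

# One unit direction per feature, evenly spaced on a circle (hypothetical layout).
directions = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]

def encode(feature_values):
    """Compress N_FEATURES values into N_DIMS by summing scaled feature directions."""
    return tuple(
        sum(v * d[axis] for v, d in zip(feature_values, directions))
        for axis in range(N_DIMS)
    )

def decode(hidden):
    """Read each feature back with a dot product followed by a ReLU."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in directions]

# Sparse case: only feature 1 is active, so it can be recovered.
sparse_input = [0.0, 1.0, 0.0, 0.0, 0.0]
readout = decode(encode(sparse_input))
recovered = max(range(N_FEATURES), key=lambda i: readout[i])

# Dense case: every feature is active and the directions cancel out.
dense_input = [1.0, 1.0, 1.0, 1.0, 1.0]
dense_hidden = encode(dense_input)
```

With a single active feature, the readout peaks at the right index, but the two neighboring features also receive a nonzero score, which is the interference that makes individual dimensions look polysemantic. With all features active, the hidden vector collapses to roughly zero, so the scheme only works when features are sparse.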

Induction heads: a concrete example of an emergent transformer circuit that supports in-context pattern copying and induction. This is a useful, clear case where a real computation was reverse-engineered.
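
The input-output rule behind in-context pattern copying can be written in a few lines: if the current token appeared earlier, predict whatever followed it last time. The sketch below captures only this behavioral rule. In a transformer the rule is carried out by attention heads rather than an explicit loop, and the function name and example tokens are hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token.

    Returns None if the current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# Pattern [A][B] ... [A] -> predict [B]
sequence = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
prediction = induction_predict(sequence)

# No earlier occurrence of the last token: nothing to copy.
no_match = induction_predict(["a", "b", "c"])
```

Here `prediction` is `"Dursley"` and `no_match` is `None`. The rule is content-independent: it works on any repeated token pair, which is part of why this circuit reads as a reusable mechanism rather than memorized knowledge.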

What post‑training probably does

My synthesis: post-training usually does not create a clean new “ethics module” or “math module” from scratch. It changes which latent structures are easy to activate, which outputs are preferred, and which internal signals are routed into the final answer.

Domain | Likely story
Math / coding | Mostly pretrained capabilities plus post-training that teaches better deployment, step-by-step behavior, and selection among strategies.
Reasoning | Post-training can encourage longer trajectories, self-checking patterns, and better use of latent strategies; not necessarily a single reasoning organ.
Ethics / refusal / helpfulness | More post-training-shaped: behavioral control directions, refusal gates, preference policies, and instruction-following norms.
Truthfulness / confidence | Partly latent in pretrained representations; post-training changes readout, calibration, and whether the model says what it internally represents.
Style / personality | Often highly steerable and vector-like: tone, verbosity, politeness, and role behavior can be shifted by prompts or activation interventions.
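
The phrase "activation interventions" usually refers to two simple operations on a hidden-state vector: adding a scaled direction (steering) and projecting a direction out (ablation). The sketch below shows both on a toy vector. The activation, the direction, and the function names are hypothetical, and in practice the operation would be applied inside a model's forward pass at a chosen layer.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(activation, direction, alpha):
    """Shift the activation by alpha units along the unit-normalized direction."""
    d = normalize(direction)
    return [a + alpha * di for a, di in zip(activation, d)]

def ablate(activation, direction):
    """Remove the activation's component along the direction."""
    d = normalize(direction)
    coeff = dot(activation, d)
    return [a - coeff * di for a, di in zip(activation, d)]

# Hypothetical toy values (assumption: not taken from any real model).
activation = [1.0, 2.0, 3.0]
style_direction = [0.0, 3.0, 4.0]

steered = steer(activation, style_direction, alpha=2.0)
ablated = ablate(activation, style_direction)

unit_dir = normalize(style_direction)
before = dot(activation, unit_dir)       # component along the direction
after_steer = dot(steered, unit_dir)     # increased by alpha
after_ablate = dot(ablated, unit_dir)    # approximately zero
```

Steering raises the component along the direction by exactly `alpha`, and ablation sets it to approximately zero while leaving the orthogonal part of the activation unchanged. That such cheap edits can shift behavior is the main evidence for calling these structures "vector-like".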

Suggested reading path

If you want the shortest path through the topic, read in this order.

1. Start with post-training: Read How Post‑Training Reshapes LLMs for the directly relevant base-vs-post-trained comparison.
2. Then learn vectors: Read Representation Engineering, ITI, and Function Vectors.
3. Then add localization: Read Skill Neurons, ROME, and MEMIT.
4. Finish with caveats: Read Toy Models of Superposition and Anthropic’s monosemanticity work to understand why clean modules are hard.

Bottom line

The deepest open question is whether these internal structures are stable “organs” or context-dependent activation patterns. Current evidence suggests both: reusable circuits and directions exist, but they are entangled, distributed, and superposed rather than clean folders named /logic, /math, /coding, or /ethics.