Do LLMs Grow Internal “Modules” During Post‑Training?

A more precise vocabulary

The intuitive labels — logic module, math module, ethics module — are useful for conversation. Mechanistically, the evidence suggests several different kinds of internal structure.

Features: Directions or patterns in activation space corresponding to concepts, entities, behaviors, or contexts.

Circuits: Interacting components, such as attention heads and MLPs, that implement a specific computation.

Task / function vectors: Compact internal representations of “what task is being demonstrated here?”

Skill neurons: Neurons whose activations are predictive or causal for task performance, though often embedded in larger distributed systems.

Behavior directions: Steerable population-level directions for truthfulness, refusal, harmlessness, confidence, or style.

Superposition: Many features packed into shared neurons/dimensions, making clean modular boundaries hard to recover.

Human phrase → mechanistic-ish counterpart

Human phrase	More careful internal interpretation
“Math module”	Reasoning circuits, scratchpad-like token trajectories, algorithmic features, attention/MLP subnetworks, and learned selection policies.
“Coding module”	Code-token syntax features, API/library knowledge, program-structure representations, and reasoning patterns learned from code corpora.
“Ethics / safety module”	Refusal and harmlessness directions, policy-conditioned behavioral gates, and post-training-induced response preferences.
“Truthfulness module”	Latent truth/factuality directions plus readout/calibration mechanisms that determine whether the model says what it internally represents.
“Personality / style module”	Highly steerable activation-space directions and learned response policies for tone, verbosity, deference, etc.

Curated paper map

Grouped by what each line of work tells us about “modules” growing or becoming accessible inside LLMs.

1. Post‑training reshapes internal representations

How Post‑Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence (2025)

The most directly relevant paper here. It compares base and post-trained models and studies internal changes in knowledge, truthfulness, refusal, and confidence. It reports that truthfulness and refusal can be represented by vectors in hidden space, while factual knowledge locations are not simply overwritten.

LLM Post‑Training: A Deep Dive into Reasoning Large Language Models (2025)

A broad survey of SFT, RL, test-time scaling, reasoning post-training, and alignment. Useful background for how post-training changes model behavior, though less mechanistic than the papers below.

Making Large Language Models Better Reasoners with Alignment (2023)

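Shows that fine-tuning on chain-of-thought data improves reasoning behavior but leaves an “assessment misalignment” problem: the fine-tuned model often assigns higher scores to subpar reasoning chains. The paper proposes an alignment fine-tuning step to address this. Good reminder that better outputs do not necessarily mean cleaner internal cognition.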

2. High‑level behavior can appear as activation directions

Representation Engineering: A Top‑Down Approach to AI Transparency (2023)

Frames interpretability around population-level representations rather than individual neurons. Demonstrates monitoring/manipulation of safety-relevant phenomena such as honesty, harmlessness, and power-seeking.

Inference‑Time Intervention: Eliciting Truthful Answers from a Language Model (2023)

Finds directions across selected attention heads that improve TruthfulQA performance when applied at inference time. Strong evidence that “truthfulness” is partly encoded in steerable internal directions.
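To make “steerable direction” concrete, here is a minimal sketch of the general idea, not the paper's exact procedure (which selects attention heads with linear probes and intervenes only there). It assumes a toy three-dimensional activation space with made-up data; the names `truthful`, `untruthful`, `steer`, and `alpha` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def difference_of_means(positive, negative):
    """Vector pointing from the mean 'negative' activation to the mean 'positive' one."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def normalize(v):
    """Rescale a vector to unit length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def projection(activation, direction):
    """Scalar component of an activation along a direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Shift an activation by alpha units along a unit-length direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy activations (assumed data, not taken from any real model).
truthful = [[1.0, 0.2, 0.0], [0.8, -0.2, 0.1], [1.2, 0.0, -0.1]]
untruthful = [[-1.0, 0.1, 0.0], [-0.8, -0.1, 0.1], [-1.2, 0.0, -0.1]]

direction = normalize(difference_of_means(truthful, untruthful))

h = [-0.5, 0.3, 0.2]                       # an activation to intervene on
h_steered = steer(h, direction, alpha=2.0)

print(projection(h, direction), projection(h_steered, direction))
```

Because `direction` has unit length, the projection of the steered activation onto it grows by exactly `alpha`, while components orthogonal to the direction are left unchanged.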

Discovering Latent Knowledge in Language Models Without Supervision (2022)

Finds truth-like directions in activations using logical consistency properties, without labeled outputs. Important for the idea that models can internally represent facts they do not reliably say.
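The logical consistency idea can be sketched as a loss over probe outputs for a statement and its negation: the two probabilities should sum to one, and the probe should not sit on the fence at 0.5 for both. The sketch below shows only that objective; the probe itself, the activation normalization the method relies on, and the optimization loop are omitted, and the function names are hypothetical.

```python
def ccs_pair_loss(p_pos, p_neg):
    """Unsupervised loss for one contrast pair.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that its negation is true.
    """
    # Consistency: P(statement) should equal 1 - P(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def ccs_loss(pairs):
    """Mean loss over a list of (p_pos, p_neg) pairs."""
    return sum(ccs_pair_loss(p, q) for p, q in pairs) / len(pairs)


print(ccs_pair_loss(1.0, 0.0))  # consistent and confident
print(ccs_pair_loss(0.5, 0.5))  # consistent but uninformative
print(ccs_pair_loss(0.9, 0.9))  # confident but inconsistent
```

No labels appear anywhere in the loss, which is why a direction found this way can be read as something the model represents rather than something a supervised probe was told to find.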

Function Vectors in Large Language Models (2023)

Shows that in-context input-output tasks can be represented as compact vectors transported by a small number of attention heads. This is one of the clearest “module-like” findings: a task can become a reusable internal object.

3. Skill neurons and factual-memory localization

Finding Skill Neurons in Pre‑trained Transformer‑based Language Models (2022)

Finds task-specific neurons after prompt tuning and shows perturbing them hurts task performance. Important caveat: the authors argue these structures are mostly generated in pretraining rather than fine-tuning.

Neuron‑Level Knowledge Attribution in Large Language Models (2023)

Attempts neuron-level attribution for predictions. Useful for the localization question: are skills and knowledge stored in identifiable neurons, or only in distributed populations?

Locating and Editing Factual Associations in GPT (2022)

The ROME paper. Finds that middle-layer MLP computations mediate factual associations and can be directly edited. Good evidence for semi-local factual-memory mechanisms.
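The “directly edited” part can be illustrated with a rank-one update to a linear map treated as a key-value memory. This is a simplified sketch under a strong assumption: it treats all keys as equally weighted (identity key covariance), whereas ROME weights the update by key statistics and obtains the target value through optimization. The matrix, key, and value below are toy numbers.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, key, new_value):
    """Return W' = W + (new_value - W key) key^T / (key^T key).

    After the edit, W' maps `key` exactly to `new_value`, and any input
    orthogonal to `key` is mapped exactly as before.
    """
    residual = [v - o for v, o in zip(new_value, matvec(W, key))]
    key_sq = sum(k * k for k in key)
    return [
        [w + r * k / key_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]


# Toy 2x3 "memory": maps 3-d keys to 2-d values.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
key = [1.0, 0.0, 0.0]        # stands in for a subject representation
other_key = [0.0, 1.0, 0.0]  # an unrelated, orthogonal key
new_value = [0.0, 5.0]       # the association we want to write

W_edited = rank_one_edit(W, key, new_value)
print(matvec(W_edited, key))        # the edited association
print(matvec(W_edited, other_key))  # unchanged for the orthogonal key
```

The edit is local in the sense that it changes the output for one key direction only, which is the property that makes the “semi-local factual memory” reading plausible.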

Mass‑Editing Memory in a Transformer (2022)

MEMIT extends direct model editing to many factual memories. Helpful for thinking about where “knowledge structures” live and how localized they are.

4. Features, circuits, and why modules are messy

Toy Models of Superposition (2022)

Explains why many concepts can be packed into shared neurons or dimensions. This is the central reason “modules” are hard to find cleanly: internal representations are often superposed and polysemantic.
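A toy illustration of the packing problem, assuming three features squeezed into two dimensions as directions 120 degrees apart and read back out with a ReLU. The setup is chosen for illustration and is not a reproduction of the paper's experiments.

```python
import math


def feature_directions(n):
    """n unit vectors spread evenly around the circle (2-d embedding space)."""
    return [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


def encode(features, directions):
    """Superpose feature values into one 2-d activation: x = sum_i f_i * d_i."""
    x0 = sum(f * d[0] for f, d in zip(features, directions))
    x1 = sum(f * d[1] for f, d in zip(features, directions))
    return (x0, x1)


def decode(x, directions):
    """Read each feature back with a dot product followed by ReLU."""
    return [max(0.0, x[0] * d[0] + x[1] * d[1]) for d in directions]


directions = feature_directions(3)

# One active feature: interference on the others is negative, so ReLU removes it.
sparse_readout = decode(encode([1.0, 0.0, 0.0], directions), directions)

# Two active features: they interfere and each is recovered at half strength.
dense_readout = decode(encode([1.0, 1.0, 0.0], directions), directions)

print(sparse_readout)
print(dense_readout)
```

With one active feature the readout is clean; with two, each recovered value drops to about 0.5. Superposition works well when features are sparse and degrades as more of them fire together, which is one reason a single neuron or dimension rarely maps to a single concept.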

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)

Uses sparse autoencoders to decompose messy activations into more interpretable features. A major step toward mapping the “feature ecology” inside models.
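For readers new to sparse autoencoders, the sketch below shows the basic shape of the computation: an overcomplete ReLU encoder, a linear decoder, and a loss that trades reconstruction error against an L1 sparsity penalty. The weights are set by hand rather than trained, the dimensions are tiny, and details of the paper's actual architecture and training setup are not reproduced; all names and values are assumptions for illustration.

```python
import math


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into sparse feature activations, then reconstruct it."""
    features = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    recon = [
        sum(W_dec[j][i] * features[i] for i in range(len(features))) + b_dec[j]
        for j in range(len(x))
    ]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hand-set dictionary: 3 feature directions in a 2-d activation space.
dirs = [(math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3)) for i in range(3)]
W_enc = [list(d) for d in dirs]                        # 3 x 2
W_dec = [[d[j] for d in dirs] for j in range(2)]       # 2 x 3 (columns are directions)
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [1.0, 0.0]  # an activation lying along the first dictionary direction
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)

print(features, recon, loss)
```

Here only one of the three dictionary features fires and the input is reconstructed exactly, so the entire loss comes from the sparsity term. In a trained autoencoder the dictionary directions are learned, and the hope is that each one corresponds to a more interpretable feature than any single neuron does.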

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (2024)

Scales sparse-autoencoder feature extraction to a production-scale model. Finds many meaningful features for entities, concepts, contexts, and behaviors.

In‑context Learning and Induction Heads (2022)

A concrete example of an emergent transformer circuit supporting in-context pattern copying and induction. Useful as a clear case where a real computation was reverse-engineered.
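The behavior attributed to induction heads can be stated as a simple rule: find an earlier occurrence of the current token and predict whatever followed it. The sketch below implements that rule on a token list. It describes the input-output behavior only, not the attention mechanism that realizes it inside a transformer, and the function name is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token.

    Returns None when the current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # "A" was followed by "B" earlier
print(induction_predict(["x", "y"]))            # no earlier "y", so no prediction
```

The rule works for arbitrary tokens, including ones never seen together in training, which is why this circuit is often discussed as a building block of in-context learning.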

Challenges in Mechanistically Interpreting Model Representations (2024)

A cautionary paper: many safety-relevant behaviors are not simple token-aligned circuits. It argues for representation-level interpretability alongside circuit-level work.

What post‑training probably does

My synthesis: post-training usually does not create a clean new “ethics module” or “math module” from scratch. It changes which latent structures are easy to activate, which outputs are preferred, and which internal signals are routed into the final answer.

Domain	Likely story
Math / coding	Mostly pretrained capabilities plus post-training that teaches better deployment, step-by-step behavior, and selection among strategies.
Reasoning	Post-training can encourage longer trajectories, self-checking patterns, and better use of latent strategies; not necessarily a single reasoning organ.
Ethics / refusal / helpfulness	More post-training-shaped: behavioral control directions, refusal gates, preference policies, and instruction-following norms.
Truthfulness / confidence	Partly latent in pretrained representations; post-training changes readout, calibration, and whether the model says what it internally represents.
Style / personality	Often highly steerable and vector-like: tone, verbosity, politeness, and role behavior can be shifted by prompts or activation interventions.

Bottom line

The deepest open question is whether these internal structures are stable “organs” or context-dependent activation patterns. Current evidence suggests both: reusable circuits and directions exist, but they are entangled, distributed, and superposed rather than clean folders named /logic, /math, /coding, or /ethics.
