LLM Configuration Implications

The configuration of LLMs can be important despite having few direct applications.

BobboThe2nd·May 29, 2026

This post will describe how to configure a transformer.

Reasoning About Scale

Configuring an LLM requires reasoning about how each layer interacts with the others. A transformer can be described by five constants:

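const VOCAB;    // vocabulary size (number of distinct tokens)
const SEQ;      // sequence length, in tokens
const D_MODEL;  // width of each token's hidden vector
const D_FF;     // width of the feedforward hidden layer
const DEPTH;    // number of transformer layers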

For what each constant does inside a transformer, see the Basic Transformer Architecture post. Extensive research in this field has shown that these constants should be chosen carefully. Some common choices include:

const VOCAB = 36000-65000;
const SEQ = 512-4096;
const D_MODEL = 768-12288;
const D_FF = 4 * D_MODEL;
const DEPTH = 12-96;
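To make the constants concrete, here is a minimal sketch of a configuration object in Python. The class name `TransformerConfig`, its field names, and the `ff_mult` parameter are illustrative and not taken from any library. The example values are the figures reported for the largest GPT-3 model and are used only as one point inside the ranges above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """The five constants from this post (names are illustrative)."""
    vocab: int        # VOCAB: number of distinct tokens
    seq: int          # SEQ: sequence length in tokens
    d_model: int      # D_MODEL: width of each token's hidden vector
    depth: int        # DEPTH: number of transformer layers
    ff_mult: int = 4  # D_FF is expressed as a multiple of D_MODEL

    @property
    def d_ff(self) -> int:
        # D_FF = 4 * D_MODEL by default, matching the common choice above.
        return self.ff_mult * self.d_model


# Example values as reported for the largest GPT-3 model.
gpt3_like = TransformerConfig(vocab=50_257, seq=2_048, d_model=12_288, depth=96)
print(gpt3_like.d_ff)
```

Deriving `d_ff` from `d_model` rather than storing it separately keeps the two from drifting apart when only one of them is changed.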

There are some exceptions, such as MoE (Mixture of Experts) models and models with very long sequence lengths. MoE can increase D_FF to more than 32 * D_MODEL, while SEQ has been increased to more than 128000.

Where do these numbers come from? This is where the reasoning comes in: each layer of a transformer has a different computational complexity.

Embedding: O(SEQ*VOCAB)
Attention: O(SEQ*D_MODEL^2+SEQ^2*D_MODEL)
Feedforward: O(SEQ*D_MODEL*D_FF)
LayerNorm: O(SEQ*D_MODEL)
Output Projection: O(SEQ*D_MODEL*VOCAB)
Softmax: O(SEQ*VOCAB)

There are also a few facts about how often each layer runs:

Attention and FFN (feedforward) each run once per layer, so DEPTH times in total
LayerNorm runs twice per layer, so 2 * DEPTH times, and usually once more at the end
Everything else runs only once per forward pass
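The complexities and run counts above can be combined into a rough cost estimate. The sketch below is only an illustration: the function name `cost_terms` is made up for this post, the terms are copied directly from the list above with all constant factors dropped, and the example values are arbitrary picks rather than a recommendation.

```python
def cost_terms(vocab, seq, d_model, d_ff, depth):
    """Multiply each component's big-O term by how often it runs."""
    per_call = {
        "embedding": seq * vocab,
        "attention": seq * d_model**2 + seq**2 * d_model,
        "feedforward": seq * d_model * d_ff,
        "layernorm": seq * d_model,
        "output_projection": seq * d_model * vocab,
        "softmax": seq * vocab,
    }
    calls = {
        "embedding": 1,
        "attention": depth,           # once per layer
        "feedforward": depth,         # once per layer
        "layernorm": 2 * depth + 1,   # twice per layer, plus a final one
        "output_projection": 1,
        "softmax": 1,
    }
    return {name: per_call[name] * calls[name] for name in per_call}


# Arbitrary example values.
terms = cost_terms(vocab=50_000, seq=2_048, d_model=12_288,
                   d_ff=4 * 12_288, depth=96)

# Print the components from most to least expensive.
for name, cost in sorted(terms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {cost:.3e}")
```

Because constant factors are dropped, the numbers are only useful for comparing components against each other, not as real operation counts.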

From this, some conclusions can be drawn:

Constant	Linear	Quadratic
`VOCAB`	Embedding, output projection, softmax	None
`SEQ`	All layers	Attention
`D_MODEL`	Feedforward, LayerNorm, output projection	Attention
`D_FF`	Feedforward	None
`DEPTH`	Attention, feedforward, LayerNorm	None

Additionally, if D_FF is a fixed multiple of D_MODEL (such as 4 * D_MODEL), the two constants can be combined. The feedforward cost then becomes O(SEQ*D_MODEL^2), so D_MODEL scales quadratically in feedforward layers. The table could be rewritten as:

Constant	Linear	Quadratic
`VOCAB`	Embedding, output projection, softmax	None
`SEQ`	All layers	Attention
`D_MODEL`	LayerNorm, output projection	Attention, feedforward
`DEPTH`	Attention, feedforward, LayerNorm	None
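The effect of tying D_FF to D_MODEL can be checked numerically. This is a small sketch with made-up helper names (`feedforward_cost`, `exponent`): it doubles D_MODEL and measures how fast the feedforward term grows, once with D_FF held fixed and once with D_FF = 4 * D_MODEL.

```python
import math


def feedforward_cost(seq, d_model, d_ff):
    # The feedforward term from the list above: O(SEQ * D_MODEL * D_FF).
    return seq * d_model * d_ff


def exponent(cost_at_x, cost_at_2x):
    # If cost grows like x**p, doubling x multiplies the cost by 2**p.
    return math.log2(cost_at_2x / cost_at_x)


seq, d_model = 1024, 768  # arbitrary example values

# D_FF held fixed while D_MODEL doubles.
fixed = exponent(feedforward_cost(seq, d_model, 3072),
                 feedforward_cost(seq, 2 * d_model, 3072))

# D_FF tied to 4 * D_MODEL, so it doubles along with D_MODEL.
tied = exponent(feedforward_cost(seq, d_model, 4 * d_model),
                feedforward_cost(seq, 2 * d_model, 4 * (2 * d_model)))

print(fixed, tied)
```

An exponent of 1 means linear growth and 2 means quadratic growth, which is why the feedforward entry moves to the quadratic column once D_FF is tied to D_MODEL.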

Relationships of Constants

Assuming everything stated previously holds, the relationships between the constants can be determined. How does one constant affect the others? This section focuses on the relationships between SEQ and VOCAB, D_MODEL and SEQ, and D_MODEL and DEPTH.

Constants	`VOCAB`	`D_MODEL`/`D_FF`	`DEPTH`
`SEQ`	A larger `VOCAB` lets each token cover more text, so longer raw text fits into the same token-level sequence length.	Both have quadratic terms, so scaling both by the same factor has a cubic effect on complexity.	-
`D_MODEL`/`D_FF`	-	-	Increasing `D_MODEL` in proportion to `DEPTH` compensates for the loss of stability that comes from increasing `DEPTH`.
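The interaction between SEQ and D_MODEL can be checked against the attention term from earlier, O(SEQ*D_MODEL^2+SEQ^2*D_MODEL). This is a minimal sketch with hypothetical helper names and arbitrary starting values. It scales both constants by the same factor k and reports how much the attention term grows.

```python
def attention_cost(seq, d_model):
    # The attention term from the complexity list, constants dropped.
    return seq * d_model**2 + seq**2 * d_model


def growth(k, seq=1024, d_model=768):
    """Factor by which the attention term grows when SEQ and D_MODEL both scale by k."""
    return attention_cost(k * seq, k * d_model) / attention_cost(seq, d_model)


for k in (2, 3, 4):
    print(k, growth(k))
```

Both parts of the attention term contain three factors drawn from SEQ and D_MODEL, so the growth factor works out to k cubed regardless of the starting values.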

That is a summary of how constants interact with each other.