LLM Configuration Implications
How an LLM is configured matters, even though the configuration itself has few direct applications.
This post will describe how to configure a transformer.
Reasoning About Scale
Tuning an LLM requires reasoning about how each layer interacts with the other layers. Transformers can be defined by five constants:
const VOCAB;   // vocabulary size (number of distinct tokens)
const SEQ;     // sequence length, in tokens
const D_MODEL; // width of each token's embedding vector
const D_FF;    // hidden width of the feedforward layers
const DEPTH;   // number of transformer layers
To know what these constants do to a transformer, read the Basic Transformer Architecture post. Extensive research in this field has shown that these constants should be chosen carefully. Some common choices include:
const VOCAB = 36000-65000;
const SEQ = 512-4096;
const D_MODEL = 768-12288;
const D_FF = 4 * D_MODEL;
const DEPTH = 85-116;
There are some exceptions, like MoE (Mixture of Experts) and extremely high sequence lengths. MoE can increase D_FF to more than 32 * D_MODEL, while SEQ has been increased to more than 128000.
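As a rough illustration, the ranges listed above can be captured in a small configuration object. This is only a sketch: the class name, the field names, the `ff_mult` parameter, and the example values are made up for this post, and only the ranges themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical container for the five constants (names are illustrative)."""
    vocab: int        # VOCAB
    seq: int          # SEQ
    d_model: int      # D_MODEL
    depth: int        # DEPTH
    ff_mult: int = 4  # D_FF is derived as ff_mult * D_MODEL

    @property
    def d_ff(self) -> int:
        return self.ff_mult * self.d_model


# The "common decisions" quoted above, as inclusive (low, high) ranges.
COMMON_RANGES = {
    "vocab": (36000, 65000),
    "seq": (512, 4096),
    "d_model": (768, 12288),
    "depth": (85, 116),
}


def outside_common_ranges(cfg: TransformerConfig) -> list:
    """Return the names of fields that fall outside the common ranges."""
    return [
        name
        for name, (low, high) in COMMON_RANGES.items()
        if not low <= getattr(cfg, name) <= high
    ]


# Example values picked arbitrarily from inside the ranges.
example = TransformerConfig(vocab=50000, seq=2048, d_model=4096, depth=96)
# A long-context variant, which is one of the exceptions mentioned above.
long_context = TransformerConfig(vocab=50000, seq=128000, d_model=4096, depth=96)
```

Calling `outside_common_ranges(long_context)` flags only `seq`, which matches the long-sequence exception described above. An MoE model would instead raise `ff_mult`.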
Where did these numbers come from, though? That's the reasoning part: each layer of a transformer has a different computational complexity.
- Embedding: O(SEQ*VOCAB)
- Attention: O(SEQ*D_MODEL^2)+O(SEQ^2*D_MODEL)
- Feedforward: O(SEQ*D_MODEL*D_FF)
- LayerNorm: O(SEQ*D_MODEL)
- Output Projection: O(SEQ*D_MODEL*VOCAB)
- Softmax: O(SEQ*VOCAB)
There are also a few more invariants:
- Attention and FFN (Feedforward) each run once in each of the DEPTH layers
- LayerNorm runs twice in each of the DEPTH layers, and usually one last time at the end
- Everything else runs only once per batch
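The complexities and invariants above can be combined into a rough cost model. The sketch below drops all constant factors, so the numbers are only useful for comparing layers against each other, not for estimating real FLOPs. The function names are invented for this example.

```python
def layer_costs(vocab, seq, d_model, d_ff):
    """Cost of one call to each layer type, following the complexity list."""
    return {
        "embedding": seq * vocab,
        "attention": seq * d_model**2 + seq**2 * d_model,
        "feedforward": seq * d_model * d_ff,
        "layernorm": seq * d_model,
        "output_projection": seq * d_model * vocab,
        "softmax": seq * vocab,
    }


def calls_per_pass(depth):
    """How many times each layer type runs, following the invariants."""
    return {
        "embedding": 1,
        "attention": depth,
        "feedforward": depth,
        "layernorm": 2 * depth + 1,  # twice per layer, plus one at the end
        "output_projection": 1,
        "softmax": 1,
    }


def total_costs(vocab, seq, d_model, d_ff, depth):
    """Per-layer-type cost for one pass through the whole model."""
    costs = layer_costs(vocab, seq, d_model, d_ff)
    calls = calls_per_pass(depth)
    return {name: costs[name] * calls[name] for name in costs}


# A deliberately tiny configuration so the arithmetic is easy to follow.
tiny = total_costs(vocab=1000, seq=8, d_model=16, d_ff=64, depth=2)
```

Even in this tiny configuration, the output projection dominates because VOCAB is so much larger than the other constants. At realistic sizes, the attention and feedforward terms multiplied by DEPTH take over.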
From these, some conclusions can be drawn:
| Constant | Linear | Quadratic |
| --- | --- | --- |
| VOCAB | Embedding, output projection, softmax | None |
| SEQ | All layers | Attention |
| D_MODEL | Attention, feedforward, LayerNorm, output projection | Attention |
| D_FF | Feedforward | None |
| DEPTH | Attention, feedforward, LayerNorm | None |
Additionally, if D_FF is a fixed multiple of D_MODEL, the two constants can be combined: O(SEQ*D_MODEL*D_FF) becomes O(SEQ*D_MODEL^2), so D_MODEL scales quadratically in feedforward layers. The table could then be rewritten as:
| Constant | Linear | Quadratic |
| --- | --- | --- |
| VOCAB | Embedding, output projection, softmax | None |
| SEQ | All layers | Attention |
| D_MODEL | Attention, LayerNorm, output projection | Attention, feedforward |
| DEPTH | Attention, feedforward, LayerNorm | None |
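To see why tying D_FF to D_MODEL changes the scaling, here is a minimal sketch that compares the feedforward cost with a fixed D_FF against the cost with D_FF = 4 * D_MODEL. The function names and the specific sizes are assumptions made for the example.

```python
def feedforward_cost(seq, d_model, d_ff):
    """Feedforward cost from the complexity list, constants dropped."""
    return seq * d_model * d_ff


def tied_feedforward_cost(seq, d_model, ff_mult=4):
    """Same cost, but with D_FF derived as ff_mult * D_MODEL."""
    return feedforward_cost(seq, d_model, ff_mult * d_model)


# Doubling D_MODEL while D_FF stays fixed: the cost doubles (linear).
fixed_ratio = feedforward_cost(512, 2048, 4096) // feedforward_cost(512, 1024, 4096)

# Doubling D_MODEL while D_FF follows it: the cost quadruples (quadratic).
tied_ratio = tied_feedforward_cost(512, 2048) // tied_feedforward_cost(512, 1024)
```

Here `fixed_ratio` is 2 and `tied_ratio` is 4, which is the shift from the Linear column to the Quadratic column described above.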
Relationships of Constants
Assuming everything stated previously holds, the relationships between the constants can be determined. How does one constant affect the others? This section focuses on the relationships between SEQ and VOCAB, D_MODEL and SEQ, and D_MODEL and DEPTH.
| Constants | Relationship |
| --- | --- |
| SEQ and VOCAB | Increasing VOCAB lets longer literal sequences fit into the model's token-level sequence length. |
| D_MODEL and SEQ | Each has a quadratic effect in one attention term (SEQ*D_MODEL^2 and SEQ^2*D_MODEL), so increasing both by the same factor has a roughly cubic effect on complexity. |
| D_MODEL and DEPTH | As |
That is a summary of how constants interact with each other.
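The quadratic effects of SEQ and D_MODEL can be checked directly against the two attention terms from the complexity list. This is a small sketch with constant factors dropped. The function name and the sizes are chosen only for illustration.

```python
def attention_terms(seq, d_model):
    """Return the two attention cost terms separately."""
    projections = seq * d_model**2  # O(SEQ*D_MODEL^2)
    scores = seq**2 * d_model       # O(SEQ^2*D_MODEL)
    return projections, scores


base_proj, base_scores = attention_terms(seq=1024, d_model=768)
long_proj, long_scores = attention_terms(seq=2048, d_model=768)   # SEQ doubled
wide_proj, wide_scores = attention_terms(seq=1024, d_model=1536)  # D_MODEL doubled

# Doubling SEQ quadruples the score term but only doubles the projections.
seq_effect = (long_proj // base_proj, long_scores // base_scores)

# Doubling D_MODEL quadruples the projections but only doubles the score term.
width_effect = (wide_proj // base_proj, wide_scores // base_scores)
```

Which term dominates depends on whether SEQ or D_MODEL is larger, so the constant that hurts most when increased differs between short-context and long-context configurations.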