LLM Configuration Implications

Configuring an LLM well can matter a great deal, even though the topic has few direct applications.

BobboThe2nd

This post will describe how to configure a transformer.

Reasoning About Scale

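Tuning an LLM requires reasoning about how each layer interacts with the others. A transformer can be defined by five constants:

const VOCAB;    // vocabulary size: number of distinct tokens
const SEQ;      // sequence length: maximum number of tokens per input
const D_MODEL;  // model width: size of each token's vector
const D_FF;     // hidden width of the feedforward layers
const DEPTH;    // number of stacked transformer layers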

For what each constant does inside a transformer, read the Basic Transformer Architecture post. Extensive research in this field shows that these constants should be chosen carefully. Some common choices are:

const VOCAB = 36000-65000;
const SEQ = 512-4096;
const D_MODEL = 768-12288;
const D_FF = 4 * D_MODEL;
const DEPTH = 12-96;

There are some exceptions, such as MoE (Mixture of Experts) models and extremely long sequence lengths. MoE can increase D_FF to more than 32 * D_MODEL, while SEQ has been increased to more than 128000.
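As a rough illustration, the five constants can be bundled into one configuration object. This is a minimal sketch in Python: the class name `TransformerConfig`, its field names, and the example values (picked from inside the ranges above) are assumptions for illustration, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransformerConfig:
    """Hypothetical container for the five constants described above."""
    vocab: int
    seq: int
    d_model: int
    depth: int
    d_ff: Optional[int] = None  # None means "use the 4 * d_model convention"

    def __post_init__(self):
        # Fall back to the common D_FF = 4 * D_MODEL choice when unset.
        if self.d_ff is None:
            self.d_ff = 4 * self.d_model


# Example values chosen from the ranges listed above (illustrative only).
config = TransformerConfig(vocab=50000, seq=2048, d_model=12288, depth=96)
print(config.d_ff)  # 49152
```

Passing `d_ff` explicitly overrides the default, which is how an exception such as a much wider MoE-style feedforward layer could be expressed in this sketch.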

Where do these numbers come from? That is where the reasoning comes in: each layer of a transformer has a different computational complexity.

  • Embedding: O(SEQ*VOCAB)
  • Attention: O(SEQ*D_MODEL^2)+O(SEQ^2*D_MODEL)
  • Feedforward: O(SEQ*D_MODEL*D_FF)
  • LayerNorm: O(SEQ*D_MODEL)
  • Output Projection: O(SEQ*D_MODEL*VOCAB)
  • Softmax: O(SEQ*VOCAB)

A few more invariants also apply:

  • Attention and FFN (Feedforward) are each used once in every one of the DEPTH layers
  • LayerNorm is used twice in every one of the DEPTH layers, and usually one last time at the end
  • Everything else runs only once per forward pass
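The complexities and invariants above can be combined into a rough cost model. The following Python sketch drops all constant factors and simply adds the big-O terms, so the numbers are only meaningful as ratios. The function names and the example values are assumptions made for illustration.

```python
def layer_costs(vocab, seq, d_model, d_ff):
    """Cost of one use of each layer type, using the big-O terms listed above."""
    return {
        "embedding": seq * vocab,
        "attention": seq * d_model**2 + seq**2 * d_model,
        "feedforward": seq * d_model * d_ff,
        "layernorm": seq * d_model,
        "output_projection": seq * d_model * vocab,
        "softmax": seq * vocab,
    }


def total_cost(vocab, seq, d_model, d_ff, depth):
    """Total cost of one forward pass under the invariants listed above."""
    c = layer_costs(vocab, seq, d_model, d_ff)
    # Per layer: attention once, feedforward once, LayerNorm twice.
    per_layer = c["attention"] + c["feedforward"] + 2 * c["layernorm"]
    # Once per forward pass: embedding, final LayerNorm, output projection, softmax.
    once = c["embedding"] + c["layernorm"] + c["output_projection"] + c["softmax"]
    return depth * per_layer + once


# Illustrative values only; doubling DEPTH less than doubles the total,
# because the one-time layers do not scale with DEPTH.
base = total_cost(vocab=50000, seq=1024, d_model=768, d_ff=3072, depth=96)
deeper = total_cost(vocab=50000, seq=1024, d_model=768, d_ff=3072, depth=192)
print(deeper / base)
```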

From these, a few conclusions can be drawn:

| Constant | Linear | Quadratic |
| --- | --- | --- |
| VOCAB | Embedding, output projection, softmax | None |
| SEQ | All layers | Attention |
| D_MODEL | Attention, feedforward, LayerNorm, output projection | Attention |
| D_FF | Feedforward | None |
| DEPTH | Attention, feedforward, LayerNorm | None |

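Additionally, if D_FF is a fixed multiple of D_MODEL (such as D_FF = 4 * D_MODEL), the two rows can be merged. The feedforward cost O(SEQ*D_MODEL*D_FF) then becomes O(SEQ*D_MODEL^2), so D_MODEL scales quadratically in the feedforward layers. The table can be rewritten as: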

| Constant | Linear | Quadratic |
| --- | --- | --- |
| VOCAB | Embedding, output projection, softmax | None |
| SEQ | All layers | Attention |
| D_MODEL | Attention, LayerNorm, output projection | Attention, feedforward |
| DEPTH | Attention, feedforward, LayerNorm | None |
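A quick numeric check of the feedforward claim, as a Python sketch: with D_FF tied to D_MODEL, doubling D_MODEL should quadruple the feedforward cost. The function name and the `d_ff_multiple` parameter are hypothetical, and constant factors are dropped.

```python
def feedforward_cost(seq, d_model, d_ff_multiple=4):
    """Feedforward cost O(SEQ * D_MODEL * D_FF) with D_FF = d_ff_multiple * D_MODEL."""
    d_ff = d_ff_multiple * d_model
    return seq * d_model * d_ff


base = feedforward_cost(seq=1024, d_model=768)
doubled = feedforward_cost(seq=1024, d_model=1536)
print(doubled / base)  # 4.0
```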

Relationships of Constants

Assuming everything stated previously holds, the relationships between constants can be determined. How does one constant affect the others? This section focuses on three pairs: SEQ and VOCAB, D_MODEL and SEQ, and D_MODEL and DEPTH.

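| Constants | VOCAB | D_MODEL/D_FF | DEPTH |
| --- | --- | --- | --- |
| SEQ | A larger VOCAB encodes the same text in fewer tokens, so longer literal text fits within the model's token-level sequence length. | Each appears squared in one attention term, O(SEQ*D_MODEL^2) and O(SEQ^2*D_MODEL), and linearly in the other. Scaling both by the same factor k therefore multiplies attention cost by k^3, a cubic effect. | - |
| D_MODEL/D_FF | - | - | Increasing D_MODEL in proportion to DEPTH offsets the loss of stability caused by increasing DEPTH. |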
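The joint effect of SEQ and D_MODEL on attention can be checked numerically from the attention complexity given earlier, O(SEQ*D_MODEL^2)+O(SEQ^2*D_MODEL). This Python sketch drops constant factors and uses arbitrary example values. Each term contains one constant squared and the other to the first power, so scaling both by k multiplies the cost by k^3.

```python
def attention_cost(seq, d_model):
    """Attention cost using the two big-O terms: projections plus score computation."""
    return seq * d_model**2 + seq**2 * d_model


base = attention_cost(seq=1024, d_model=768)
for k in (2, 4):
    scaled = attention_cost(seq=k * 1024, d_model=k * 768)
    print(k, scaled / base)  # the ratio equals k**3
```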

That is a summary of how constants interact with each other.