Basic Transformer Architecture

The simplest plausible transformer architecture, built from high-level operations with a pre-LayerNorm layout.

BobboThe2nd

This post describes a minimal transformer architecture, the kind of network at the core of LLMs.

Specifications

The model operates on a short sequence of tokens, where each token corresponds to a group of characters. Each token is represented as a one-hot vector of length VOCAB, and a sequence is stored as a [SEQ, VOCAB] matrix, with right-aligned padding tokens where necessary.
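As a rough illustration of that input format, here is a minimal sketch of one-hot encoding in plain Rust. The function name `one_hot`, the `pad_id` parameter, and the choice to place padding after the real tokens (and to truncate overlong input) are assumptions made for this example, not something the post specifies.

```rust
/// Encode token ids as a [seq, vocab] one-hot matrix, one row per position.
/// Assumption: sequences shorter than `seq` are filled with `pad_id` after
/// the real tokens; sequences longer than `seq` are truncated.
fn one_hot(tokens: &[usize], seq: usize, vocab: usize, pad_id: usize) -> Vec<Vec<f32>> {
    assert!(pad_id < vocab, "pad_id must be a valid token id");
    let mut out = vec![vec![0.0f32; vocab]; seq];
    for (pos, row) in out.iter_mut().enumerate() {
        // Use the real token if one exists at this position, else padding.
        let id = tokens.get(pos).copied().unwrap_or(pad_id);
        assert!(id < vocab, "token id out of range");
        row[id] = 1.0;
    }
    out
}
```

Because every row contains a single 1.0, multiplying such a matrix by a [VOCAB, D_MODEL] weight matrix just selects one weight row per position.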

Architecture

The model starts with a right matrix multiplication (x @ W) that projects VOCAB -> D_MODEL, then goes straight into the first transformer block. Each block applies LayerNorm before each sublayer, as in this pseudocode with input x:

x = LayerNorm(x)
x += Attention(x)
x = LayerNorm(x)
x += FeedForward(x)

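or, preferably, the standard pre-LN form, which normalizes only the input to each sublayer and leaves the residual stream itself untouched: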

x += Attention(LayerNorm(x))
x += FeedForward(LayerNorm(x))
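To make the second form concrete, here is a small sketch of one residual step in plain Rust. The names `layer_norm` and `pre_ln_step` are hypothetical, the normalization has no learned scale or shift, and the epsilon of 1e-5 is an assumed value; `sublayer` stands in for either Attention or FeedForward.

```rust
/// Normalize one row to zero mean and unit variance.
/// Assumption: no learned scale or shift, epsilon fixed at 1e-5.
fn layer_norm(row: &[f32]) -> Vec<f32> {
    let n = row.len() as f32;
    let mean = row.iter().sum::<f32>() / n;
    let var = row.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + 1e-5).sqrt();
    row.iter().map(|v| (v - mean) * inv_std).collect()
}

/// One pre-LN residual step: x += sublayer(LayerNorm(x)).
/// Only the sublayer sees normalized values; x itself is never overwritten.
fn pre_ln_step<F>(x: &mut [Vec<f32>], sublayer: F)
where
    F: Fn(&[Vec<f32>]) -> Vec<Vec<f32>>,
{
    let normed: Vec<Vec<f32>> = x.iter().map(|row| layer_norm(row)).collect();
    let update = sublayer(&normed);
    for (row, upd) in x.iter_mut().zip(&update) {
        for (v, u) in row.iter_mut().zip(upd) {
            *v += u;
        }
    }
}
```

A full block would then be two calls to `pre_ln_step` on the same `x`, first with the attention sublayer and then with the feed-forward sublayer.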

The Attention function can be either single-head or multi-head self-attention and should always be causal. The FeedForward function should project D_MODEL -> D_FF, apply an activation function, and project back D_FF -> D_MODEL, where D_FF is a multiple of D_MODEL (usually 4 * D_MODEL). The activation can be GELU, Swish (SiLU), or SwiGLU; note that SwiGLU is a gated variant that needs a third weight matrix.
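The causal requirement can be sketched as follows for a single head. This is an illustration in plain Rust, not the library's implementation: the function name `causal_attention` is hypothetical, and it assumes the query, key, and value matrices have already been projected.

```rust
/// Single-head causal self-attention over already-projected inputs.
/// q, k, v are [seq, d] matrices; position i only attends to positions j <= i.
fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q[0].len();
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for i in 0..q.len() {
        // Score only positions 0..=i; later positions are masked out
        // simply by never being scored.
        let scores: Vec<f32> = (0..=i)
            .map(|j| q[i].iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Numerically stable softmax over the visible positions.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = exps.iter().sum();
        // Weighted sum of the visible value rows.
        let mut row = vec![0.0f32; v[0].len()];
        for (j, e) in exps.iter().enumerate() {
            for (o, val) in row.iter_mut().zip(&v[j]) {
                *o += (e / total) * val;
            }
        }
        out.push(row);
    }
    out
}
```

Skipping future positions entirely gives the same result as filling the upper triangle of the [SEQ, SEQ] score matrix with negative infinity before the softmax, which is the more common formulation in matrix-based code.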

The transformer finishes with one last LayerNorm, a projection D_MODEL -> VOCAB by another right matrix multiplication, and a softmax that normalizes the logits into probabilities. Training uses an Adam or SGD optimizer with a cross-entropy loss that ignores padding tokens and zeroes their gradients.
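One way to read "ignores padding tokens and zeroes their gradients" is sketched below in plain Rust. The function name `masked_cross_entropy` and the `pad_id` parameter are hypothetical, and the sketch assumes the softmax is folded into the loss and applied to raw logits, which is a common choice for numerical stability rather than something the post prescribes.

```rust
/// Mean cross-entropy over non-padding positions.
/// `logits` is [seq, vocab]; `targets` holds one token id per position.
/// Returns the loss and its gradient with respect to the logits, which is
/// zero on every padding row.
fn masked_cross_entropy(
    logits: &[Vec<f32>],
    targets: &[usize],
    pad_id: usize,
) -> (f32, Vec<Vec<f32>>) {
    // Number of positions that count toward the loss (at least 1 to avoid 0/0).
    let counted = targets.iter().filter(|&&t| t != pad_id).count().max(1) as f32;
    let mut loss = 0.0f32;
    let mut grad = vec![vec![0.0f32; logits[0].len()]; logits.len()];
    for ((row, &target), g) in logits.iter().zip(targets).zip(grad.iter_mut()) {
        if target == pad_id {
            continue; // padding: no loss, gradient stays zero
        }
        // Numerically stable softmax.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = row.iter().map(|l| (l - max).exp()).collect();
        let total: f32 = exps.iter().sum();
        for (j, e) in exps.iter().enumerate() {
            let p = e / total;
            if j == target {
                loss -= p.ln();
            }
            // d(loss)/d(logit_j) = (softmax_j - one_hot_j) / counted
            g[j] = (p - if j == target { 1.0 } else { 0.0 }) / counted;
        }
    }
    (loss / counted, grad)
}
```

Dividing by the number of non-padding positions, rather than by SEQ, keeps the loss scale independent of how much padding a sequence happens to contain.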

The architecture can be defined by these constants:

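const VOCAB;   // number of distinct tokens
const SEQ;     // sequence length in tokens
const D_MODEL; // feature width inside the transformer blocks
const D_FF;    // hidden width of the feed-forward layer
const DEPTH;   // number of transformer blocks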

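To estimate the parameter count from these, you can use the formula (2 * VOCAB * D_MODEL) + (DEPTH * ((4 * D_MODEL * D_MODEL) + (2 * D_MODEL * D_FF))). The first term covers the input and output projections (use VOCAB * D_MODEL instead if the two share one weight matrix), 4 * D_MODEL * D_MODEL covers the attention matrices, and 2 * D_MODEL * D_FF covers the two feed-forward matrices. Biases and LayerNorm parameters are not counted.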
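The same arithmetic as a small Rust helper, as a sketch only. The function name `param_count` is hypothetical, and the `tied_embeddings` flag is an assumption of this example: it chooses whether the VOCAB x D_MODEL matrix is counted once (input and output projections sharing weights) or twice (separate matrices).

```rust
/// Weight-matrix parameter count for the architecture described in this post.
/// Biases and LayerNorm parameters are not counted.
/// Assumption: `tied_embeddings` means the output projection reuses the
/// input projection's weights instead of having its own matrix.
fn param_count(
    vocab: usize,
    d_model: usize,
    d_ff: usize,
    depth: usize,
    tied_embeddings: bool,
) -> usize {
    let projections = if tied_embeddings { 1 } else { 2 };
    let embed = projections * vocab * d_model;
    // Four D_MODEL x D_MODEL matrices, usually query, key, value, and output.
    let attention = 4 * d_model * d_model;
    // Up projection (D_MODEL -> D_FF) and down projection (D_FF -> D_MODEL).
    let feed_forward = 2 * d_model * d_ff;
    embed + depth * (attention + feed_forward)
}
```

For example, with VOCAB = 256, D_MODEL = 64, D_FF = 256, and DEPTH = 2, each block contributes 16384 + 32768 = 49152 weights, giving 114688 in total with shared projections and 131072 with separate ones.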

Implementation

The only implementation provided here is a Rust macro invocation using the briny_ai library:

static_model!(
    @loss cross_entropy_loss
    @optim Adam(0.001)
    @model SeqTransformer(model)
    {
        InputLayer([SEQ, VOCAB]),
        {
            // project vocab to features
            embed: Collapse([VOCAB, D_MODEL]) => CollapseLayer,

            // transformer 0
            ln0: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            attn0: Residual([D_MODEL, D_MODEL], <CausalSelfAttention>) => ResidualLayer(a[SEQ, SEQ], [SEQ, D_MODEL]),
            ln1: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            ff0: Residual([D_MODEL, D_FF], <FeedForward>, GELU) => ResidualLayer(a[SEQ, D_FF], [SEQ, D_MODEL]),

            // transformer 1
            ln2: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            attn1: Residual([D_MODEL, D_MODEL], <CausalSelfAttention>) => ResidualLayer(a[SEQ, SEQ], [SEQ, D_MODEL]),
            ln3: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            ff1: Residual([D_MODEL, D_FF], <FeedForward>, Swish) => ResidualLayer(a[SEQ, D_FF], [SEQ, D_MODEL]),

            // final normalization
            ln4: LayerNorm([1, D_MODEL]) => LayerNormLayer,

            // project features back
            extract: Collapse([D_MODEL, VOCAB]) => CollapseLayer,
            soft: Softmax([SEQ, VOCAB](1.5, 0.2)) => SoftmaxLayer,
        },
        OutputLayer([SEQ, VOCAB]),
    }
);

This implementation has only 2 transformer blocks (DEPTH = 2), but the block can simply be copied and pasted to build deeper models. Training stability tends to decrease as the number of layers grows, so if a deeper model is required, D_MODEL should be increased as well.