Basic Transformer Architecture
The simplest plausible transformer, built from high-level operations in a pre-LayerNorm layout.
This post describes a minimal transformer, the architecture at the core of LLMs.
Specifications
The model operates on short sequences in which each token corresponds to a group of characters. All tokens are represented as one-hot vectors and stored in matrices, padded on the right with padding tokens where necessary.
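As a rough illustration of the input format described above, here is a minimal Rust sketch that turns token ids into a [SEQ, VOCAB] one-hot matrix. The function name one_hot_padded and the pad_id parameter are hypothetical (the post does not say which id marks padding), and the sketch assumes padding is appended after the real tokens.

```rust
/// Encode token ids as one-hot rows, padding on the right up to `seq` rows.
/// `pad_id` is a hypothetical padding token id; the post does not define one.
fn one_hot_padded(tokens: &[usize], seq: usize, vocab: usize, pad_id: usize) -> Vec<Vec<f32>> {
    assert!(tokens.len() <= seq, "sequence is longer than SEQ");
    assert!(pad_id < vocab, "padding id must be inside the vocabulary");
    let mut rows = vec![vec![0.0_f32; vocab]; seq];
    for (i, row) in rows.iter_mut().enumerate() {
        // Real tokens come first; padding tokens fill the remaining positions.
        let id = tokens.get(i).copied().unwrap_or(pad_id);
        assert!(id < vocab, "token id out of range");
        row[id] = 1.0;
    }
    rows
}
```

Calling one_hot_padded(&[2, 0], 4, 3, 1) gives four rows: two for the real tokens and two copies of the padding row. With one-hot rows, a right matrix multiplication x @ W simply selects one row of W per token.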
Architecture
The model starts with a right matrix multiplication (x @ W) that projects VOCAB -> D_MODEL, then goes straight into the first transformer block. Each block follows a pre-LN scheme, as in this pseudocode with input x:
x = LayerNorm(x)
x += Attention(x)
x = LayerNorm(x)
x += FeedForward(x)
or, more preferably, so that the residual stream itself is never overwritten by its normalized value:
x += Attention(LayerNorm(x))
x += FeedForward(LayerNorm(x))
The Attention function can be single-head or multi-head self-attention and should always be causal. The FeedForward function should project D_MODEL -> D_FF, apply an activation function, and project back D_FF -> D_MODEL, where D_FF is a multiple of D_MODEL (usually 4 * D_MODEL). The activation can be GELU or Swish (SiLU). SwiGLU also works, but it is a gated unit and needs an extra D_MODEL -> D_FF projection.
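To make the preferred pre-LN update and the FeedForward shapes more concrete, here is a minimal Rust sketch on plain Vec<Vec<f32>> matrices. It is an illustration under stated assumptions, not the briny_ai implementation: the helper names (layer_norm, matmul, gelu, feed_forward, pre_ln_residual) and the weight names w_in and w_out are hypothetical, LayerNorm's learned gain and bias are omitted, and GELU uses the common tanh approximation.

```rust
/// Normalize each row to zero mean and unit variance.
/// The learned gain and bias of a full LayerNorm are omitted in this sketch.
fn layer_norm(x: &[Vec<f32>]) -> Vec<Vec<f32>> {
    const EPS: f32 = 1e-5;
    x.iter()
        .map(|row| {
            let n = row.len() as f32;
            let mean = row.iter().sum::<f32>() / n;
            let var = row.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
            let inv_std = 1.0 / (var + EPS).sqrt();
            row.iter().map(|v| (v - mean) * inv_std).collect::<Vec<f32>>()
        })
        .collect()
}

/// Row-major matrix product: [n, k] @ [k, m] -> [n, m].
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    a.iter()
        .map(|row| {
            (0..b[0].len())
                .map(|j| row.iter().zip(b.iter()).map(|(x, b_row)| x * b_row[j]).sum::<f32>())
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// GELU activation, tanh approximation.
fn gelu(v: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * v * (1.0 + (c * (v + 0.044715 * v * v * v)).tanh())
}

/// FeedForward: D_MODEL -> D_FF, activation, D_FF -> D_MODEL (biases omitted).
fn feed_forward(x: &[Vec<f32>], w_in: &[Vec<f32>], w_out: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut hidden = matmul(x, w_in);
    for row in hidden.iter_mut() {
        for v in row.iter_mut() {
            *v = gelu(*v);
        }
    }
    matmul(&hidden, w_out)
}

/// One pre-LN residual update: x += sublayer(LayerNorm(x)).
/// The residual stream `x` is only added to, never replaced by its normalized value.
fn pre_ln_residual<F>(x: &mut [Vec<f32>], sublayer: F)
where
    F: Fn(&[Vec<f32>]) -> Vec<Vec<f32>>,
{
    let normed = layer_norm(x);
    let update = sublayer(&normed[..]);
    for (row, upd) in x.iter_mut().zip(update.iter()) {
        for (v, u) in row.iter_mut().zip(upd.iter()) {
            *v += *u;
        }
    }
}
```

A FeedForward step would then read pre_ln_residual(&mut x, |h| feed_forward(h, &w_in, &w_out)), with w_in shaped [D_MODEL, D_FF] and w_out shaped [D_FF, D_MODEL]. An attention sublayer could be passed in the same way as a closure.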
The transformer finishes with one last LayerNorm, a second right matrix multiplication projecting D_MODEL -> VOCAB, and a softmax that normalizes the logits into probabilities. It is trained with an Adam or SGD optimizer and a cross-entropy loss that ignores padding tokens and zeroes their gradients.
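The loss described above can be sketched in Rust as follows. This is a minimal illustration, not the cross_entropy_loss used by briny_ai: the function names and the pad_id parameter are hypothetical, and the sketch assumes the loss receives raw logits and applies the softmax itself, averaging over non-padding positions only.

```rust
/// Numerically stable softmax over one row of logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Mean cross-entropy over non-padding positions, plus gradients w.r.t. the logits.
/// Rows whose target equals `pad_id` add no loss and get an all-zero gradient.
fn masked_cross_entropy(
    logits: &[Vec<f32>],
    targets: &[usize],
    pad_id: usize,
) -> (f32, Vec<Vec<f32>>) {
    // Number of real (non-padding) positions; at least 1 to avoid dividing by zero.
    let real = targets.iter().filter(|&&t| t != pad_id).count().max(1) as f32;
    let mut loss = 0.0_f32;
    let mut grads: Vec<Vec<f32>> = Vec::with_capacity(logits.len());
    for (row, &target) in logits.iter().zip(targets.iter()) {
        if target == pad_id {
            grads.push(vec![0.0_f32; row.len()]);
            continue;
        }
        let mut probs = softmax(row);
        loss -= probs[target].ln();
        // For softmax followed by cross-entropy, d(loss)/d(logits) = probs - one_hot(target).
        probs[target] -= 1.0;
        grads.push(probs.iter().map(|p| p / real).collect::<Vec<f32>>());
    }
    (loss / real, grads)
}
```

Because padding rows return a zero gradient, nothing flows back from those positions into the rest of the model.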
The architecture can be defined by these constants:
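const VOCAB; // vocabulary size (number of distinct tokens)
const SEQ; // sequence length in tokens
const D_MODEL; // width of the model's feature vectors
const D_FF; // hidden width of each FeedForward
const DEPTH; // number of transformer blocks
To estimate the parameter count from these, you can use the formula (VOCAB * D_MODEL) + (DEPTH * ((4 * D_MODEL * D_MODEL) + (2 * D_MODEL * D_FF))). This counts one VOCAB x D_MODEL projection plus, per block, four D_MODEL x D_MODEL matrices (presumably the attention projections) and two FeedForward matrices. It leaves out the second projection between D_MODEL and VOCAB (another VOCAB * D_MODEL unless the two projections share weights), the LayerNorm parameters, and any biases.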
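The formula above can be written as a small Rust function. This is only a sketch: the function name param_count and the example sizes in the note below are hypothetical, and the labels in the comments reflect one reading of the terms (four attention matrices and two FeedForward matrices per block).

```rust
/// Approximate parameter count from the formula in this post:
/// VOCAB * D_MODEL + DEPTH * (4 * D_MODEL^2 + 2 * D_MODEL * D_FF).
fn param_count(vocab: usize, d_model: usize, d_ff: usize, depth: usize) -> usize {
    let embed = vocab * d_model; // VOCAB -> D_MODEL projection
    let attention = 4 * d_model * d_model; // assumed: Q, K, V and output projections
    let feed_forward = 2 * d_model * d_ff; // D_MODEL -> D_FF and D_FF -> D_MODEL
    embed + depth * (attention + feed_forward)
}
```

For example, with the hypothetical sizes VOCAB = 256, D_MODEL = 64, D_FF = 256 and DEPTH = 2, param_count(256, 64, 256, 2) returns 114688. SEQ does not appear because none of the counted weights depend on the sequence length.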
Implementation
The only implementation provided here is a Rust macro invocation using the briny_ai library:
static_model!(
    @loss cross_entropy_loss
    @optim Adam(0.001)
    @model SeqTransformer(model)
    {
        InputLayer([SEQ, VOCAB]),
        {
            // project vocab to features
            embed: Collapse([VOCAB, D_MODEL]) => CollapseLayer,
            // transformer 0
            ln0: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            attn0: Residual([D_MODEL, D_MODEL], <CausalSelfAttention>) => ResidualLayer(a[SEQ, SEQ], [SEQ, D_MODEL]),
            ln1: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            ff0: Residual([D_MODEL, D_FF], <FeedForward>, GELU) => ResidualLayer(a[SEQ, D_FF], [SEQ, D_MODEL]),
            // transformer 1
            ln2: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            attn1: Residual([D_MODEL, D_MODEL], <CausalSelfAttention>) => ResidualLayer(a[SEQ, SEQ], [SEQ, D_MODEL]),
            ln3: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            ff1: Residual([D_MODEL, D_FF], <FeedForward>, Swish) => ResidualLayer(a[SEQ, D_FF], [SEQ, D_MODEL]),
            // final normalization
            ln4: LayerNorm([1, D_MODEL]) => LayerNormLayer,
            // project features back
            extract: Collapse([D_MODEL, VOCAB]) => CollapseLayer,
            soft: Softmax([SEQ, VOCAB](1.5, 0.2)) => SoftmaxLayer,
        },
        OutputLayer([SEQ, VOCAB]),
    }
);
This implementation has only 2 transformer blocks (DEPTH = 2), but the blocks can simply be copy-pasted to build deeper models. Training stability tends to decrease as the number of layers grows, so if a deeper model is required, D_MODEL should be increased as well.