llamalike_common
A common transformer family used by Llama, Mistral, Gemma, and other models.
This module implements a transformer variant with:
GLU-based MLPs (SwiGLU or GeGLU) as introduced by Shazeer (2020),
Optional multi-query (Shazeer, 2019) or grouped-query (Ainslie et al., 2023) attention,
Rotary positional embeddings (Su et al., 2021),
RMSNorm normalization (Zhang & Sennrich, 2019),
No biases in any dense kernels or layer norms.
This family includes many popular open-weights models, including Llama, Mistral, Gemma, and Reka. It is also similar to the PaLM model architecture (but without “parallel” feedforward and attention blocks).
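To make the listed ingredients concrete, here is a minimal plain-JAX sketch of RMSNorm and a bias-free SwiGLU feedforward. This is illustrative only: Penzai's actual blocks are built from its own named-axis layer types, and all shapes and names below are assumptions for exposition rather than this module's API.

```python
# Illustrative sketch only: a plain-JAX rendering of the ingredients above
# (RMSNorm and a SwiGLU feedforward with no bias terms). Not Penzai's
# implementation; dimensions and names are made up for exposition.
import jax
import jax.numpy as jnp


def rms_norm(x, scale, eps=1e-6):
  # RMSNorm (Zhang & Sennrich, 2019): rescale by the root-mean-square of the
  # features, with a learned per-feature scale and no bias or mean-centering.
  mean_square = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
  return x * jax.lax.rsqrt(mean_square + eps) * scale


def swiglu_feedforward(x, w_gate, w_up, w_down):
  # GLU-based MLP (Shazeer, 2020): a SiLU-gated branch multiplies a linear
  # "up" branch, followed by a down-projection. All three projections are
  # plain matmuls with no biases.
  gate = jax.nn.silu(x @ w_gate)   # [..., hidden]
  up = x @ w_up                    # [..., hidden]
  return (gate * up) @ w_down      # [..., embed]


# Tiny smoke test with made-up dimensions.
embed, hidden = 8, 32
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jnp.ones((2, embed))
out = swiglu_feedforward(
    rms_norm(x, scale=jnp.ones((embed,))),
    w_gate=jax.random.normal(k1, (embed, hidden)),
    w_up=jax.random.normal(k2, (embed, hidden)),
    w_down=jax.random.normal(k3, (hidden, embed)),
)
assert out.shape == (2, embed)
```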
Classes
Marker for a global attention block.
Marker for a local sliding-window attention block.
Common configuration parameters for a "llama-like" transformer.
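The two attention-type markers distinguish how far back a block may attend. As a rough illustration (plain JAX, not Penzai's mask-building code), a global causal mask lets each query see every earlier position, while a sliding-window causal mask restricts it to the most recent `window` positions:

```python
# Sketch of the distinction the two attention-type markers encode.
# Plain JAX for illustration only.
import jax.numpy as jnp


def global_causal_mask(seq_len):
  # True where a query position may attend to a key position:
  # every key at or before the query.
  q = jnp.arange(seq_len)[:, None]
  k = jnp.arange(seq_len)[None, :]
  return k <= q


def sliding_window_causal_mask(seq_len, window):
  # Same as above, but each query only sees the most recent `window` keys
  # (including its own position).
  q = jnp.arange(seq_len)[:, None]
  k = jnp.arange(seq_len)[None, :]
  return (k <= q) & (k > q - window)


print(global_causal_mask(5).astype(int))
print(sliding_window_causal_mask(5, window=2).astype(int))
```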
Functions
Builds an attention block from a configuration.
Builds a transformer block from a configuration.
Creates a feedforward block.
Builds a Llama-like transformer model from a configuration.
Converts a "llama-like" HuggingFace model to a Penzai model.