llamalike_common
A common transformer family used by Llama, Mistral, Gemma, and other models.
This module implements a transformer variant with:
GLU-based MLPs (SwiGLU or GeGLU) as introduced by Shazeer (2020),
Optional multi-query (Shazeer, 2019) or grouped-query (Ainslie et al., 2023) attention,
Rotary positional embeddings (Su et al., 2021),
RMSNorm normalization (Zhang & Sennrich, 2019),
No biases in any dense kernels or layer norms.
This family includes many popular open-weights models, including Llama, Mistral, Gemma, and Reka. It is also similar to the PaLM model architecture (but without “parallel” feedforward and attention blocks).
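To make the listed ingredients concrete, here is a minimal plain-JAX sketch of RMSNorm and a bias-free SwiGLU feedforward. This is illustrative only: Penzai's actual blocks are built from its own named-axis layer types, and all shapes and names below are assumptions for exposition rather than this module's API.

```python
# Illustrative sketch only: a plain-JAX rendering of the ingredients above
# (RMSNorm and a SwiGLU feedforward with no bias terms). Not Penzai's
# implementation; dimensions and names are made up for exposition.
import jax
import jax.numpy as jnp


def rms_norm(x, scale, eps=1e-6):
  # RMSNorm (Zhang & Sennrich, 2019): rescale by the root-mean-square of the
  # features, with a learned per-feature scale and no bias or mean-centering.
  mean_square = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
  return x * jax.lax.rsqrt(mean_square + eps) * scale


def swiglu_feedforward(x, w_gate, w_up, w_down):
  # GLU-based MLP (Shazeer, 2020): a SiLU-gated branch multiplies a linear
  # "up" branch, followed by a down-projection. All three projections are
  # plain matmuls with no biases.
  gate = jax.nn.silu(x @ w_gate)   # [..., hidden]
  up = x @ w_up                    # [..., hidden]
  return (gate * up) @ w_down      # [..., embed]


# Tiny smoke test with made-up dimensions.
embed, hidden = 8, 32
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jnp.ones((2, embed))
out = swiglu_feedforward(
    rms_norm(x, scale=jnp.ones((embed,))),
    w_gate=jax.random.normal(k1, (embed, hidden)),
    w_up=jax.random.normal(k2, (embed, hidden)),
    w_down=jax.random.normal(k3, (hidden, embed)),
)
assert out.shape == (2, embed)
```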
Classes
Marker for a global attention block.
Marker for a local sliding-window attention block.
Common configuration parameters for a "llama-like" transformer.
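The two attention-type markers distinguish how far back a block may attend. As a rough illustration (plain JAX, not Penzai's mask-building code), a global causal mask lets each query see every earlier position, while a sliding-window causal mask restricts it to the most recent `window` positions:

```python
# Sketch of the distinction the two attention-type markers encode.
# Plain JAX for illustration only.
import jax.numpy as jnp


def global_causal_mask(seq_len):
  # True where a query position may attend to a key position:
  # every key at or before the query.
  q = jnp.arange(seq_len)[:, None]
  k = jnp.arange(seq_len)[None, :]
  return k <= q


def sliding_window_causal_mask(seq_len, window):
  # Same as above, but each query only sees the most recent `window` keys
  # (including its own position).
  q = jnp.arange(seq_len)[:, None]
  k = jnp.arange(seq_len)[None, :]
  return (k <= q) & (k > q - window)


print(global_causal_mask(5).astype(int))
print(sliding_window_causal_mask(5, window=2).astype(int))
```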
Functions
Builds an attention block from a configuration.
Builds a transformer block from a configuration.
Creates a feedforward block.
Builds a Llama-like transformer model from a configuration.
Converts a "llama-like" HuggingFace model to a Penzai model.