llamalike_common

A common transformer family used by Llama, Mistral, Gemma, and other models.

This module implements a transformer variant with:

  • GLU-based MLPs (SwiGLU or GeGLU) as introduced by Shazeer (2020),

  • Optional multi-query (Shazeer, 2019) or grouped-query (Ainslie et al., 2023) attention,

  • Rotary positional embeddings (Su et al., 2021),

  • RMSNorm normalization (Zhang & Sennrich, 2019),

  • No biases in any dense kernels or layer norms.

This family includes many popular open-weights models, including Llama, Mistral, Gemma, and Reka. It is also similar to the PaLM model architecture (but without “parallel” feedforward and attention blocks).
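To make the two less familiar components concrete, here is a rough plain-JAX sketch (not Penzai code) of the RMSNorm and SwiGLU feedforward steps listed above; the array names and shapes are illustrative only:

import jax
import jax.numpy as jnp

def rms_norm(x, scale, eps=1e-6):
    # RMSNorm (Zhang & Sennrich, 2019): rescale by the root-mean-square of
    # the features; no mean subtraction and no bias term.
    rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return (x / rms) * scale

def swiglu_feedforward(x, w_gate, w_up, w_down):
    # GLU-based MLP (Shazeer, 2020): a SiLU-gated projection multiplied
    # elementwise with a linear "up" projection, then projected back down.
    # No bias terms, matching the "no biases" convention above.
    return (jax.nn.silu(x @ w_gate) * (x @ w_up)) @ w_down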

Classes

AttentionTypeGlobalCausal

Marker for a global attention block.

AttentionTypeSlidingWindowCausal

Marker for a local sliding-window attention block.

LlamalikeTransformerConfig

Common configuration parameters for a "llama-like" transformer.

Functions

build_llamalike_attention(name, ...[, ...])

Builds an attention block from a configuration.

build_llamalike_block(name, init_base_rng, ...)

Builds a transformer block from a configuration.

build_llamalike_feedforward(name, ...)

Creates a feedforward block.

build_llamalike_transformer(config[, ...])

Builds a Llama-like transformer model from a configuration.

llamalike_from_huggingface_model(model[, ...])

Converts a "llama-like" HuggingFace model to a Penzai model.
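As a hedged usage sketch: the import path, keyword arguments, and configuration fields below are assumptions, so consult LlamalikeTransformerConfig and the function signatures above for the actual parameter names.

import jax
from penzai.models.transformer.variants import llamalike_common  # assumed import path

# Build a randomly initialized model from a configuration. The concrete
# configuration fields are documented on LlamalikeTransformerConfig.
config = llamalike_common.LlamalikeTransformerConfig(...)  # fill in model dimensions
model = llamalike_common.build_llamalike_transformer(
    config, init_base_rng=jax.random.PRNGKey(0))

# Or convert pretrained weights from a "llama-like" HuggingFace model:
# model = llamalike_common.llamalike_from_huggingface_model(hf_model)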