Gemma 4 31B-IT Quantized (INT4/INT8, HQQ)

Mixed-precision quantized version of google/gemma-4-31B-it for ExecuTorch CUDA inference.

The checkpoint is ~20 GB on disk and needs ~24 GB of GPU memory at runtime.
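As a rough sanity check on the checkpoint size, the sketch below estimates storage from bits per weight. It is a back-of-the-envelope calculation, not a description of the actual file layout: it assumes roughly 31e9 parameters, treats every weight as INT4 with group size 32, and assumes one 16-bit scale and one 16-bit zero-point per group. The function names are illustrative.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Payload bits plus per-group scale/zero-point overhead, amortized per weight."""
    # Assumption: each group stores one scale and one zero-point.
    return bits + (scale_bits + zero_bits) / group_size


def estimate_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and bits-per-weight into decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: ~31e9 parameters, all treated as INT4 with group size 32.
int4_bits = effective_bits_per_weight(bits=4, group_size=32)
print(f"effective bits per INT4 weight: {int4_bits}")
print(f"rough checkpoint size: {estimate_size_gb(31e9, int4_bits):.1f} GB")
```

Under these assumptions the estimate lands in the same ballpark as the ~20 GB figure. The INT8 layers, the INT8 embedding, and the bf16 norms shift the real number.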

Disclaimer: This is a quantized checkpoint intended for development and testing of the ExecuTorch CUDA export pipeline. The output quality has not been formally evaluated against the base model. Use at your own discretion for production workloads.

Quantization Details

Uses the "sensitive" recipe, in which quantization-sensitive weights get higher precision:

Component	Bits	Method	Group Size
Most linear layers (q, k, o, gate, up, lm_head)	INT4	HQQ (asymmetric)	32
Sensitive layers (v_proj, down_proj on layers 0-14, 45-59)	INT8	min_max	32
Embedding	INT8	min_max	per-axis
Norms, layer_scalar	bf16	—	—
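The linear-layer rows of the table can be read as a per-layer rule. The sketch below encodes that rule for illustration only. The function and constant names are hypothetical, not taken from quantize_and_save.py, and it assumes that v_proj and down_proj outside the listed layer ranges fall back to INT4, which the table implies but does not state outright.

```python
# Projections the table marks as sensitive.
SENSITIVE_PROJECTIONS = {"v_proj", "down_proj"}
# Inclusive layer ranges from the table: 0-14 and 45-59.
SENSITIVE_LAYER_RANGES = [(0, 14), (45, 59)]


def bits_for_linear(layer_idx: int, proj_name: str) -> int:
    """Return 8 for sensitive projections in the listed layers, otherwise 4."""
    in_sensitive_range = any(lo <= layer_idx <= hi for lo, hi in SENSITIVE_LAYER_RANGES)
    if proj_name in SENSITIVE_PROJECTIONS and in_sensitive_range:
        return 8
    return 4


for layer_idx, proj_name in [(0, "v_proj"), (0, "q_proj"), (30, "down_proj"), (59, "down_proj")]:
    print(f"layer {layer_idx:2d} {proj_name:9s} -> INT{bits_for_linear(layer_idx, proj_name)}")
```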

HQQ (Half-Quadratic Quantization) uses iterative scale and zero-point optimization for better accuracy than standard min/max quantization.
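To make the min/max baseline concrete, here is a minimal pure-Python sketch of asymmetric group-wise min/max quantization with group size 32. It is not HQQ and not the torchao implementation: it only shows the one-shot scale and zero-point that an iterative method would then refine. The floating-point zero-point and all function names are assumptions made for illustration.

```python
import random


def quantize_group_minmax(weights, bits):
    """Asymmetric min/max quantization of one group of floats.

    Returns (codes, scale, zero) such that w ~= (code - zero) * scale.
    """
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 for constant groups
    zero = -w_min / scale  # floating-point zero-point (an assumption of this sketch)
    codes = [min(qmax, max(0, round(w / scale + zero))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def roundtrip_error(weights, bits, group_size=32):
    """Mean absolute error after quantizing `weights` group by group."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, zero = quantize_group_minmax(group, bits)
        restored = dequantize_group(codes, scale, zero)
        total += sum(abs(w - r) for w, r in zip(group, restored))
    return total / len(weights)


# Synthetic weights, for illustration only.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err4 = roundtrip_error(weights, bits=4)
err8 = roundtrip_error(weights, bits=8)
print(f"INT4 mean abs error: {err4:.6f}")
print(f"INT8 mean abs error: {err8:.6f}")
```

On these synthetic weights the INT8 round trip has a much smaller error than the INT4 one, which is the motivation for keeping the sensitive layers at INT8.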

Prerequisites

ExecuTorch installed from source (instructions)
NVIDIA GPU with ~24 GB of VRAM (A100 40GB recommended) and the CUDA toolkit installed
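One way to check the VRAM prerequisite before exporting is to query nvidia-smi. The sketch below is an optional helper and not part of ExecuTorch. It assumes nvidia-smi is on PATH and reads "~24 GB" as 24 decimal gigabytes. The function names are illustrative.

```python
import shutil
import subprocess


def parse_gpu_memory_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_gpu_memory_mib() -> list[int]:
    """Return total memory in MiB for each visible GPU, or [] if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory_mib(result.stdout)


def has_enough_vram(memory_mib: list[int], required_gb: float = 24.0) -> bool:
    """True if any GPU has at least `required_gb` decimal gigabytes of memory."""
    return any(mib * 1024 * 1024 >= required_gb * 1e9 for mib in memory_mib)
```

Calling has_enough_vram(query_gpu_memory_mib()) gives a quick yes/no answer. It only looks at total memory, so it does not account for memory already in use by other processes.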

How to Download

huggingface-cli download SocialLocalMobile/gemma-4-31B-it-HQQ-INT4 --local-dir gemma-4-31B-it-HQQ-INT4

How to Use

Eager Inference (Python)

cd executorch/examples/models/gemma4_31b
python inference.py \
    --prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
    --prompt "The capital of France is" \
    --max-new-tokens 128

Export to ExecuTorch (.pte)

cd executorch/examples/models/gemma4_31b
python export.py \
    --prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
    --output-dir ./gemma4_31b_exports \
    --backend cuda

Build and Run (C++)

make gemma4_31b-cuda

cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \
    --model_path ./gemma4_31b_exports/model.pte \
    --data_path ./gemma4_31b_exports/aoti_cuda_blob.ptd \
    --tokenizer_path /path/to/gemma-4-31B-it-HQQ-INT4/tokenizer.json \
    --prompt "Write a short joke about saving RAM." \
    --max_new_tokens 128
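When scripting the runner, it can help to verify the exported files before launching it. The helper below is a hypothetical convenience wrapper, not part of the repository. The binary path, file names, and flags are copied from the commands above, and it assumes the export step wrote model.pte and aoti_cuda_blob.ptd into the export directory.

```python
from pathlib import Path

# Runner path as produced by `make gemma4_31b-cuda` in the command above.
RUNNER = "cmake-out/examples/models/gemma4_31b/gemma4_31b_runner"


def build_runner_command(export_dir, checkpoint_dir, prompt, max_new_tokens=128):
    """Build the runner argv list, failing early if an input file is missing."""
    export_dir, checkpoint_dir = Path(export_dir), Path(checkpoint_dir)
    required = {
        "--model_path": export_dir / "model.pte",
        "--data_path": export_dir / "aoti_cuda_blob.ptd",
        "--tokenizer_path": checkpoint_dir / "tokenizer.json",
    }
    missing = [str(path) for path in required.values() if not path.is_file()]
    if missing:
        raise FileNotFoundError("missing runner inputs: " + ", ".join(missing))
    cmd = [RUNNER]
    for flag, path in required.items():
        cmd += [flag, str(path)]
    cmd += ["--prompt", prompt, "--max_new_tokens", str(max_new_tokens)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory where the runner was built.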

Files

File	Description
`model.safetensors`	Quantized weights (torchao Int4Tensor + IntxUnpackedToInt8Tensor + bf16)
`config.json`	Model architecture configuration
`tokenizer.json`	HuggingFace tokenizer
`tokenizer_config.json`	Tokenizer configuration
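After downloading, a quick way to confirm the directory is complete and to peek at the stored tensors without loading ~20 GB is to read only the safetensors header. The sketch below uses just the standard library and relies on the published safetensors layout: an 8-byte little-endian header length followed by a JSON header. The helper names are illustrative, and the tensor names in the header depend on how torchao serialized the weights, so none are assumed here.

```python
import json
import struct
from pathlib import Path

# File names from the table above.
EXPECTED_FILES = ["model.safetensors", "config.json", "tokenizer.json", "tokenizer_config.json"]


def missing_files(checkpoint_dir):
    """Return the expected checkpoint files that are not present in the directory."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize_header(header, limit=10):
    """Print name, dtype, and shape for the first `limit` tensors in the header."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    for name in sorted(tensors)[:limit]:
        print(name, tensors[name]["dtype"], tensors[name]["shape"])
    print(f"{len(tensors)} tensors total")
```

For example, summarize_header(read_safetensors_header("gemma-4-31B-it-HQQ-INT4/model.safetensors")) lists the first few entries, assuming the download location from the command above.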

How to Reproduce

cd executorch/examples/models/gemma4_31b

python quantize_and_save.py \
    --model-dir /path/to/gemma-4-31B-it \
    --output /path/to/gemma-4-31B-it-HQQ-INT4 \
    --quant-recipe sensitive

Quantization requires ~30 GB of RAM, and CUDA is required for the HQQ asymmetric quantization step.