Gemma 4 31B-IT Quantized (INT4/INT8, HQQ)

Mixed-precision quantized version of google/gemma-4-31B-it for ExecuTorch CUDA inference.

The checkpoint is ~20 GB on disk and needs ~24 GB of GPU memory at runtime.

Disclaimer: This is a quantized checkpoint intended for development and testing of the ExecuTorch CUDA export pipeline. The output quality has not been formally evaluated against the base model. Use at your own discretion for production workloads.

Quantization Details

The checkpoint uses the "sensitive" recipe, which keeps quantization-sensitive weights at higher precision:

| Component | Bits | Method | Group Size |
|---|---|---|---|
| Most linear layers (q, k, o, gate, up, lm_head) | INT4 | HQQ (asymmetric) | 32 |
| Sensitive layers (v_proj, down_proj on layers 0-14, 45-59) | INT8 | min_max | 32 |
| Embedding | INT8 | min_max | per-axis |
| Norms, layer_scalar | bf16 | - | - |
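
Read as a rule, the table above assigns a precision to each weight by name and layer index. The sketch below is one way to express that rule. It assumes Hugging Face-style parameter names such as `model.layers.3.self_attn.v_proj.weight`, and it reads the layer ranges as applying to both v_proj and down_proj. The helper `bits_for` and the name patterns are illustrative and are not taken from the export scripts.

```python
import re

# Layer ranges and projections the table marks as sensitive.
SENSITIVE_LAYERS = set(range(0, 15)) | set(range(45, 60))
SENSITIVE_PROJS = {"v_proj", "down_proj"}

# Assumed naming scheme: "...layers.<index>.<block>.<proj>.weight".
_LAYER_RE = re.compile(r"layers\.(\d+)\..*?(\w+_proj)\.weight$")


def bits_for(param_name: str) -> str:
    """Return the precision the recipe table gives a weight with this name."""
    if "norm" in param_name or "layer_scalar" in param_name:
        return "bf16"
    if "embed" in param_name:
        return "int8"
    match = _LAYER_RE.search(param_name)
    if match:
        layer, proj = int(match.group(1)), match.group(2)
        if proj in SENSITIVE_PROJS and layer in SENSITIVE_LAYERS:
            return "int8"
    # Everything else (q, k, o, gate, up, lm_head, and v/down in the middle layers).
    return "int4"
```

For example, `bits_for("model.layers.20.self_attn.v_proj.weight")` falls outside both sensitive ranges and so maps to `"int4"`.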

HQQ (Half-Quadratic Quantization) iteratively optimizes the scale and zero point of each group, which typically gives better accuracy than plain min/max quantization.
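
For reference, the min/max baseline mentioned above can be sketched in plain Python. This is not the HQQ algorithm and not the torchao implementation. It only illustrates asymmetric group-wise quantization, where each group of 32 values gets its own scale and zero point. All function names are illustrative.

```python
def quantize_group(values, bits=4):
    """Asymmetric min/max quantization of one group of floats."""
    qmax = (1 << bits) - 1              # 15 for INT4, 255 for INT8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0     # fall back to 1.0 so constant groups do not divide by zero
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=32, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

With this scheme the reconstruction error of each value is bounded by about half the group's scale, so smaller groups or more bits shrink the error. HQQ keeps the same storage layout but searches for better scale and zero-point values than the min/max choice.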

Prerequisites

  • ExecuTorch installed from source (see the ExecuTorch installation instructions)
  • NVIDIA GPU with the CUDA toolkit installed (~24 GB VRAM needed; A100 40GB recommended)
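
A quick way to check the hardware side of these prerequisites is a small preflight script. The sketch below assumes `nvidia-smi` is on the PATH and uses the approximate figures from this card (~20 GB checkpoint, ~24 GB VRAM). The thresholds and function names are illustrative, and the VRAM threshold is set slightly below 24 GiB because 24 GB cards report a little less than that.

```python
import shutil
import subprocess

REQUIRED_VRAM_MIB = 24_000   # approximate: "~24 GB" at runtime
REQUIRED_DISK_GIB = 20       # approximate: "~20 GB" checkpoint


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one memory.total value (MiB) per GPU, skipping non-numeric lines."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines()
            if line.strip().isdigit()]


def gpu_vram_mib() -> list[int]:
    """Total memory per visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_vram_mib(result.stdout) if result.returncode == 0 else []


def preflight(download_dir: str = ".") -> dict:
    """Report whether disk space and GPU memory look sufficient."""
    free_gib = shutil.disk_usage(download_dir).free / 2**30
    vram = gpu_vram_mib()
    return {
        "disk_ok": free_gib >= REQUIRED_DISK_GIB,
        "gpu_ok": any(v >= REQUIRED_VRAM_MIB for v in vram),
        "gpus_mib": vram,
    }
```

Calling `preflight("/path/to/download/dir")` returns a small report. It checks total GPU memory, not free memory, so other processes on the GPU can still cause out-of-memory errors.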

How to Download

huggingface-cli download SocialLocalMobile/gemma-4-31B-it-HQQ-INT4 --local-dir gemma-4-31B-it-HQQ-INT4
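
The same download can be scripted from Python. A minimal sketch, assuming the `huggingface_hub` package is installed; the wrapper name `download_checkpoint` is illustrative.

```python
REPO_ID = "SocialLocalMobile/gemma-4-31B-it-HQQ-INT4"


def download_checkpoint(local_dir: str = "gemma-4-31B-it-HQQ-INT4") -> str:
    """Download the repository snapshot and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` mirrors the CLI command above and places the files in `gemma-4-31B-it-HQQ-INT4` under the current directory.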

How to Use

Eager Inference (Python)

cd executorch/examples/models/gemma4_31b
python inference.py \
    --prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
    --prompt "The capital of France is" \
    --max-new-tokens 128

Export to ExecuTorch (.pte)

cd executorch/examples/models/gemma4_31b
python export.py \
    --prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
    --output-dir ./gemma4_31b_exports \
    --backend cuda

Build and Run (C++)

make gemma4_31b-cuda

cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \
    --model_path ./gemma4_31b_exports/model.pte \
    --data_path ./gemma4_31b_exports/aoti_cuda_blob.ptd \
    --tokenizer_path /path/to/gemma-4-31B-it-HQQ-INT4/tokenizer.json \
    --prompt "Write a short joke about saving RAM." \
    --max_new_tokens 128
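
Before launching the runner it can help to confirm that the export step produced both files the command expects. A minimal sketch; the file names come from the command above, and the helper `missing_exports` is illustrative.

```python
from pathlib import Path

# File names taken from the runner command above.
EXPECTED_EXPORTS = ("model.pte", "aoti_cuda_blob.ptd")


def missing_exports(export_dir: str) -> list[str]:
    """Return the expected export files that are absent or empty."""
    root = Path(export_dir)
    return [
        name for name in EXPECTED_EXPORTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

An empty list from `missing_exports("./gemma4_31b_exports")` means both artifacts are present. It does not validate their contents.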

Files

| File | Description |
|---|---|
| model.safetensors | Quantized weights (torchao Int4Tensor + IntxUnpackedToInt8Tensor + bf16) |
| config.json | Model architecture configuration |
| tokenizer.json | HuggingFace tokenizer |
| tokenizer_config.json | Tokenizer configuration |
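
To see which storage dtypes `model.safetensors` contains without loading ~20 GB of weights, the file's JSON header can be read directly. The sketch below relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The function names are illustrative, and the dtypes reported are the on-disk storage types, not the torchao tensor subclasses.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header; tensor data is never loaded."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len))


def dtype_counts(header: dict) -> Counter:
    """Count tensors per dtype, skipping the optional '__metadata__' entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )
```

For example, `dtype_counts(read_safetensors_header("/path/to/gemma-4-31B-it-HQQ-INT4/model.safetensors"))` gives a per-dtype tensor count for the downloaded checkpoint.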

How to Reproduce

cd executorch/examples/models/gemma4_31b

python quantize_and_save.py \
    --model-dir /path/to/gemma-4-31B-it \
    --output /path/to/gemma-4-31B-it-HQQ-INT4 \
    --quant-recipe sensitive

Requires ~30 GB of RAM. CUDA is required for HQQ asymmetric quantization.

Base Model

google/gemma-4-31B-it
