Gemma 4 31B-IT Quantized (INT4/INT8, HQQ)
Mixed-precision quantized version of google/gemma-4-31B-it for ExecuTorch CUDA inference.
~20 GB checkpoint, ~24 GB GPU memory at runtime.
Disclaimer: This is a quantized checkpoint intended for development and testing of the ExecuTorch CUDA export pipeline. The output quality has not been formally evaluated against the base model. Use at your own discretion for production workloads.
Quantization Details
This checkpoint uses the "sensitive" recipe, which keeps quantization-sensitive weights at higher precision:
| Component | Bits | Method | Group Size |
|---|---|---|---|
| Most linear layers (q, k, o, gate, up, lm_head) | INT4 | HQQ (asymmetric) | 32 |
| Sensitive layers (v_proj, down_proj on layers 0-14, 45-59) | INT8 | min_max | 32 |
| Embedding | INT8 | min_max | per-axis |
| Norms, layer_scalar | bf16 | — | — |
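The mapping in the table can be read as a simple rule over parameter names. The sketch below is only an illustration: the function names and the parameter-name pattern (`model.layers.N.<block>.<proj>.weight`, as is typical for Hugging Face checkpoints) are assumptions, not the logic used by the ExecuTorch quantization script.

```python
import re

# Projection names and layer ranges come from the table above.
# Everything else (function names, name patterns) is hypothetical.
SENSITIVE_PROJECTIONS = ("v_proj", "down_proj")


def is_sensitive_layer(index: int) -> bool:
    """Layers 0-14 and 45-59 are listed as sensitive."""
    return 0 <= index <= 14 or 45 <= index <= 59


def precision_for(param_name: str) -> str:
    """Return 'bf16', 'int8', or 'int4' for an assumed parameter name."""
    # Norms and layer_scalar stay in bf16.
    if "norm" in param_name or "layer_scalar" in param_name:
        return "bf16"
    # The embedding is quantized to INT8 (per-axis).
    if "embed" in param_name:
        return "int8"
    # v_proj / down_proj in the sensitive layer ranges get INT8.
    match = re.search(r"layers\.(\d+)\.", param_name)
    if match and any(p in param_name for p in SENSITIVE_PROJECTIONS):
        if is_sensitive_layer(int(match.group(1))):
            return "int8"
    # All remaining linear layers (q, k, o, gate, up, lm_head) get INT4.
    return "int4"
```

For example, under these assumptions `precision_for("model.layers.3.self_attn.v_proj.weight")` falls in the INT8 group, while the same projection on layer 20 falls in the INT4 group.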
HQQ (Half-Quadratic Quantization) uses iterative scale and zero-point optimization for better accuracy than standard min/max quantization.
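For reference, the min/max baseline mentioned above can be sketched in plain Python. This is a minimal illustration of group-wise asymmetric min/max quantization with a floating-point zero-point. It is not the HQQ optimizer and not the torchao implementation, and the function names are hypothetical. HQQ would start from values like these and iteratively refine the scale and zero-point.

```python
def quantize_group_minmax(weights, bits=4):
    """Asymmetric min/max quantization of one group of floats.

    Returns (codes, scale, zero), where codes are integers in [0, 2**bits - 1].
    """
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Map [lo, hi] onto [0, qmax]; guard against a constant group.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = -lo / scale  # floating-point zero-point
    codes = [max(0, min(qmax, round(w / scale + zero))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate float values from integer codes."""
    return [(q - zero) * scale for q in codes]


def quantize_row(row, group_size=32, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    return [
        quantize_group_minmax(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

Each group carries its own scale and zero-point, so a smaller group size (32 here) tracks local weight ranges more closely at the cost of storing more metadata.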
Prerequisites
- ExecuTorch installed from source (instructions)
- NVIDIA GPU with CUDA toolkit (~24 GB VRAM; A100 40GB recommended)
How to Download
huggingface-cli download SocialLocalMobile/gemma-4-31B-it-HQQ-INT4 --local-dir gemma-4-31B-it-HQQ-INT4
How to Use
Eager Inference (Python)
cd executorch/examples/models/gemma4_31b
python inference.py \
--prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
--prompt "The capital of France is" \
--max-new-tokens 128
Export to ExecuTorch (.pte)
cd executorch/examples/models/gemma4_31b
python export.py \
--prequantized /path/to/gemma-4-31B-it-HQQ-INT4 \
--output-dir ./gemma4_31b_exports \
--backend cuda
Build and Run (C++)
make gemma4_31b-cuda
cmake-out/examples/models/gemma4_31b/gemma4_31b_runner \
--model_path ./gemma4_31b_exports/model.pte \
--data_path ./gemma4_31b_exports/aoti_cuda_blob.ptd \
--tokenizer_path /path/to/gemma-4-31B-it-HQQ-INT4/tokenizer.json \
--prompt "Write a short joke about saving RAM." \
--max_new_tokens 128
Files
| File | Description |
|---|---|
| model.safetensors | Quantized weights (torchao Int4Tensor + IntxUnpackedToInt8Tensor + bf16) |
| config.json | Model architecture configuration |
| tokenizer.json | HuggingFace tokenizer |
| tokenizer_config.json | Tokenizer configuration |
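After downloading, a quick check that the files listed above are present can save a failed export later. The helper below is a small sketch using only the standard library. The function name is hypothetical, and it checks only that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("/path/to/gemma-4-31B-it-HQQ-INT4")` means all four files were found.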
How to Reproduce
cd executorch/examples/models/gemma4_31b
python quantize_and_save.py \
--model-dir /path/to/gemma-4-31B-it \
--output /path/to/gemma-4-31B-it-HQQ-INT4 \
--quant-recipe sensitive
Quantization requires ~30 GB of system RAM. A CUDA device is required for HQQ asymmetric quantization.
Base Model
- Model: google/gemma-4-31B-it
- Architecture: 60-layer dense transformer, hybrid sliding/full attention
- Parameters: 31B
- License: Gemma 4 License (Apache 2.0)