mellum2-12b-a2_5b-thinking-mxfp4-mlx
MLX quantization of JetBrains/Mellum2-12B-A2.5B-Thinking for Apple Silicon.
Variant: Block float MX FP4
Disk size: 6165 MB
Quantized by: sahilchachra
Benchmark results
Evaluated on Apple M5 Pro with MLX. Model loaded once; performance and quality measured in a single pass.
Performance
| Metric | This model | FP16 baseline |
|---|---|---|
| Decode tok/s (steady-state) | 134.45 | N/A |
| Prefill tok/s (steady-state) | 287.1 | N/A |
| Decode tok/s (avg, long traces) | 129.98 | N/A |
| Peak memory (GB) | 6.898 | N/A |
| Disk size (MB) | 6165 | 23183 |
Steady-state figures were measured after warm-up, with a short chat-templated prompt and thinking disabled. They represent steady-state decode for typical chat use; long thinking traces will be slower due to KV-cache growth.
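The disk sizes in the table allow a quick sanity check on the quantization. The following is a minimal sketch, assuming both sizes cover the same tensors and that the baseline stores 16 bits per weight. The MXFP4 layout used for comparison (4-bit elements sharing one 8-bit scale per block of 32) comes from the OCP Microscaling format, not from this card.

```python
# Disk sizes in MB, taken from the Performance table.
quant_mb = 6165
fp16_mb = 23183

compression = fp16_mb / quant_mb          # how many times smaller the quant is
effective_bpw = 16 * quant_mb / fp16_mb   # average bits per weight on disk

# Assumption (OCP Microscaling format, not stated in this card): MXFP4 stores
# 4-bit elements plus one shared 8-bit scale per block of 32 elements.
nominal_mxfp4_bpw = 4 + 8 / 32

print(f"compression: {compression:.2f}x")
print(f"effective bits per weight: {effective_bpw:.2f}")
print(f"nominal MXFP4 bits per weight: {nominal_mxfp4_bpw:.2f}")
```

The effective figure comes out marginally above the nominal one, which would be consistent with some tensors or metadata being stored at higher precision, though the card does not say which.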
Quality
Benchmarks also reported on the upstream JetBrains card (bf16)
The JetBrains card (bf16) column is the score published on the original model card. The This model column is measured locally with this quant variant; sample sizes and prompts differ, so treat the comparison as directional.
| Benchmark | This model | JetBrains card (bf16) | n |
|---|---|---|---|
| IFEval (instruction following) | 63.6% | 76.5% | 44 |
| MMLU (knowledge, accuracy) | 90.0% | 86.2% (MMLU-Redux) | 50 |
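With sample sizes of 44 and 50, each local score carries substantial sampling uncertainty, which is one reason to read the comparison as directional. The sketch below computes a Wilson score interval for each row. The correct-answer counts are inferred from the reported percentages (an assumption), and the interval only reflects sampling noise, not the prompt differences mentioned above.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts inferred from the reported percentages (assumption):
# 63.6% of 44 -> 28 correct, 90.0% of 50 -> 45 correct.
for name, correct, n in [("IFEval", 28, 44), ("MMLU", 45, 50)]:
    lo, hi = wilson_interval(correct, n)
    print(f"{name}: {correct / n:.1%} (95% interval {lo:.1%} to {hi:.1%})")
```

The IFEval interval spans well over twenty percentage points, so small gaps between columns should not be over-interpreted.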
Additional benchmarks (our suite)
These benchmarks are not on the upstream card, so there is no external reference. The FP16 baseline column reflects local fp16 runs where available and is N/A otherwise.
| Benchmark | This model | FP16 baseline | n |
|---|---|---|---|
| MATH-500 (math reasoning) | 80.0% (answered 28/30) | N/A | 30 |
| HumanEval (code, pass@1) | 93.3% | N/A | 30 |
MATH-500 per-level accuracy
| Level | This model | FP16 baseline |
|---|---|---|
| level 1 | 83.3% | N/A |
| level 2 | 100.0% | N/A |
| level 3 | 66.7% | N/A |
| level 4 | 66.7% | N/A |
| level 5 | 83.3% | N/A |
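The per-level figures can be cross-checked against the overall MATH-500 score. This is a small sketch that assumes the 30 problems are split evenly, six per level; the card does not state the split, so treat the reconstructed counts as illustrative.

```python
# Per-level accuracies from the MATH-500 table, in percent.
per_level = {1: 83.3, 2: 100.0, 3: 66.7, 4: 66.7, 5: 83.3}

# Assumption (not stated in the card): the 30 problems are split evenly,
# six per difficulty level.
problems_per_level = 6

correct_per_level = {
    level: round(acc / 100 * problems_per_level)
    for level, acc in per_level.items()
}
total_correct = sum(correct_per_level.values())
total = problems_per_level * len(per_level)

print(correct_per_level)
print(f"overall: {total_correct}/{total} = {total_correct / total:.1%}")
```

Under that assumption the per-level counts sum to 24 of 30, which matches the 80.0% headline figure.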
Context scaling (decode tok/s)
| Context length | Decode tok/s |
|---|---|
| ~128 tokens | 135.1 |
| ~256 tokens | 134.0 |
| ~512 tokens | 133.9 |
| ~1024 tokens | 131.8 |
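To put the context-scaling table in relative terms, the snippet below expresses each row as a slowdown against the shortest context. It uses only the numbers in the table; the context lengths are approximate, as the table indicates.

```python
# Decode throughput (tok/s) by approximate context length, from the table.
decode_tps = {128: 135.1, 256: 134.0, 512: 133.9, 1024: 131.8}

baseline = decode_tps[128]
for ctx, tps in decode_tps.items():
    slowdown = (baseline - tps) / baseline
    print(f"~{ctx:>4} tokens: {tps:.1f} tok/s ({slowdown:.1%} below the ~128-token rate)")
```

Across this range the drop is about 2.4%, so the larger slowdown expected for long thinking traces would only show up at context lengths beyond those measured here.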
Usage
pip install mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=256, verbose=True)
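The performance figures on this card were measured with the chat template applied, whereas the snippet above passes a raw string. A minimal sketch of the templated variant follows, using the tokenizer's `apply_chat_template` method as in the mlx-lm README. The helper names `build_messages` and `run_chat` are hypothetical, and how to toggle thinking for this model is not covered here because the card does not document it.

```python
def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a plain prompt in the single-turn chat message format."""
    return [{"role": "user", "content": user_prompt}]


def run_chat(user_prompt: str, max_tokens: int = 256) -> str:
    """Generate a reply with the chat template applied (hypothetical helper)."""
    # Imported lazily so this module can be loaded on machines without MLX.
    from mlx_lm import load, generate

    model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx")
    # Returns the templated prompt, ending with the assistant turn marker.
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens, verbose=True)
```

Calling `run_chat("Your prompt here")` on an Apple Silicon machine downloads the model on first use and then prints the generation as it streams.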
Heads-up for Mellum2: mlx-lm support landed in PR #1339 and may not yet be in the released PyPI package. If load(...) complains about an unknown mellum model type, install the PR branch:
pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1339/head"
Also note: this repo ships a fixed eos_token_id=28 (<|im_end|>) in config.json and generation_config.json. The JetBrains source has eos_token_id=0 (<|endoftext|>), which the chat template never emits, so generation runs to max_tokens on every call. The fix is already applied here.
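If you want to confirm the EOS setting in a local copy of the weights, the two config files can be inspected directly. The sketch below is one way to do that with the standard library; `eos_ids` and `check_eos` are hypothetical helper names, and the model directory path is whatever location you downloaded the repo to.

```python
import json
from pathlib import Path


def eos_ids(config: dict) -> set[int]:
    """Return eos_token_id as a set; the field may be an int or a list."""
    value = config.get("eos_token_id")
    if value is None:
        return set()
    return set(value) if isinstance(value, list) else {value}


def check_eos(model_dir: str, expected: int = 28) -> dict[str, bool]:
    """Report, per config file, whether `expected` is among its EOS ids."""
    results = {}
    for name in ("config.json", "generation_config.json"):
        path = Path(model_dir) / name
        results[name] = expected in eos_ids(json.loads(path.read_text()))
    return results
```

For a local copy of this repo, `check_eos(...)` should report True for both files; going by the note above, a copy of the JetBrains source would be expected to report False.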
All variants in this collection
| Model | Variant |
|---|---|
| sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx | Block float MX FP4 ← this model |
| sahilchachra/mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx | OptiQ mixed-precision (target 5.0 bpw) |
Notes
- Requires Apple Silicon (M1 or later) with MLX
- Benchmarks run on Apple M5 Pro, 24 GB unified memory
- License: see JetBrains/Mellum2-12B-A2.5B-Thinking for the original model's license
Original model
See JetBrains/Mellum2-12B-A2.5B-Thinking for full model details and intended use.