mellum2-12b-a2_5b-thinking-mxfp4-mlx

MLX quantization of JetBrains/Mellum2-12B-A2.5B-Thinking for Apple Silicon.

Variant: Block float MX FP4
Disk size: 6165 MB
Quantized by: sahilchachra

Benchmark results

Evaluated on Apple M5 Pro with MLX. Model loaded once; performance and quality measured in a single pass.

Performance

Metric	This model	FP16 baseline
Decode tok/s (steady-state)	134.45	N/A
Prefill tok/s (steady-state)	287.1	N/A
Decode tok/s (avg, long traces)	129.98	N/A
Peak memory (GB)	6.898	N/A
Disk size (MB)	6165	23183

Steady-state figures were measured after warm-up, with a short chat-templated prompt and thinking disabled. They represent decode speed for typical chat use; long thinking traces will be slower due to KV-cache growth.
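The disk-size row can be turned into an approximate bits-per-weight figure. The sketch below is only a sanity check under stated assumptions: it assumes the FP16 checkpoint stores every weight in 16 bits, and it compares against the nominal MXFP4 cost from the OCP Microscaling format (4-bit elements plus one 8-bit shared scale per block of 32). Some tensors in the quantized repo stay in higher precision, so an exact match is not expected. The function and constant names are illustrative.

```python
# Disk sizes in MB, taken from the performance table.
QUANT_MB = 6165
FP16_MB = 23183


def effective_bits_per_weight(quant_mb: float, fp16_mb: float, fp16_bits: int = 16) -> float:
    """Average bits per weight implied by the size ratio.

    Assumes the FP16 checkpoint spends exactly `fp16_bits` on every weight.
    """
    return quant_mb / fp16_mb * fp16_bits


# Nominal MXFP4 cost: 4-bit elements + one 8-bit scale shared by 32 elements.
MXFP4_NOMINAL_BITS = 4 + 8 / 32

ratio = FP16_MB / QUANT_MB
bits = effective_bits_per_weight(QUANT_MB, FP16_MB)

print(f"compression: {ratio:.2f}x smaller than FP16")
print(f"effective bits/weight: {bits:.2f} (MXFP4 nominal: {MXFP4_NOMINAL_BITS})")
```

The implied figure lands close to the nominal MXFP4 cost, which is consistent with most of the weights being stored in 4-bit blocks.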

Quality

Benchmarks also reported on the upstream JetBrains card (bf16)

The JetBrains card (bf16) column is the score published on the original model card. The This model column is measured locally with this quantized variant; sample sizes and prompts differ, so treat the comparison as directional.

Benchmark	This model	JetBrains card (bf16)	n
IFEval (instruction following)	63.6%	76.5%	44
MMLU (knowledge, accuracy)	90.0%	86.2% (MMLU-Redux)	50
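To make "directional" concrete, the sketch below computes a 95% Wilson score interval for each local score. It assumes every item is scored pass/fail independently, and it infers the correct counts from the table (63.6% of 44 is 28, 90.0% of 50 is 45); neither assumption is stated on the card.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts inferred from the table: 63.6% of 44 -> 28, 90.0% of 50 -> 45.
SCORES = [("IFEval", 28, 44), ("MMLU", 45, 50)]

for name, correct, n in SCORES:
    lo, hi = wilson_interval(correct, n)
    print(f"{name}: {correct}/{n} = {correct / n:.1%}, 95% interval {lo:.1%} to {hi:.1%}")
```

With sample sizes in the 40s and 50s the intervals span well over ten percentage points, so small gaps to the upstream numbers should not be read as evidence of quantization loss or gain.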

Additional benchmarks (our suite)

These benchmarks are not on the upstream card, so there is no external reference. The FP16 baseline column is reserved for local FP16 runs; N/A means no such run is available.

Benchmark	This model	FP16 baseline	n
MATH-500 (math reasoning)	80.0% (answered 28/30)	N/A	30
HumanEval (code, pass@1)	93.3%	N/A	30

MATH-500 per-level accuracy

Level	This model	FP16 baseline
level 1	83.3%	N/A
level 2	100.0%	N/A
level 3	66.7%	N/A
level 4	66.7%	N/A
level 5	83.3%	N/A
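The per-level percentages can be cross-checked against the overall MATH-500 score. The sketch below assumes the 30 problems are split evenly, 6 per level, which the card does not state, and reads "answered 28/30" as 28 responses that produced a final answer.

```python
# Per-level accuracies (%) from the table above.
LEVEL_ACC = {1: 83.3, 2: 100.0, 3: 66.7, 4: 66.7, 5: 83.3}
PER_LEVEL_N = 6  # assumption: 30 problems split evenly across 5 levels
ANSWERED = 28    # from "answered 28/30" in the MATH-500 row

# Recover integer correct counts from the rounded percentages.
correct_by_level = {
    level: round(acc / 100 * PER_LEVEL_N) for level, acc in LEVEL_ACC.items()
}
total_correct = sum(correct_by_level.values())
total_n = PER_LEVEL_N * len(LEVEL_ACC)

print(correct_by_level)
print(f"overall: {total_correct}/{total_n} = {total_correct / total_n:.1%}")
print(f"among answered: {total_correct}/{ANSWERED} = {total_correct / ANSWERED:.1%}")
```

Under that assumption the level counts sum to the reported overall score. With only 6 problems per level, a single problem moves a level's accuracy by about 17 points, so the non-monotonic pattern across levels is not meaningful on its own.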

Context scaling (decode tok/s)

Context length	Decode tok/s
~128 tokens	135.1
~256 tokens	134.0
~512 tokens	133.9
~1024 tokens	131.8
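Throughput is easier to reason about as per-token latency. The sketch below converts the table rows to milliseconds per token and reports the slowdown relative to the shortest context; it uses only the numbers above and makes no claim about behaviour beyond ~1024 tokens.

```python
# (approximate context length in tokens, decode tok/s) from the table above.
CONTEXT_DECODE = [(128, 135.1), (256, 134.0), (512, 133.9), (1024, 131.8)]


def ms_per_token(tok_per_s: float) -> float:
    """Convert decode throughput to per-token latency in milliseconds."""
    return 1000.0 / tok_per_s


base_ctx, base_tps = CONTEXT_DECODE[0]
for ctx, tps in CONTEXT_DECODE:
    slowdown = (base_tps - tps) / base_tps
    print(
        f"~{ctx:>4} tokens: {ms_per_token(tps):.2f} ms/token, "
        f"{slowdown:.1%} slower than ~{base_ctx}"
    )
```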

Usage

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx")
response = generate(model, tokenizer, prompt="Your prompt here", max_tokens=256, verbose=True)
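The benchmark figures were measured with a chat-templated prompt and thinking disabled, whereas the snippet above passes a raw string. A minimal sketch of the templated variant follows. `build_chat_prompt` and `run_example` are illustrative names, and the `enable_thinking` template variable is an assumption: it is a common convention for thinking models, but check the model's chat template for the switch it actually uses.

```python
def build_chat_prompt(tokenizer, user_text: str, enable_thinking: bool = False) -> str:
    """Wrap a single user message in the model's chat template."""
    messages = [{"role": "user", "content": user_text}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        # Assumption: the template reads an `enable_thinking` variable.
        # Templates that do not use it simply ignore the extra keyword.
        enable_thinking=enable_thinking,
    )


def run_example() -> None:
    # Imported here so the helper above can be used without MLX installed.
    from mlx_lm import load, generate

    model, tokenizer = load("sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx")
    prompt = build_chat_prompt(tokenizer, "Write a Python function that reverses a string.")
    generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

Call `run_example()` on an Apple Silicon machine with mlx-lm installed. For thinking traces, pass `enable_thinking=True` and raise `max_tokens`, since reasoning output can be much longer than the final answer.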

Heads-up for Mellum2: mlx-lm support landed in PR #1339 and may not yet be in the released PyPI package. If load(...) fails with an unknown mellum model type, install the PR branch:
pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1339/head"
Also note: this repo ships a fixed eos_token_id=28 (<|im_end|>) in config.json and generation_config.json. The JetBrains source has eos_token_id=0 (<|endoftext|>), which the chat template never emits, so with the upstream value generation runs to max_tokens on every call. The fix is already applied here.
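If generation still runs to max_tokens with a local copy or a different conversion, the EOS setting is worth checking first. The sketch below inspects the two config files named above in a model directory. The helper names are illustrative and the directory path is whatever local snapshot you have; nothing here is part of mlx-lm.

```python
import json
from pathlib import Path

EXPECTED_EOS_ID = 28  # <|im_end|>, per the note above


def eos_ids(config: dict) -> list[int]:
    """Return eos_token_id as a list; configs may store an int or a list."""
    value = config.get("eos_token_id")
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def check_model_dir(model_dir: str) -> dict[str, bool]:
    """For each config file present, report whether it lists the expected EOS id."""
    results = {}
    for name in ("config.json", "generation_config.json"):
        path = Path(model_dir) / name
        if path.exists():
            results[name] = EXPECTED_EOS_ID in eos_ids(json.loads(path.read_text()))
    return results
```

A `False` entry for either file suggests the copy still carries the upstream value and will not stop at the end of a chat turn.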

All variants in this collection

Model	Variant
sahilchachra/mellum2-12b-a2_5b-thinking-mxfp4-mlx	Block float MX FP4 ← this model
sahilchachra/mellum2-12b-a2_5b-thinking-optiq-5bpw-mlx	OptiQ mixed-precision (target 5.0 bpw)

Notes

Requires Apple Silicon (M1 or later) with MLX
Benchmarks run on Apple M5 Pro, 24 GB unified memory
License: see JetBrains/Mellum2-12B-A2.5B-Thinking for the original model's license

Original model

See JetBrains/Mellum2-12B-A2.5B-Thinking for full model details and intended use.
