Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill

This model is a fine-tuned version of Google's instruction-tuned Gemma-4-12B-it, trained with QLoRA supervised fine-tuning (SFT) in Unsloth on all 25,000 examples of the synthetic reasoning dataset WithinUsAI/gemini_3.5_flash_distilled_25k.

🌟 Model Highlights

Task Alignment: Fine-tuned on high-quality synthetic traces distilled from Gemini 3.5 Flash, covering agentic code synthesis, dense context reasoning, mathematical engine traces, and systemic execution.
Structured Math JSON Engine Traces: The model has learned to represent complex mathematical reasoning as structured JSON engine traces (specifying problem_type, input_dimensions, execution_trace, and error_bounds), mimicking the mathematical_engine_traces subset (14% of the training dataset).
Robust Instruction Following: In the side-by-side evaluation, the fine-tuned model avoided the long-tail repetition seen in the base model (e.g. regex patterns that repeated until the token limit) and produced clean, valid JSON schemas.
Stable Training: Training converged to a final SFT loss of 0.2652 after 1 full epoch (6,250 steps) on a single RTX 4090.
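
As an illustration of the trace format described above, the following sketch checks that a model response parses as JSON and carries the four fields named in the highlights. Only the field names come from this card; the example values, the `validate_trace` helper, and the assumption that all four fields are required top-level keys are hypothetical.

```python
import json

# Field names taken from the model card; treating them as required
# top-level keys is an assumption, not a documented schema.
REQUIRED_FIELDS = ("problem_type", "input_dimensions", "execution_trace", "error_bounds")


def validate_trace(raw: str) -> list[str]:
    """Return a list of problems found in a model response (empty if none)."""
    try:
        trace = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncated or looping generations typically fail here.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(trace, dict):
        return ["top-level value is not a JSON object"]
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in trace]


# Hypothetical response, invented for illustration only.
example = json.dumps({
    "problem_type": "probability",
    "input_dimensions": {"dice": 5, "sides": 6},
    "execution_trace": ["enumerate outcomes", "convolve distributions"],
    "error_bounds": {"absolute": 0.0},
})

print(validate_trace(example))        # no problems reported
print(validate_trace(example[:40]))   # truncated output fails to parse
```

A check like this separates the two failure modes mentioned above: output cut off at the token limit fails at the parsing step, while a complete but off-schema response is reported field by field.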

📊 Evaluation & Capability Comparison

Both the base model and this fine-tuned model were evaluated side-by-side using identical prompts across six capability dimensions.

1. Quantitative Performance (25K SFT Run)

Metric	Base Model	Fine-Tuned Model (25K)	Delta
Total Tokens Generated	5,482	6,208	+726 (more detailed/structured)
Total Inference Time	750.4s	847.0s	+96.6s
Average Generation Speed	7.3 tok/s	7.3 tok/s	0.0 (identical)

Generation Length & Speed by Category

Category	Base Speed (tok/s)	FT Speed (tok/s)	Base Tokens	FT Tokens
Code Generation	7.1	7.3	1,241	939
Mathematical Reasoning	7.4	7.4	1,020	2,048
Analytical Reasoning	7.5	7.4	707	628
Instruction Following	7.3	7.3	1,195	1,345
Creative Writing	7.5	7.2	351	368
Debugging & Analysis	7.3	7.3	968	880
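
The totals in the first table can be cross-checked against the per-category rows. The sketch below copies the token counts and timings from the tables above and recomputes the totals, the delta, and the average throughput; the variable names are illustrative.

```python
# Per-category token counts copied from the table above: (base, fine-tuned).
tokens = {
    "Code Generation": (1241, 939),
    "Mathematical Reasoning": (1020, 2048),
    "Analytical Reasoning": (707, 628),
    "Instruction Following": (1195, 1345),
    "Creative Writing": (351, 368),
    "Debugging & Analysis": (968, 880),
}
# Total inference time in seconds, from the summary table.
base_seconds, ft_seconds = 750.4, 847.0

base_total = sum(base for base, _ in tokens.values())
ft_total = sum(ft for _, ft in tokens.values())

print(base_total, ft_total, ft_total - base_total)   # 5482 6208 726
print(round(base_total / base_seconds, 1))           # 7.3
print(round(ft_total / ft_seconds, 1))               # 7.3
```

Most of the extra tokens come from Mathematical Reasoning (+1,028), while the Code Generation, Analytical Reasoning, and Debugging & Analysis outputs became shorter.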

2. Qualitative Discoveries & Alignment Analysis

Mathematical Reasoning (JSON Trace Alignment): When given a math word problem ("A fair 6-sided die is rolled 5 times..."), the fine-tuned model formats its thinking process as a structured JSON engine execution trace (specifying inputs, convolution steps, and error bounds). This mirrors the mathematical_engine_traces subset (14% of the training dataset).
Instruction Following: The base model generated invalid JSON schemas because repetitive long-tail regex loops ran until the maximum token limit. The fine-tuned model generated clean, complete, and valid JSON schemas for library book inventory, rate limiting, and pagination.
Debugging: Both models identified and corrected critical logic bugs (infinite loops, float division, index bounds) in Python algorithms, with the fine-tuned model showing a preference for structured code blocks.
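
To make the "convolution steps" in the math trace concrete, the sketch below computes the exact distribution of the sum of five fair six-sided dice by repeated convolution. The full evaluation prompt is not reproduced in this card, so treating the question as one about the sum of the rolls is an assumption, and `convolve` and `dice_sum_distribution` are illustrative helpers, not the model's output.

```python
from fractions import Fraction


def convolve(p: dict[int, Fraction], q: dict[int, Fraction]) -> dict[int, Fraction]:
    """Distribution of X + Y for independent X ~ p and Y ~ q."""
    out: dict[int, Fraction] = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out


def dice_sum_distribution(rolls: int, sides: int = 6) -> dict[int, Fraction]:
    """Exact distribution of the sum of `rolls` fair dice with `sides` faces."""
    die = {face: Fraction(1, sides) for face in range(1, sides + 1)}
    dist = {0: Fraction(1)}  # the sum of zero dice is 0 with certainty
    for _ in range(rolls):
        dist = convolve(dist, die)  # one convolution step per roll
    return dist


dist = dice_sum_distribution(5)
print(dist[5])    # 1/7776, only one way to roll all ones
print(dist[17])   # 65/648, i.e. 780/7776
```

Using `Fraction` keeps every step exact, so this toy case needs no error bound; a floating-point version of the same computation would.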

📊 Base Model Benchmarks

According to Google DeepMind's official benchmarks for the Gemma model family (instruction-tuned variants):

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 12B Unified	Gemma 4 E4B	Gemma 4 E2B	Gemma 3 27B (no think)
MMLU Pro	85.2%	82.6%	77.2%	69.4%	60.0%	67.6%
AIME 2026 no tools	89.2%	88.3%	77.5%	42.5%	37.5%	20.8%
LiveCodeBench v6	80.0%	77.1%	72.0%	52.0%	44.0%	29.1%
Codeforces ELO	2150	1718	1659	940	633	110
GPQA Diamond	84.3%	82.3%	78.8%	58.6%	43.4%	42.4%
Tau2 (average over 3)	76.9%	68.2%	69.0%	42.2%	24.5%	16.2%
HLE no tools	19.5%	8.7%	5.2%	-	-	-
HLE with search	26.5%	17.2%	-	-	-	-
BigBench Extra Hard	74.4%	64.8%	53.0%	33.1%	21.9%	19.3%
MMMLU	88.4%	86.3%	83.4%	76.6%	67.4%	70.7%
Vision: MMMU Pro	76.9%	73.8%	69.1%	52.6%	44.2%	49.7%
Vision: OmniDocBench 1.5 (edit dist, lower is better)	0.131	0.149	0.164	0.181	0.290	0.365
Vision: MATH-Vision	85.6%	82.4%	79.7%	59.5%	52.4%	46.0%
Vision: MedXPertQA MM	61.3%	58.1%	48.7%	28.7%	23.5%	-

⚙️ Hyperparameters & Training Settings

The model was trained with the following hyperparameters:

Parameter	Value	Rationale
LoRA Rank (r)	16	Balances trainable parameters against memory use
LoRA Alpha ($\alpha$)	32	Standard 2x rank
Target Modules	All linear layers	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Max Sequence Length	2048	Accommodates 100% of the dataset's sequence lengths
Batch Size	1 (per device)	Single RTX 4090 memory constraint
Gradient Accumulation	4	Effective batch size of 4
Learning Rate	2e-4	Standard for QLoRA SFT
Optimizer	AdamW 8-bit	Memory-efficient training
Precision	4-bit (QLoRA)	Double quantization to fit the 12B model
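
The step count reported in the highlights follows from these settings. As a rough sketch, assuming one optimizer step per accumulated batch, no dropped examples, and a single GPU (the `TrainingSetup` dataclass is hypothetical and not part of Unsloth):

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the values in the table above."""
    num_examples: int = 25_000
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    num_devices: int = 1
    epochs: int = 1
    lora_rank: int = 16
    lora_alpha: int = 32

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.gradient_accumulation_steps * self.num_devices

    @property
    def optimizer_steps(self) -> int:
        # One optimizer step per effective batch, rounded up per epoch.
        return math.ceil(self.num_examples / self.effective_batch_size) * self.epochs

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the adapter update by alpha / r.
        return self.lora_alpha / self.lora_rank


setup = TrainingSetup()
print(setup.effective_batch_size)  # 4
print(setup.optimizer_steps)       # 6250
print(setup.lora_scaling)          # 2.0
```

With 25,000 examples and an effective batch size of 4, one epoch works out to 6,250 optimizer steps, which matches the figure in the highlights.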

🚀 How to Use

Load the LoRA adapter directly with Unsloth:

from unsloth import FastModel

# Load the adapter (and its 4-bit base model) from the Hugging Face Hub
model, tokenizer = FastModel.from_pretrained(
    model_name="Ayodele01/Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)  # switch the model to inference mode

# Inference example
messages = [{"role": "user", "content": "Write a thread-safe LRU cache with TTL in Python."}]
# Render the chat template to text first, then tokenize it
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text=prompt, return_tensors="pt").to("cuda")
# temperature only applies when sampling is enabled, so set do_sample=True explicitly
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Prompt Template

This model uses the standard Gemma-4 chat format:

<|turn>user
{ prompt }<|turn>model

🔒 License & Usage

This model is subject to the Gemma Terms of Use.

Model tree for Ayodele01/Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill: google/gemma-4-12B (base model) → google/gemma-4-12B-it (fine-tuned) → unsloth/gemma-4-12b-it (fine-tuned) → this model (adapter)
