How to use from the PEFT library
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("unsloth/gemma-4-12b-it")
model = PeftModel.from_pretrained(base_model, "Ayodele01/Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill")

Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill

This model is a fine-tuned version of Google's instruction-tuned Gemma-4-12B-it base model. It was trained with QLoRA supervised fine-tuning (SFT) using Unsloth on all 25,000 examples of the synthetic reasoning dataset WithinUsAI/gemini_3.5_flash_distilled_25k.

🌟 Model Highlights

  • Task Alignment: Fine-tuned on high-quality synthetic traces distilled from Gemini 3.5 Flash, covering agentic code synthesis, dense context reasoning, mathematical engine traces, and systemic execution.
  • Structured Math JSON Engine Traces: The model has learned to represent complex mathematical reasoning as structured JSON engine traces (specifying problem_type, input_dimensions, execution_trace, and error_bounds), mimicking the mathematical_engine_traces subset (14% of the training dataset).
  • Robust Instruction Following: In the side-by-side evaluation, the fine-tuned model avoided the long-tail repetition seen in the base model (e.g. infinite regex loops) and produced clean, valid JSON schemas.
  • Stable Training: Training converged to a final SFT loss of 0.2652 after 1 full epoch (6,250 steps) on a single RTX 4090.
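
The trace fields named in the highlights can be checked mechanically. The following is a minimal sketch rather than the author's evaluation code: the helper `missing_trace_fields` and the sample payload are hypothetical, and only the four field names come from this model card.

```python
import json

# Field names taken from the model card; their value types are not documented there.
REQUIRED_FIELDS = ("problem_type", "input_dimensions", "execution_trace", "error_bounds")


def missing_trace_fields(raw_output: str) -> list[str]:
    """Return the required fields that are absent from a JSON trace.

    Raises ValueError if the text is not a JSON object at all
    (for example, output truncated at the token limit).
    """
    try:
        trace = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(trace, dict):
        raise ValueError("expected a JSON object at the top level")
    return [field for field in REQUIRED_FIELDS if field not in trace]


# Hypothetical payload for illustration only; real model output will differ.
sample = json.dumps({
    "problem_type": "probability",
    "input_dimensions": {"dice": 5, "sides": 6},
    "execution_trace": ["step 1", "step 2"],
})
print(missing_trace_fields(sample))  # ['error_bounds']
```

A check like this separates two failure modes: text that does not parse as JSON, and JSON that parses but omits part of the trace structure.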

📊 Evaluation & Capability Comparison

Both the base model and this fine-tuned model were evaluated side-by-side using identical prompts across six capability dimensions.

1. Quantitative Performance (25K SFT Run)

| Metric | Base Model | Fine-Tuned Model (25K) | Delta |
|---|---|---|---|
| Total Tokens Generated | 5,482 | 6,208 | +726 (more detailed/structured) |
| Total Inference Time | 750.4 s | 847.0 s | +96.6 s |
| Average Generation Speed | 7.3 tok/s | 7.3 tok/s | 0.0 (identical) |

Generation Length & Speed by Category

| Category | Base Speed (tok/s) | FT Speed (tok/s) | Base Tokens | FT Tokens |
|---|---|---|---|---|
| Code Generation | 7.1 | 7.3 | 1,241 | 939 |
| Mathematical Reasoning | 7.4 | 7.4 | 1,020 | 2,048 |
| Analytical Reasoning | 7.5 | 7.4 | 707 | 628 |
| Instruction Following | 7.3 | 7.3 | 1,195 | 1,345 |
| Creative Writing | 7.5 | 7.2 | 351 | 368 |
| Debugging & Analysis | 7.3 | 7.3 | 968 | 880 |
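
As a consistency check, the per-category token counts can be summed and compared with the reported totals. This is a small sketch using only the numbers reported in this section; the variable names are illustrative.

```python
# Per-category figures from this section: (base tokens, fine-tuned tokens).
tokens_by_category = {
    "Code Generation": (1241, 939),
    "Mathematical Reasoning": (1020, 2048),
    "Analytical Reasoning": (707, 628),
    "Instruction Following": (1195, 1345),
    "Creative Writing": (351, 368),
    "Debugging & Analysis": (968, 880),
}

base_total = sum(base for base, _ in tokens_by_category.values())
ft_total = sum(ft for _, ft in tokens_by_category.values())

# Total inference times in seconds, as reported for the full run.
base_seconds, ft_seconds = 750.4, 847.0

# Token delta per category (fine-tuned minus base).
deltas = {name: ft - base for name, (base, ft) in tokens_by_category.items()}

print(base_total, ft_total, ft_total - base_total)  # 5482 6208 726
print(round(base_total / base_seconds, 1), round(ft_total / ft_seconds, 1))  # 7.3 7.3
print(deltas["Mathematical Reasoning"])  # 1028
```

The totals match the reported figures. The per-category deltas also show that Mathematical Reasoning alone adds 1,028 tokens, while Code Generation, Analytical Reasoning, and Debugging & Analysis outputs became shorter.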

2. Qualitative Discoveries & Alignment Analysis

  • Mathematical Reasoning (JSON Trace Alignment): When given a math word problem ("A fair 6-sided die is rolled 5 times..."), the fine-tuned model formats its reasoning as a structured JSON engine execution trace (specifying inputs, convolution steps, and error bounds). This mimics the mathematical_engine_traces subset (14% of the training dataset).
  • Instruction Following: The base model generated invalid JSON schemas due to repetitive long-tail regex loops that hit the maximum token limit. The fine-tuned model successfully generated clean, complete, and valid JSON schemas for library book inventory, rate limiting, and pagination.
  • Debugging: Both models identified and corrected critical logic bugs (infinite loops, float divisions, index bounds) in Python algorithms, with the fine-tuned model showing a preference for structured code blocks.

📊 Base Model Benchmarks

According to Google DeepMind's official benchmarks for the Gemma model family (with instruction-tuned evaluations):

| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 12B Unified | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
|---|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 77.2% | 69.4% | 60.0% | 67.6% |
| AIME 2026 no tools | 89.2% | 88.3% | 77.5% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 72.0% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 1659 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 78.8% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3) | 76.9% | 68.2% | 69.0% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | 5.2% | - | - | - |
| HLE with search | 26.5% | 17.2% | - | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 53.0% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 83.4% | 76.6% | 67.4% | 70.7% |
| Vision: MMMU Pro | 76.9% | 73.8% | 69.1% | 52.6% | 44.2% | 49.7% |
| Vision: OmniDocBench 1.5 (edit dist, lower is better) | 0.131 | 0.149 | 0.164 | 0.181 | 0.290 | 0.365 |
| Vision: MATH-Vision | 85.6% | 82.4% | 79.7% | 59.5% | 52.4% | 46.0% |
| Vision: MedXPertQA MM | 61.3% | 58.1% | 48.7% | 28.7% | 23.5% | - |

⚙️ Hyperparameters & Training Settings

The model was trained with the following hyperparameters:

| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank (r) | 16 | Balance of parameters and memory |
| LoRA Alpha ($\alpha$) | 32 | Standard 2x rank |
| Target Modules | All linear layers | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 | Accommodates 100% of dataset distribution |
| Batch Size | 1 (per device) | Single RTX 4090 memory constraint |
| Gradient Accumulation | 4 | Effective batch size of 4 |
| Learning Rate | 2e-4 | Standard QLoRA SFT |
| Optimizer | AdamW 8-bit | Memory-efficient training |
| Precision | 4-bit (QLoRA) | Double-quantization to fit 12B model |
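
These settings are consistent with the 6,250 training steps reported in the highlights. The short sketch below works through the arithmetic; it assumes one optimizer step per accumulated batch and the standard LoRA scaling factor of alpha / r (that is, rank-stabilized LoRA is not in use, which the model card does not state either way).

```python
import math

num_examples = 25_000        # dataset size stated in this model card
per_device_batch_size = 1
gradient_accumulation = 4
num_devices = 1              # single RTX 4090
lora_rank, lora_alpha = 16, 32

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation * num_devices

# Optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Assumption: standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_rank

print(effective_batch, steps_per_epoch, lora_scaling)  # 4 6250 2.0
```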

🚀 How to Use

Loading the LoRA adapter natively with Unsloth:

import torch
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="Ayodele01/Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

# Inference example: render the chat template to text, then tokenize it
messages = [{"role": "user", "content": "Write a thread-safe LRU cache with TTL in Python."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text=prompt, return_tensors="pt").to("cuda")
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Prompt Template

This model uses the standard Gemma-4 chat format:

<|turn>user
{ prompt }<|turn>model
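
For contexts where no tokenizer is at hand, the template can be filled in by plain string formatting. This is only a sketch: `build_prompt` is a hypothetical helper that reproduces the template exactly as printed above, and the tokenizer's own `apply_chat_template` remains the authoritative way to build prompts, since it may add tokens not shown here.

```python
# Template copied verbatim from the model card, with {prompt} as the only slot.
GEMMA_TEMPLATE = "<|turn>user\n{prompt}<|turn>model"


def build_prompt(user_message: str) -> str:
    """Fill the single-turn template with one user message (hypothetical helper)."""
    return GEMMA_TEMPLATE.format(prompt=user_message)


print(build_prompt("Write a thread-safe LRU cache with TTL in Python."))
```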

🔒 License & Usage

This model is subject to the Gemma Terms of Use.
