Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill
This model is a fine-tuned version of Google's instruction-tuned Gemma-4-12B-it, trained with QLoRA supervised fine-tuning (SFT) using Unsloth on all 25,000 examples of the synthetic reasoning dataset WithinUsAI/gemini_3.5_flash_distilled_25k.
🌟 Model Highlights
- Task Alignment: Fine-tuned on high-quality synthetic traces distilled from Gemini 3.5 Flash, covering agentic code synthesis, dense context reasoning, mathematical engine traces, and systemic execution.
- Structured Math JSON Engine Traces: The model has learned to represent complex mathematical reasoning as structured JSON engine traces (specifying `problem_type`, `input_dimensions`, `execution_trace`, and `error_bounds`), mimicking the `mathematical_engine_traces` subset (14% of the training dataset).
- Robust Instruction Following: On the evaluation prompts, the fine-tuned model avoided the long-tail repetition behaviors seen in the base model (e.g. infinite regex loops) and produced clean, valid JSON schemas.
- Stable Training: Training converged to a final SFT loss of 0.2652 over 1 full epoch (6,250 steps) on a single RTX 4090.
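
The four trace fields named in the highlights can be checked mechanically. The following is a minimal sketch, assuming a trace arrives as a JSON string with those fields at the top level. The helper name `missing_trace_fields` and the sample values are hypothetical, and the value types used in the actual dataset are not documented in this card.

```python
import json

# Field names are taken from the model card; everything else is illustrative.
REQUIRED_FIELDS = ("problem_type", "input_dimensions", "execution_trace", "error_bounds")


def missing_trace_fields(raw: str) -> list[str]:
    """Return the required fields that are absent from a JSON trace string.

    Raises json.JSONDecodeError if the text is not valid JSON
    (for example, output that was cut off at the token limit).
    """
    trace = json.loads(raw)
    if not isinstance(trace, dict):
        raise ValueError("expected a JSON object at the top level")
    return [name for name in REQUIRED_FIELDS if name not in trace]


# Hypothetical sample; real traces from the model may be shaped differently.
sample = json.dumps({
    "problem_type": "probability",
    "input_dimensions": {"rolls": 5, "sides": 6},
    "execution_trace": ["step 1", "step 2"],
    "error_bounds": 0.0,
})
print(missing_trace_fields(sample))  # prints []
```

An empty list means every expected field is present; it says nothing about whether the reasoning inside the trace is correct.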
📊 Evaluation & Capability Comparison
Both the base model and this fine-tuned model were evaluated side-by-side using identical prompts across six capability dimensions.
1. Quantitative Performance (25K SFT Run)
| Metric | Base Model | Fine-Tuned Model (25K) | Delta |
|---|---|---|---|
| Total Tokens Generated | 5,482 | 6,208 | +726 (more detailed/structured) |
| Total Inference Time | 750.4s | 847.0s | +96.6s |
| Average Generation Speed | 7.3 tok/s | 7.3 tok/s | 0.0 (identical) |
Generation Length & Speed by Category
| Category | Base Speed (tok/s) | FT Speed (tok/s) | Base Tokens | FT Tokens |
|---|---|---|---|---|
| Code Generation | 7.1 | 7.3 | 1,241 | 939 |
| Mathematical Reasoning | 7.4 | 7.4 | 1,020 | 2,048 |
| Analytical Reasoning | 7.5 | 7.4 | 707 | 628 |
| Instruction Following | 7.3 | 7.3 | 1,195 | 1,345 |
| Creative Writing | 7.5 | 7.2 | 351 | 368 |
| Debugging & Analysis | 7.3 | 7.3 | 968 | 880 |
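
The totals in the first table follow from the per-category rows. As a quick consistency check, this sketch sums the token counts copied from the table above and recomputes overall throughput as total tokens divided by total inference time; the variable names are illustrative.

```python
# Per-category token counts copied from the table above: (base, fine-tuned)
TOKENS = {
    "Code Generation": (1241, 939),
    "Mathematical Reasoning": (1020, 2048),
    "Analytical Reasoning": (707, 628),
    "Instruction Following": (1195, 1345),
    "Creative Writing": (351, 368),
    "Debugging & Analysis": (968, 880),
}

# Total inference times in seconds, from the quantitative table
BASE_SECONDS = 750.4
FT_SECONDS = 847.0

base_total = sum(base for base, _ in TOKENS.values())
ft_total = sum(ft for _, ft in TOKENS.values())

# Overall throughput = total tokens / total inference time
base_speed = base_total / BASE_SECONDS
ft_speed = ft_total / FT_SECONDS

print(base_total, ft_total, ft_total - base_total)  # prints 5482 6208 726
print(round(base_speed, 1), round(ft_speed, 1))     # prints 7.3 7.3
```

Most of the +726 token difference comes from the Mathematical Reasoning row, where the fine-tuned output is roughly twice as long as the base output.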
2. Qualitative Discoveries & Alignment Analysis
- Mathematical Reasoning (JSON Trace Alignment): When given a math word problem ("A fair 6-sided die is rolled 5 times..."), the fine-tuned model formats its thinking process as a structured JSON engine execution trace (specifying inputs, convolution steps, and error bounds). This mimics the `mathematical_engine_traces` subset (14% of the training dataset).
- Instruction Following: The base model generated invalid JSON schemas because repetitive long-tail regex loops ran until the maximum token limit was reached. The fine-tuned model generated clean, complete, and valid JSON schemas for library book inventory, rate limiting, and pagination.
- Debugging: Both models identified and corrected critical logic bugs (infinite loops, float divisions, index bounds) in Python algorithms, with the fine-tuned model showing a preference for structured code blocks.
📊 Base Model Benchmarks
According to Google DeepMind's official benchmarks for the Gemma model family (with instruction-tuned evaluations):
| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 12B Unified | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
|---|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 77.2% | 69.4% | 60.0% | 67.6% |
| AIME 2026 no tools | 89.2% | 88.3% | 77.5% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 72.0% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 1659 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 78.8% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3) | 76.9% | 68.2% | 69.0% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | 5.2% | - | - | - |
| HLE with search | 26.5% | 17.2% | - | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 53.0% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 83.4% | 76.6% | 67.4% | 70.7% |
| Vision: MMMU Pro | 76.9% | 73.8% | 69.1% | 52.6% | 44.2% | 49.7% |
| Vision: OmniDocBench 1.5 (edit dist, lower is better) | 0.131 | 0.149 | 0.164 | 0.181 | 0.290 | 0.365 |
| Vision: MATH-Vision | 85.6% | 82.4% | 79.7% | 59.5% | 52.4% | 46.0% |
| Vision: MedXPertQA MM | 61.3% | 58.1% | 48.7% | 28.7% | 23.5% | - |
⚙️ Hyperparameters & Training Settings
The model was trained with the following hyperparameters:
| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank (r) | 16 | Balance of parameters and memory |
| LoRA Alpha ($\alpha$) | 32 | Standard 2x rank |
| Target Modules | All linear layers | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2048 | Accommodates 100% of dataset distribution |
| Batch Size | 1 (per device) | Single RTX 4090 memory constraint |
| Gradient Accumulation | 4 | Effective batch size of 4 |
| Learning Rate | 2e-4 | Standard QLoRA SFT |
| Optimizer | AdamW 8-bit | Memory-efficient training |
| Precision | 4-bit (QLoRA) | Double-quantization to fit 12B model |
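
Several of these settings are linked by simple arithmetic. The sketch below derives the effective batch size, the number of optimizer steps per epoch, and the LoRA scaling factor from the table values. It assumes a single device, no sequence packing, no dropped examples, and standard LoRA scaling ($\alpha / r$, not the rank-stabilized variant); the `TrainingSetup` class is a hypothetical helper, not part of Unsloth or PEFT.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Values from the hyperparameter table; num_devices is an assumption."""
    num_examples: int = 25_000
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 4
    num_devices: int = 1  # single RTX 4090
    lora_rank: int = 16
    lora_alpha: int = 32

    @property
    def effective_batch_size(self) -> int:
        # Examples consumed per optimizer step
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)

    @property
    def steps_per_epoch(self) -> int:
        # Optimizer steps needed to see every example once
        return math.ceil(self.num_examples / self.effective_batch_size)

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the adapter update by alpha / r
        return self.lora_alpha / self.lora_rank


setup = TrainingSetup()
print(setup.effective_batch_size)  # prints 4
print(setup.steps_per_epoch)       # prints 6250
print(setup.lora_scaling)          # prints 2.0
```

The 6,250 steps match the single-epoch step count reported in the highlights.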
🚀 How to Use
Loading the LoRA adapter natively with Unsloth:
```python
import torch
from unsloth import FastModel

# Load the LoRA adapter in 4-bit
model, tokenizer = FastModel.from_pretrained(
    model_name="Ayodele01/Gemma-4-12B-Gemini-3.5-flash-Reasoning-Distill",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

# Inference Example
messages = [{"role": "user", "content": "Write a thread-safe LRU cache with TTL in Python."}]
inputs = tokenizer(
    text=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Prompt Template
This model uses the standard Gemma-4 chat format:
<|turn>user
{ prompt }<|turn>model
🔒 License & Usage
This model is subject to the Gemma Terms of Use.