---
license: apache-2.0
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
tags:
- vLLM
---

# Mistral Small 4 119B A6B

Mistral Small 4 is a powerful hybrid model capable of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families into a single model: **Instruct**, **Reasoning** (previously called Magistral), and **Devstral**.

With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a general-purpose model suited to a wide range of tasks. In a latency-optimized setup, Mistral Small 4 achieves a **40% reduction in end-to-end completion time**, and in a throughput-optimized setup, it handles **3x more requests per second** compared to Mistral Small 3.

To further improve efficiency, you can take advantage of either:
- Speculative decoding thanks to our trained eagle head [`mistralai/Mistral-Small-4-119B-2603-eagle`](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle).
- 4-bit floating-point quantization thanks to our NVFP4 checkpoint [`mistralai/Mistral-Small-4-119B-2603-NVFP4`](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-NVFP4).

## Key Features

Mistral Small 4 includes the following architectural choices:

- **MoE**: 128 experts, 4 active.
- **119B parameters**, with **6.5B activated per token**.
- **256k context length**.
- **Multimodal input**: Accepts both text and image input, with text output.
- **Instruct and Reasoning functionalities** with function calls (reasoning effort configurable per request).

Mistral Small 4 offers the following capabilities:

- **Reasoning Mode**: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
- **Vision**: Analyzes images and provides insights based on visual content, in addition to text.
- **Multilingual**: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
- **System Prompt**: Strong adherence and support for system prompts.
- **Agentic**: Best-in-class agentic capabilities with native function calling and JSON output.
- **Speed-Optimized**: Delivers best-in-class performance and speed.
- **Apache 2.0 License**: Open-source license for both commercial and non-commercial use.
- **Large Context Window**: Supports a 256k context window.

## Recommended Settings

- **Reasoning Effort**:
  - `'none'` → Do not use reasoning
  - `'high'` → Use reasoning (recommended for complex prompts)

  Use `reasoning_effort="high"` for complex tasks.
- **Temperature**: 0.7 for `reasoning_effort="high"`. Between 0.0 and 0.7 for `reasoning_effort="none"`, depending on the task.

## Use Cases

Mistral Small 4 is designed for general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode enabled). Its multimodal capabilities also enable document and image understanding for data extraction and analysis.

Its capabilities are ideal for:

- Developers interested in coding and agentic capabilities for SWE automation and codebase exploration.
- Enterprises seeking general chat assistants, agents, and document understanding.
- Researchers leveraging its math and research capabilities.

Mistral Small 4 is also well-suited for customization and fine-tuning for more specialized tasks.

### Examples

- General chat assistant
- Document parsing and extraction
- Coding agent
- Research assistant
- Customization & fine-tuning
- And more...

## Benchmarks

### Comparison with internal models

Depending on your task, you can trigger reasoning with the **per-request** parameter `reasoning_effort`.
Set it to:

- `reasoning_effort="none"`: Fast, lightweight responses for everyday tasks, in the same chat style as [`mistralai/Mistral-Small-3.2-24B-Instruct-2506`](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506).
- `reasoning_effort="high"`: Deep, step-by-step reasoning for complex problems, with verbosity equivalent to previous Magistral models such as [`mistralai/Magistral-Small-2509`](https://huggingface.co/mistralai/Magistral-Small-2509).

![Internal benchmark](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/resolve/main/images/image2.png)

#### Comparing Reasoning Models

![Internal benchmark - Reasoning](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/resolve/main/images/image3.png)

### Comparison with other models

Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B across all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores **0.72** with just **1.6K characters**, whereas Qwen models require **3.5-4x more output** (5.8-6.1K) for comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing **20% less output**. This efficiency reduces latency and inference costs and improves the user experience.

![Comparison benchmark - LCR](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/resolve/main/images/lcr.png)

![Comparison benchmark - LiveCodeBench](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/resolve/main/images/livecode.png)

![Comparison benchmark - AIME25](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/resolve/main/images/aime.png)

## Usage

Mistral Small 4 is supported by multiple libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this happen.
### Inference

The model can be deployed with:

- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
- [`llama.cpp`](https://github.com/ggml-org/llama.cpp): See [here](https://huggingface.co/unsloth/Mistral-Small-4-119B-2603-GGUF) for Unsloth's GGUFs
- [`LM Studio`](https://lmstudio.ai/): See [here](https://lmstudio.ai/models/mistralai/mistral-small-4)
- [`SGLang`](https://github.com/sgl-project/sglang): See [here](https://docs.sglang.io/basic_usage/send_request.html)
- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)

If local serving performance is subpar, we recommend using the Mistral AI API.

### Fine-Tuning

Fine-tune the model via:

- [`Axolotl`](https://github.com/axolotl-ai-cloud/axolotl): See [here](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/mistral4).

## vLLM (Recommended)

We recommend using Mistral Small 4 with the [vLLM library](https://github.com/vllm-project/vllm) for production-ready inference.

### Installation

Install the latest **vllm**:

```
uv pip install -U vllm
```

Doing so should automatically install [`mistral_common >= 1.11.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.11.0).

To check:

```
python -c "import mistral_common; print(mistral_common.__version__)"
```

You can also use a ready-to-go [docker image](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile), also available on the [docker hub](https://hub.docker.com/layers/vllm/vllm-openai/latest).

### Serve the Model

We recommend a server/client setup:

```bash
vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max_num_batched_tokens 16384 --max_num_seqs 128 \
  --gpu_memory_utilization 0.8
```

### Ping the Server
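Before sending requests, it can help to keep the recommended sampling settings in one place. The helper below is a minimal sketch and not part of the vLLM or OpenAI APIs: `request_settings` is a hypothetical name, and the default of 0.1 for `reasoning_effort="none"` is an assumption that simply mirrors the value used in the examples of this section (anything between 0.0 and 0.7 is within the recommended range).

```python
from typing import Optional


def request_settings(reasoning_effort: str, temperature: Optional[float] = None) -> dict:
    """Build chat-completion keyword arguments from the recommended settings.

    Hypothetical helper, for illustration only.
    """
    if reasoning_effort not in ("none", "high"):
        raise ValueError("reasoning_effort must be 'none' or 'high'")

    if reasoning_effort == "high":
        # Recommended: temperature 0.7 when reasoning is enabled.
        if temperature is None:
            temperature = 0.7
    else:
        # Recommended: between 0.0 and 0.7 depending on the task.
        # 0.1 is an assumed default taken from the examples, not a fixed rule.
        if temperature is None:
            temperature = 0.1
        if not 0.0 <= temperature <= 0.7:
            raise ValueError("temperature should be between 0.0 and 0.7 without reasoning")

    return {"reasoning_effort": reasoning_effort, "temperature": temperature}
```

With an OpenAI-compatible client, the result could be unpacked directly into the request, for example `client.chat.completions.create(model=model, messages=messages, **request_settings("high"))`.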
#### Instruction Following

Mistral Small 4 can follow your instructions to the letter.

```python
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1  # use TEMP = 0.7 for reasoning_effort="high"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first (and only) model exposed by the vLLM server.
models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    # Download the system prompt template and fill in its placeholders.
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

assistant_message = response.choices[0].message.content
print(assistant_message)
```
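To check a reply to the alphabet prompt above programmatically, a small validator can help. The following is a minimal sketch and not part of the original example: `follows_alphabet` is a hypothetical helper, and it assumes that words are separated by whitespace and that surrounding punctuation should be ignored.

```python
import string


def follows_alphabet(sentence: str) -> bool:
    """Return True if the sentence has 26 words whose initials run from 'a' to 'z'."""
    # Strip surrounding punctuation and drop tokens that were punctuation only.
    words = [word.strip(string.punctuation) for word in sentence.split()]
    words = [word for word in words if word]
    if len(words) != len(string.ascii_lowercase):
        return False
    # Compare each word's first letter with the letter at the same position.
    return all(
        word[0].lower() == letter
        for word, letter in zip(words, string.ascii_lowercase)
    )
```

It could be called as `follows_alphabet(assistant_message)` once the response has been received.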
#### Tool Call

Let's solve some equations with a simple Python calculator tool.

```python
import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.1

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # NOTE: eval() executes arbitrary Python; acceptable for a local demo,
    # unsafe for untrusted input.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

# First request: the model decides which tools to call.
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    tools=tools,
    tool_choice="auto",
    reasoning_effort="none",
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
    else:
        # Keep one result per tool call so the zip below stays aligned.
        result = f"Tool '{function_name}' is not implemented in this example."
    results.append(result)

# Send the tool results back so the model can write the final answer.
messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="none",
)

print(response.choices[0].message.content)
```
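The calculator in the example above relies on `eval`, which executes whatever expression the model produces. Outside of a local demo, a restricted evaluator is a safer choice. The following is a minimal sketch using Python's standard `ast` module; `safe_calculator` is a hypothetical replacement and not part of the original example, and it assumes that only numeric literals and basic arithmetic operators are needed.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, attributes) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculator(expression: str) -> str:
    """Evaluate a plain arithmetic expression without using eval()."""

    def _eval(node: ast.AST):
        # Numeric literals only (bool is excluded although it subclasses int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return str(_eval(tree.body))
```

Because it has the same signature as `my_calculator`, it could be swapped in without changing the tool schema. Very large exponents can still be expensive to compute, so a production version would likely also bound the operands.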
#### Vision Reasoning

Let's see if Mistral Small 4 knows when to pick a fight!

```python
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7  # recommended temperature for reasoning_effort="high"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    reasoning_effort="high",
)

print(response.choices[0].message.content)
```
## Transformers

### Installation

You need to install the main branch of Transformers to use Mistral Small 4:

```bash
uv pip install git+https://github.com/huggingface/transformers.git
```

### Inference
#### Python Inference Snippet

```python
import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration

model_id = "mistralai/Mistral-Small-4-119B-2603"

processor = AutoProcessor.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

inputs = processor.apply_chat_template(messages, return_tensors="pt", tokenize=True, return_dict=True, reasoning_effort="high")
inputs = inputs.to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
)[0]

# Setting `skip_special_tokens=False` to visualize reasoning trace between [THINK] [/THINK] tags.
decoded_output = processor.decode(output[len(inputs["input_ids"][0]):], skip_special_tokens=False)
print(decoded_output)
```
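Since the decoded output keeps the reasoning trace between `[THINK]` and `[/THINK]` tags, it can be convenient to separate the trace from the final answer. The helper below is a minimal sketch under stated assumptions: `split_reasoning` is a hypothetical function, and it assumes that the trace appears at most once and is wrapped in exactly these two tags. Other special tokens, such as an end-of-sequence marker, are left in the answer untouched.

```python
from typing import Tuple

THINK_START = "[THINK]"
THINK_END = "[/THINK]"


def split_reasoning(decoded: str) -> Tuple[str, str]:
    """Split decoded text into (reasoning_trace, answer).

    Returns an empty trace when no well-formed [THINK] ... [/THINK] span is found.
    """
    start = decoded.find(THINK_START)
    end = decoded.find(THINK_END)
    if start == -1 or end == -1 or end < start:
        return "", decoded.strip()
    # Text between the tags is the trace; everything else is the answer.
    reasoning = decoded[start + len(THINK_START):end].strip()
    answer = (decoded[:start] + decoded[end + len(THINK_END):]).strip()
    return reasoning, answer
```

For example, `reasoning, answer = split_reasoning(decoded_output)` would give the two parts of the generation above.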
## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt).

*You must not use this model in a manner that infringes, misappropriates, or violates any third party’s rights, including intellectual property rights.*