Instructions for using hypaai/Hypa-Llama3.1-8b-SFT with libraries and local apps.

Libraries

How to use hypaai/Hypa-Llama3.1-8b-SFT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="hypaai/Hypa-Llama3.1-8b-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("hypaai/Hypa-Llama3.1-8b-SFT")
model = AutoModelForCausalLM.from_pretrained("hypaai/Hypa-Llama3.1-8b-SFT")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use hypaai/Hypa-Llama3.1-8b-SFT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hypaai/Hypa-Llama3.1-8b-SFT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hypaai/Hypa-Llama3.1-8b-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/hypaai/Hypa-Llama3.1-8b-SFT

SGLang

How to use hypaai/Hypa-Llama3.1-8b-SFT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "hypaai/Hypa-Llama3.1-8b-SFT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hypaai/Hypa-Llama3.1-8b-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "hypaai/Hypa-Llama3.1-8b-SFT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hypaai/Hypa-Llama3.1-8b-SFT",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use hypaai/Hypa-Llama3.1-8b-SFT with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hypaai/Hypa-Llama3.1-8b-SFT to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hypaai/Hypa-Llama3.1-8b-SFT to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for hypaai/Hypa-Llama3.1-8b-SFT to start chatting

Load model with FastModel

# Install Unsloth first (shell command): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="hypaai/Hypa-Llama3.1-8b-SFT",
    max_seq_length=2048,
)

Docker Model Runner
How to use hypaai/Hypa-Llama3.1-8b-SFT with Docker Model Runner:
```
docker model run hf.co/hypaai/Hypa-Llama3.1-8b-SFT
```

A multilingual, tool-aware fine-tune of Meta's Llama 3.1 8B for low-resource and underrepresented languages.

Model Description

Hypa-Llama3.1 8B (hypaai/Hypa-Llama3.1-8b-SFT) is a LoRA-merged supervised fine-tune from the Llama 3.1 8B family, produced by Hypa Intelligence. It is the Llama-flavored sibling of our Hypa-Gemma 4 family, trained on the same multilingual instruction corpus and shaped around the same product surface, so customers and the open-source community can pick the runtime that best fits their deployment without changing the underlying capability surface.

This release covers seventeen languages: English, French, Spanish, and fourteen languages of Nigeria. Several of the smaller languages in this set (including Annang, Ebira, Eggon, Idoma, Igala, Nupe, and Urhobo) have not been formally represented in large-scale fine-tuning corpora before, or had no settled ISO-style language tag at the time we needed one.

The model is intended for translation, language detection, dictionary-style explanation (Markdown and JSON output modes), multilingual instruction-following, and translation correction / breakdown via an explicit reasoning channel. Unlike many fine-tunes, this is an iterative SFT continuation from one of our prior Hypa-Llama checkpoints rather than a from-scratch run on Meta's base model — each successive Hypa-Llama release inherits the capabilities of its predecessor and layers new prompt families on top.

| Property | Value |
|---|---|
| Base model | `meta-llama/Llama-3.1-8B-Instruct` (continued from prior Hypa-Llama checkpoint) |
| Method | LoRA (r=256, α=256) via Unsloth + QLoRA, then merged to 16-bit |
| Trainable parameters | 671M / 8.7B (7.71%) |
| Training data | 17.0M examples across multilingual instruction sub-datasets |
| Compute | 1× NVIDIA GPU (Runpod), 10.9 days |
| Languages | 17 |
| Context window | 128K (config); 2,048 tokens during training |
| License | Apache 2.0 + Llama 3.1 Community License |

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "hypaai/Hypa-Llama3.1-8b-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Hypa Translate. Translate from English to Igbo. Return only the exact translation."},
    {"role": "user", "content": "Good morning, how are you today?"},
]

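inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,  # return input_ids and attention_mask together
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=1.0,
    top_p=0.95,
    top_k=30,
    min_p=0.1,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))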

For thinking mode, prepend the literal marker `<|think>` to your system prompt content (e.g. `"<|think>\nYou are Hypa Translate. Correct the below translation to Igbo."`). The model will emit a `<think>...</think>` reasoning block before its visible answer.
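Downstream code usually needs the reasoning block and the visible answer separately. The following is a minimal sketch of one way to do that, assuming the response contains at most one `<think>...</think>` block; the helper names `with_thinking` and `split_reasoning` are hypothetical and not part of any Hypa or Transformers API.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def with_thinking(system_prompt: str) -> str:
    """Prepend the literal <|think> marker, following the example above."""
    return "<|think>\n" + system_prompt


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no block is present."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Placeholder string for illustration only, not real model output.
reasoning, answer = split_reasoning("<think>draft notes</think>\nfinal text")
print(reasoning)
print(answer)
```

If the generation is cut off before `</think>` (for example by a low `max_new_tokens`), the pattern will not match and the whole text is returned as the answer, so callers may want to check for an unclosed `<think>` separately.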

For JSON dictionary mode, use the JSON-schema system prompts documented in the blog post and parse the assistant response directly.

For vLLM serving, the standard `vllm serve hypaai/Hypa-Llama3.1-8b-SFT` command should work without extra flags. If you hit deployment errors, see the blog post for the tokenizer-config compatibility steps.

Languages Covered

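| Code | Language | Code | Language |
|---|---|---|---|
| `en` | English | `ibb` | Ibibio |
| `ann` | Annang | `idm` | Idoma |
| `efi` | Efik | `igl` | Igala |
| `ebi` | Ebira | `ig` | Igbo |
| `ego` | Eggon | `nup` | Nupe |
| `es` | Spanish | `pg` | Pidgin |
| `fr` | French | `tiv` | Tiv |
| `ha` | Hausa | `urh` | Urhobo |
| `yo` | Yoruba | | |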

Some of the smaller languages in this set required custom or non-standard tags because no widely-adopted machine-readable code existed at the time of training. Where ISO 639-3 codes were available, we used them; where they were not, we documented our internal codes in the data release so downstream users can reproduce splits.
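For programmatic use, the codes in the table can be kept in a small lookup and used to build translation requests. This is a sketch only: the system prompt wording is copied from the Quick Start example, and applying the same wording to other language pairs is an assumption. `build_translate_messages` is a hypothetical helper, not a released API.

```python
# Codes and names copied from the table above.
LANGUAGES = {
    "en": "English", "ann": "Annang", "efi": "Efik", "ebi": "Ebira",
    "ego": "Eggon", "es": "Spanish", "fr": "French", "ha": "Hausa",
    "yo": "Yoruba", "ibb": "Ibibio", "idm": "Idoma", "igl": "Igala",
    "ig": "Igbo", "nup": "Nupe", "pg": "Pidgin", "tiv": "Tiv",
    "urh": "Urhobo",
}


def build_translate_messages(src: str, tgt: str, text: str) -> list[dict]:
    """Hypothetical helper: chat messages for one translation request."""
    for code in (src, tgt):
        if code not in LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    # Assumption: the Quick Start prompt wording generalizes to other pairs.
    system = (
        f"You are Hypa Translate. Translate from {LANGUAGES[src]} to "
        f"{LANGUAGES[tgt]}. Return only the exact translation."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]


messages = build_translate_messages("en", "ig", "Good morning, how are you today?")
print(messages[0]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` in the Quick Start.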

Training Data

Training data comprises 17.0 million examples assembled from a large multilingual text mixture combining internal Hypa datasets and public instruction-style corpora. The mixture is identical to the one used for our Hypa-Gemma 4 family, enabling clean capability parity across model families. The overall training mixture included dictionary-style data, translation data, language detection data, synthetic instruction data, structured-JSON output data, and chain-of-thought translation breakdown / correction data — each contributing a different signal.

A public 10k subset of the training data is released as hypaai/Hypa-Text-10k. Additional sub-datasets are progressively being released under the hypaai organization.

Prompt Formatting

Every example was formatted using Llama 3.1's native chat template, with explicit system, user, and assistant roles and the canonical Llama 3 control tokens (`<|begin_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`, `<|eot_id|>`, `<|end_of_text|>`). The reasoning channel was implemented via the literal markers `<|think>` (in the system prompt) and `<think>...</think>` (wrapping assistant reasoning). These are byte-pair-tokenized regular strings rather than added special tokens, which keeps the tokenizer canonical and avoids vocabulary surgery during serving.
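To make the layout concrete, the sketch below renders a conversation by hand using the control tokens listed above. It is a simplified illustration for inspecting prompts, not the shipped template: the tokenizer's own chat template (via `apply_chat_template`) is authoritative and may insert additional system-header content. `render_llama3_prompt` is a hypothetical name.

```python
def render_llama3_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Simplified manual rendering with the Llama 3 control tokens."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([
    # <|think> is plain text inside the system content, not a special token.
    {"role": "system", "content": "<|think>\nYou are Hypa Translate. Correct the below translation to Igbo."},
    {"role": "user", "content": "Good morning, how are you today?"},
])
print(prompt)
```

Comparing this output with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is a quick way to see exactly what the real template adds.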

Training Procedure

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 256 |
| LoRA alpha (α) | 256 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit base (NF4), bf16 compute |
| Optimizer | AdamW 8-bit |
| Learning rate | 1e-4 |
| LR schedule | cosine, 500 warmup steps |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Per-device batch size | 16 |
| Gradient accumulation | 2 |
| Effective batch size | 32 |
| Sequence length | 2048 |
| Packing | enabled |
| Epochs | 1 |
| Total steps | 532,418 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (Unsloth) |
| Hardware | 1× NVIDIA GPU (Runpod) |
| Runtime | 10.9 days (261h 50m) |
| Random seed | 3407 |
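The headline figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this card. Because packing is enabled, one batch row may hold more than one example, so the product of steps and batch size is a consistency check rather than an exact example count.

```python
per_device_batch = 16
grad_accum = 2
num_gpus = 1
total_steps = 532_418
runtime_hours = 261 + 50 / 60  # 261h 50m

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = per_device_batch * grad_accum * num_gpus
sequences_seen = effective_batch * total_steps
runtime_days = runtime_hours / 24

# Trainable share of parameters, from the figures in the summary table.
trainable_fraction = 671e6 / 8.7e9

print(effective_batch)                     # 32
print(f"{sequences_seen / 1e6:.1f}M")      # 17.0M
print(f"{runtime_days:.1f} days")          # 10.9 days
print(f"{100 * trainable_fraction:.2f}%")  # 7.71%
```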

Training was performed using Unsloth, which provides hand-tuned Triton kernels for the attention and MLP forward/backward passes and an "unsloth" gradient checkpointing variant that uses ~30% less VRAM than vanilla checkpointing.

Evaluation and Recommendations

Training metrics

Final training loss: 0.213 (smooth monotonic decay from 0.971)
Best evaluation loss: 0.330 (at end of training)
Final evaluation loss: 0.330

Honest note on training dynamics

Unlike our Hypa-Gemma 4 E2B run, this Llama 3.1 run showed clean, well-behaved training dynamics. Both training and validation loss decreased monotonically across the entire 532,418-step run. The train-eval gap widened mildly through step 240k (peaking at 0.152) and then narrowed back to 0.117 by end of training, the signature of a model still fitting the data distribution rather than memorizing it. A final eval:train loss ratio of 0.330:0.213 ≈ 1.55× is on the healthy side for instruction tuning at this scale.

For downstream use, we recommend the merged 16-bit weights in this repository. The final checkpoint is the best checkpoint by evaluation loss; there is no separate "best" intermediate to recover.

That said, the final ~50,000 steps of training (about the last 9% of the run) produced only ~0.6% of the total eval-loss improvement. With `EarlyStoppingCallback(early_stopping_patience=2)` configured against eval loss, together with a nonzero `early_stopping_threshold` (eval loss was still decreasing marginally, so a zero threshold would never trigger), training would likely have halted near step 480k–490k and saved approximately 25 hours of compute with negligible quality cost. We've queued this for the next run.
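The compute estimate can be reproduced from the run totals. This is a back-of-the-envelope sketch that assumes time per step was constant over the run, which the card does not state; `hours_saved` is a hypothetical helper.

```python
total_steps = 532_418
runtime_hours = 261 + 50 / 60  # 261h 50m


def hours_saved(stop_step: int) -> float:
    """Hours not spent if training halts at stop_step (constant step time assumed)."""
    return (total_steps - stop_step) / total_steps * runtime_hours


# The two ends of the halting range mentioned above.
for stop_step in (480_000, 490_000):
    print(stop_step, round(hours_saved(stop_step), 1))
```

Under that assumption the range works out to roughly 21 to 26 hours, which is consistent with the "approximately 25 hours" figure at the earlier end of the range.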

Qualitative observations

Internal qualitative review on translation and dictionary tasks shows meaningful improvements over the base Llama 3.1 8B-Instruct for every language in the set, with the largest deltas on the smallest languages (Annang, Efik, Ibibio, Eggon, Idoma, Igala, Nupe, Urhobo), where the base model was effectively unusable. Quantitative chrF++, BLEU, and BLEURT results across language pairs will follow in a separate evaluation post.

Intended Use

Direct use cases:

Translation between English / French / Spanish and the fourteen covered low-resource languages
Language detection across all seventeen languages
Dictionary-style lexical lookup and explanation (Markdown output)
Dictionary-style lexical lookup with strict JSON schema (programmatic use)
Translation correction and chain-of-thought translation breakdown (via the `<|think>` reasoning channel)
Multilingual instruction-following on dialogue tasks
Tool-aware / function-calling-style prompting (inheriting Llama 3.1's native tool-call structure)

Downstream use:

Suitable as a starting point for further fine-tuning on more specialized tasks within the supported languages
Suitable for adapter stacking (e.g., domain-specific LoRA on top)
Drop-in replacement for meta-llama/Llama-3.1-8B-Instruct in any text-generation pipeline that needs improved low-resource language quality

Out-of-Scope and Limitations

Not safety-tuned for sensitive domains. This model has not undergone RLHF or DPO post-training beyond the SFT in this run. It should not be used unsupervised for medical, legal, financial, or psychological-counseling applications.
Quality varies by language. The smallest languages in the set are underrepresented even within our training mix and the resulting model output should be reviewed by native speakers before being used in production.
Training context is 2,048 tokens. The model's config advertises a 131,072-token context window (inherited from Llama 3.1), but quality past 2,048 tokens is bounded by the training distribution and has not been validated for the target languages.
Tokenization quality. Llama 3's 128k-vocabulary BPE tokenizer has broader coverage than smaller-vocabulary tokenizers, but the smallest languages in this release still cost more tokens per character than English. This is a gap we expect future iterations to close, including potential vocabulary extension.
JSON output reliability. Although we trained extensively on the JSON output schema, rare prompts occasionally produce minor schema deviations (extra whitespace, optional-field ordering). Production use of JSON mode should wrap responses in a permissive parser with single-attempt repair.
Coverage is finite. The seventeen languages in this release are the start, not the end. Many other underrepresented languages are not yet supported and may produce unreliable output.
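The "permissive parser with single-attempt repair" recommended for JSON mode could look like the sketch below. It is an illustration under stated assumptions, not the parser used by Hypa: it assumes the response contains one JSON object, and the only repair it tries is removing trailing commas. The field names in the usage example (`word`, `senses`) are hypothetical, since the actual schema is documented in the blog post rather than here.

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Parse a JSON object from model output, with one repair attempt."""
    # Keep only the outermost {...} span, dropping code fences or stray prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Single repair attempt: remove trailing commas before } or ].
        # Caveat: this regex can also touch such sequences inside string values.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)  # raises JSONDecodeError if still invalid


# Hypothetical response shape, for illustration only.
raw = '```json\n{"word": "x", "senses": ["a", "b",],}\n```'
print(parse_json_response(raw))
```

Extra whitespace and field ordering, the deviations named above, are already tolerated by `json.loads`. The repair step only matters for malformed output, and a second failure is surfaced to the caller rather than retried.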

Bias, Risks, and Limitations

This model inherits the biases and limitations of its base model (Meta Llama 3.1 8B) and adds the biases of its fine-tuning corpus, which is weighted toward dictionary, religious-parallel, and CommonVoice text. Religious-parallel text in particular is a known cause of register and content bias in low-resource translation models. Users deploying this model in customer-facing applications should evaluate output for cultural appropriateness in their specific use case and language.

The model is not intended to make decisions affecting people's rights, health, finances, or wellbeing. Like all language models, it can produce confident-sounding output that is incorrect, particularly on the smallest languages where training data was thinnest.

Released Artifacts

🤗 Merged 16-bit model (this repo): hypaai/Hypa-Llama3.1-8b-SFT
🤗 LoRA adapter checkpoints: hypaai/Hypa-Llama3.1-8b-SFT-LoRAs
📊 TensorBoard metrics: view on HF
📦 Public training data subset: hypaai/Hypa-Text-10k
💻 GitHub repository: hypaai/Hypa-Llama
📝 Blog post: Tuning Llama 3.1 for multilingual dictionary, translation, and tool-aware language understanding

Citation

If you use Hypa-Llama3.1 8B or any of the related work, please cite:

@misc{hypaai2026hypallama318b,
  title        = {Hypa-Llama3.1 8B: A Multilingual Fine-Tune of Llama 3.1 for Underrepresented Languages},
  author       = {{Hypa Intelligence}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hypaai/Hypa-Llama3.1-8b-SFT}},
  note         = {Apache 2.0 + Llama 3.1 Community License. Blog: \url{https://hypa-intelligence.hashnode.dev/tuning-llama-3-1-for-multilingual-dictionary-translation-and-tool-aware-language-understanding}}
}

License

Released under the Apache License 2.0. As a derivative of Meta's Llama 3.1, this model is additionally subject to the Llama 3.1 Community License. It is free to use, modify, and redistribute for both research and commercial purposes under the combined terms of both licenses.

Acknowledgments

Meta AI for releasing Llama 3.1 openly and enabling this line of research.
Unsloth for the hand-tuned training kernels that made an 11-day, 17M-example single-GPU run practical.
Runpod for reliable GPU infrastructure.
The language communities, speakers, and reviewers whose texts, voices, and feedback grounded this work and keep it honest.

Hypa Intelligence • Website • Hugging Face • GitHub • Blog

Multilingualism is not a feature. It is a prerequisite for AI that represents all of us.