Hypa-Llama3.1 8B

A multilingual, tool-aware fine-tune of Meta's Llama 3.1 8B for low-resource and underrepresented languages.

License: Apache 2.0 | Base: Llama 3.1 8B | GitHub: Hypa-Llama | Blog Post | Trained with Unsloth

Model Description

Hypa-Llama3.1 8B (hypaai/Hypa-Llama3.1-8b-SFT) is a LoRA-merged supervised fine-tune from the Llama 3.1 8B family, produced by Hypa Intelligence. It is the Llama-flavored sibling of our Hypa-Gemma 4 family, trained on the same multilingual instruction corpus and built for the same product use cases, so customers and the open-source community can pick the runtime that best fits their deployment without giving up capabilities.

This release covers seventeen languages: English, French, Spanish, and fourteen languages of Nigeria. Several of the smaller languages in this set (including Annang, Ebira, Eggon, Idoma, Igala, Nupe, and Urhobo) have not been formally represented in large-scale fine-tuning corpora before, or had no settled ISO-style language tag at the time we needed one.

The model is intended for translation, language detection, dictionary-style explanation (Markdown and JSON output modes), multilingual instruction-following, and translation correction / breakdown via an explicit reasoning channel. Unlike many fine-tunes, this is an iterative SFT continuation from one of our prior Hypa-Llama checkpoints rather than a from-scratch run on Meta's base model — each successive Hypa-Llama release inherits the capabilities of its predecessor and layers new prompt families on top.

| Property | Value |
| --- | --- |
| Base model | meta-llama/Llama-3.1-8B-Instruct (continued from prior Hypa-Llama checkpoint) |
| Method | LoRA (r=256, α=256) via Unsloth + QLoRA, then merged to 16-bit |
| Trainable parameters | 671M / 8.7B (7.71%) |
| Training data | 17.0M examples across multilingual instruction sub-datasets |
| Compute | 1× NVIDIA GPU (Runpod), 10.9 days |
| Languages | 17 |
| Context window | 128K (config); 2,048 tokens during training |
| License | Apache 2.0 + Llama 3.1 Community License |

Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "hypaai/Hypa-Llama3.1-8b-SFT"

# Load the merged 16-bit weights in bfloat16 and place them on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The system prompt selects the task; the user turn carries the text to translate.
messages = [
    {"role": "system", "content": "You are Hypa Translate. Translate from English to Igbo. Return only the exact translation."},
    {"role": "user", "content": "Good morning, how are you today?"},
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=1.0,
    top_p=0.95,
    top_k=30,
    min_p=0.1,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

For thinking mode, prepend the literal marker <|think> to your system prompt content (e.g. "<|think>\nYou are Hypa Translate. Correct the below translation to Igbo."). The model will emit a <think>...</think> reasoning block before its visible answer.
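Because the reasoning markers are ordinary text, they survive decoding with skip_special_tokens=True and have to be stripped by the caller. Below is a minimal sketch of one way to do that, assuming the response contains at most one leading <think>...</think> block. The helper name split_reasoning is hypothetical and not part of any Hypa or Transformers API.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets the reasoning span lines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no block is present."""
    match = _THINK_RE.match(response)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()


reasoning, answer = split_reasoning("<think>\nstep one\n</think>\n\nfinal answer")
print(repr(reasoning), repr(answer))
```

Keeping the two parts separate lets an application log the reasoning while showing only the answer to end users.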

For JSON dictionary mode, use the JSON-schema system prompts documented in the blog post and parse the assistant response directly.

For vLLM serving, the standard vllm serve hypaai/Hypa-Llama3.1-8b-SFT command should work without extra flags. If you hit deployment errors, see the blog post for the tokenizer-config compatibility steps.
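Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server runs locally on vLLM's default port 8000. The helper names build_chat_payload and chat are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "hypaai/Hypa-Llama3.1-8b-SFT"
# Assumption: `vllm serve` is running on this machine with its default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_payload(system_prompt: str, user_text: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }


def chat(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "You are Hypa Translate. Translate from English to Igbo. Return only the exact translation.",
    "Good morning, how are you today?",
)
# print(chat(payload))  # uncomment once the server is running
```

The sampling values mirror the Transformers example. Whether extra fields such as top_k and min_p are accepted in the request body depends on your vLLM version, so check its documentation before adding them.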

Languages Covered

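| Code | Language | Code | Language |
| --- | --- | --- | --- |
| en | English | ibb | Ibibio |
| ann | Annang | idm | Idoma |
| efi | Efik | igl | Igala |
| ebi | Ebira | ig | Igbo |
| ego | Eggon | nup | Nupe |
| es | Spanish | pg | Pidgin |
| fr | French | tiv | Tiv |
| ha | Hausa | urh | Urhobo |
| yo | Yoruba | | |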

Some of the smaller languages in this set required custom or non-standard tags because no widely-adopted machine-readable code existed at the time of training. Where ISO 639-3 codes were available, we used them; where they were not, we documented our internal codes in the data release so downstream users can reproduce splits.
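For programmatic use, the code table can be kept as a lookup and used to build translation prompts. This is a sketch only: the prompt wording matches the English-to-Igbo Quick Start example, and reusing it for other language pairs is an assumption, since the blog post is the reference for the exact prompt families. The helper name translation_system_prompt is hypothetical.

```python
# Codes and names copied from the table above; several codes are Hypa-internal
# rather than standard ISO 639 tags.
LANGUAGES = {
    "en": "English", "ann": "Annang", "efi": "Efik", "ebi": "Ebira",
    "ego": "Eggon", "es": "Spanish", "fr": "French", "ha": "Hausa",
    "yo": "Yoruba", "ibb": "Ibibio", "idm": "Idoma", "igl": "Igala",
    "ig": "Igbo", "nup": "Nupe", "pg": "Pidgin", "tiv": "Tiv", "urh": "Urhobo",
}


def translation_system_prompt(source: str, target: str) -> str:
    """Build a translation system prompt for a pair of language codes."""
    for code in (source, target):
        if code not in LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    return (
        f"You are Hypa Translate. Translate from {LANGUAGES[source]} "
        f"to {LANGUAGES[target]}. Return only the exact translation."
    )


print(translation_system_prompt("en", "ig"))
```

Rejecting unknown codes early avoids sending the model a language it was not trained on.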

Training Data

Training data comprises 17.0 million examples assembled from a large multilingual text mixture combining internal Hypa datasets and public instruction-style corpora. The mixture is identical to the one used for our Hypa-Gemma 4 family, enabling clean capability parity across model families. It spans six data families: dictionary-style data, translation data, language-detection data, synthetic instruction data, structured-JSON output data, and chain-of-thought translation breakdown / correction data.

A public 10k subset of the training data is released as hypaai/Hypa-Text-10k. Additional sub-datasets are progressively being released under the hypaai organization.

Prompt Formatting

Every example was formatted using Llama 3.1's native chat template, with explicit system, user, and assistant roles and the canonical Llama 3 control tokens (<|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>, <|end_of_text|>). The reasoning channel was implemented via the literal markers <|think> (in the system prompt) and <think>...</think> (wrapping assistant reasoning) — these are byte-pair-tokenized regular strings rather than added special tokens, which keeps the tokenizer canonical and avoids vocabulary surgery during serving.

Loss was computed only on assistant turns via train_on_responses_only with instruction_part="<|start_header_id|>user<|end_header_id|>\n\n" and response_part="<|start_header_id|>assistant<|end_header_id|>\n\n".
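The idea behind response-only loss can be illustrated without any training library. The sketch below is not Unsloth's implementation. It uses toy integer token ids, with hypothetical marker sequences standing in for the tokenized user and assistant headers, and sets every label outside an assistant turn to -100 so the loss ignores it.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_all(sequence, pattern):
    """Return every start index at which pattern occurs in sequence."""
    width = len(pattern)
    return [i for i in range(len(sequence) - width + 1) if sequence[i:i + width] == pattern]


def mask_non_assistant(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The assistant span ends at the next user header, or at the end of the sequence.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids: [1, 2] stands in for the user header, [1, 3] for the assistant header.
USER, ASSISTANT = [1, 2], [1, 3]
ids = USER + [5] + ASSISTANT + [6, 7] + USER + [5] + ASSISTANT + [8]
print(mask_non_assistant(ids, USER, ASSISTANT))
```

In this two-turn toy sequence only the ids 6, 7 and 8 keep their labels, so system and user text contribute nothing to the gradient.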

Training Procedure

| Hyperparameter | Value |
| --- | --- |
| LoRA rank (r) | 256 |
| LoRA alpha (α) | 256 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit base (NF4), bf16 compute |
| Optimizer | AdamW 8-bit |
| Learning rate | 1e-4 |
| LR schedule | cosine, 500 warmup steps |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Per-device batch size | 16 |
| Gradient accumulation | 2 |
| Effective batch size | 32 |
| Sequence length | 2048 |
| Packing | enabled |
| Epochs | 1 |
| Total steps | 532,418 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (Unsloth) |
| Hardware | 1× NVIDIA GPU (Runpod) |
| Runtime | 10.9 days (261h 50m) |
| Random seed | 3407 |
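The trainable-parameter figure reported for this model can be reproduced from the rank and target modules above. The sketch below assumes the publicly documented Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) and that the percentage is taken against base plus adapter parameters. Both are assumptions, not statements from the training logs.

```python
# Llama 3.1 8B dimensions, taken from the public model config (assumed here).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # MLP intermediate_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
LAYERS = 32
RANK = 256             # LoRA r from the hyperparameter table
BASE_PARAMS = 8_030_261_248  # parameter count of the base model

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * LAYERS
share = trainable / (BASE_PARAMS + trainable)
print(f"{trainable:,} trainable parameters ({share:.2%} of all parameters)")
```

Under these assumptions the count comes to 671,088,640 adapter parameters, which matches the 671M / 8.7B (7.71%) figure in the model description.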

Training was performed using Unsloth, which provides hand-tuned Triton kernels for the attention and MLP forward/backward passes and an "unsloth" gradient checkpointing variant that uses ~30% less VRAM than vanilla checkpointing.

Evaluation and Recommendations

Training metrics

  • Final training loss: 0.213 (smooth monotonic decay from 0.971)
  • Best evaluation loss: 0.330 (at end of training)
  • Final evaluation loss: 0.330

Honest note on training dynamics

Unlike our Hypa-Gemma 4 E2B run, this Llama 3.1 run showed clean, well-behaved training dynamics. Both training and validation loss decreased monotonically across the entire 532,418-step run. The train-eval gap widened mildly through step 240k (peaking at 0.152) and then narrowed to 0.117 by the end of training (0.330 - 0.213), the signature of a model still fitting the data distribution rather than memorizing it. A final eval-to-train loss ratio of 0.330 / 0.213 ≈ 1.55× is on the healthy side for instruction tuning at this scale.

For downstream use, we recommend the merged 16-bit weights in this repository. The final checkpoint is the best checkpoint by evaluation loss; there is no separate "best" intermediate to recover.

That said, the final ~50,000 steps of training (about 9% of the run) produced only ~0.6% of the total eval-loss improvement. With EarlyStoppingCallback(early_stopping_patience=2) configured against eval loss, training would have halted near step 480k–490k and saved approximately 25 hours of compute at negligible quality cost. We've queued this for the next run.
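The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes a constant time per step across the run, which is a simplification, and the helper name hours_saved is hypothetical.

```python
TOTAL_STEPS = 532_418
RUNTIME_HOURS = 261 + 50 / 60        # 261h 50m from the training table
TRAIN_LOSS, EVAL_LOSS = 0.213, 0.330

gap = EVAL_LOSS - TRAIN_LOSS         # final train-eval gap
ratio = EVAL_LOSS / TRAIN_LOSS       # final eval-to-train loss ratio
hours_per_step = RUNTIME_HOURS / TOTAL_STEPS


def hours_saved(stop_step: int) -> float:
    """Compute hours not spent if training had stopped at stop_step."""
    return (TOTAL_STEPS - stop_step) * hours_per_step


print(f"gap={gap:.3f}, ratio={ratio:.2f}x")
for stop in (480_000, 490_000):
    print(f"stopping at step {stop:,} saves about {hours_saved(stop):.1f} h")
```

Under the constant-step-time assumption, stopping between steps 480k and 490k saves roughly 21 to 26 hours, in line with the estimate above.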

Qualitative observations

Internal qualitative review on translation and dictionary tasks shows meaningful improvements over the base Llama 3.1 8B-Instruct for every language in the set, with the largest deltas on the smallest languages (Annang, Efik, Ibibio, Eggon, Idoma, Igala, Nupe, Urhobo), where the base model was effectively unusable. Quantitative chrF++, BLEU, and BLEURT results across language pairs will follow in a separate evaluation post.

Intended Use

Direct use cases:

  • Translation between English / French / Spanish and the fourteen covered low-resource languages
  • Language detection across all seventeen languages
  • Dictionary-style lexical lookup and explanation (Markdown output)
  • Dictionary-style lexical lookup with strict JSON schema (programmatic use)
  • Translation correction and chain-of-thought translation breakdown (via the <|think> reasoning channel)
  • Multilingual instruction-following on dialogue tasks
  • Tool-aware / function-calling-style prompting (inheriting Llama 3.1's native tool-call structure)

Downstream use:

  • Suitable as a starting point for further fine-tuning on more specialized tasks within the supported languages
  • Suitable for adapter stacking (e.g., domain-specific LoRA on top)
  • Drop-in replacement for meta-llama/Llama-3.1-8B-Instruct in any text-generation pipeline that needs improved low-resource language quality

Out-of-Scope and Limitations

  • Not safety-tuned for sensitive domains. This model has not undergone RLHF or DPO post-training beyond the SFT in this run. It should not be used unsupervised for medical, legal, financial, or psychological-counseling applications.
  • Quality varies by language. The smallest languages in the set are underrepresented even within our training mix, and model output in those languages should be reviewed by native speakers before being used in production.
  • Training context is 2,048 tokens. The model's config advertises a 131,072-token context window (inherited from Llama 3.1), but quality past 2,048 tokens is bounded by the training distribution and has not been validated for the target languages.
  • Tokenization quality. Llama 3's 128k-vocabulary BPE tokenizer has broader coverage than smaller-vocabulary tokenizers, but the smallest languages in this release still need more tokens per character than English. This is a gap we expect future iterations to close, including through potential vocabulary extension.
  • JSON output reliability. Although we trained extensively on the JSON output schema, prompts occasionally produce minor schema deviations (extra whitespace, optional-field ordering). Production use of JSON mode should wrap responses in a permissive parser with single-attempt repair.
  • Coverage is finite. The seventeen languages in this release are the start, not the end. Many other underrepresented languages are not yet supported and may produce unreliable output.
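A permissive parser with single-attempt repair, as recommended for JSON mode above, could look like the sketch below. The repairs it handles (code fences, surrounding prose, trailing commas) are assumptions about common model output problems rather than behaviours reported for this model, and the helper name parse_json_permissive is hypothetical.

```python
import json
import re


def parse_json_permissive(text: str):
    """Parse a model response as JSON, with one repair attempt on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Single repair attempt: keep only the outermost {...} span, which drops
    # code fences or stray prose, then remove trailing commas before } or ].
    # Caveat: the comma rule would also alter a string value containing ",}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    candidate = re.sub(r",\s*([}\]])", r"\1", text[start:end + 1])
    return json.loads(candidate)  # a second failure propagates to the caller


print(parse_json_permissive('Here you go: {"word": "a", "senses": [1, 2,],}'))
```

Limiting repair to one attempt keeps failures visible: a response that still does not parse raises an error instead of being silently reshaped.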

Bias, Risks, and Limitations

This model inherits the biases and limitations of its base model (Meta Llama 3.1 8B) and adds the biases of its fine-tuning corpus, which is weighted toward dictionary, religious-parallel, and CommonVoice text. Religious-parallel text in particular is a known cause of register and content bias in low-resource translation models. Users deploying this model in customer-facing applications should evaluate output for cultural appropriateness in their specific use case and language.

The model is not intended to make decisions affecting people's rights, health, finances, or wellbeing. Like all language models, it can produce confident-sounding output that is incorrect, particularly on the smallest languages where training data was thinnest.

Released Artifacts

Citation

If you use Hypa-Llama3.1 8B or any of the related work, please cite:

@misc{hypaai2026hypallama318b,
  title        = {Hypa-Llama3.1 8B: A Multilingual Fine-Tune of Llama 3.1 for Underrepresented Languages},
  author       = {{Hypa Intelligence}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hypaai/Hypa-Llama3.1-8b-SFT}},
  note         = {Apache 2.0 + Llama 3.1 Community License. Blog: \url{https://hypa-intelligence.hashnode.dev/tuning-llama-3-1-for-multilingual-dictionary-translation-and-tool-aware-language-understanding}}
}

License

This model is released under the Apache License 2.0. As a derivative of Meta's Llama 3.1, it is additionally subject to the Llama 3.1 Community License. It is free to use, modify, and redistribute for both research and commercial purposes under the combined terms of both licenses.

Acknowledgments

  • Meta AI for releasing Llama 3.1 openly and enabling this line of research.
  • Unsloth for the hand-tuned training kernels that made an 11-day, 17M-example single-GPU run practical.
  • Runpod for reliable GPU infrastructure.
  • The language communities, speakers, and reviewers whose texts, voices, and feedback grounded this work and keep it honest.

Hypa Intelligence | Website | Hugging Face | GitHub | Blog

Multilingualism is not a feature. It is a prerequisite for AI that represents all of us.
