| pipeline_tag: fill-mask | |
| language: cym | |
| license: mit | |
| tags: | |
| - trimmed | |
| library_name: transformers | |
| base_model: jhu-clsp/mmBERT-small | |
| base_model_relation: quantized | |
| datasets: | |
| - lbourdois/fineweb-2-trimming | |
| # mmBERT-small-cym-32768 | |
| This model is a **61.01% smaller** version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small), optimized for the **Welsh** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method. | |
| This trimmed model should perform similarly to the original model on Welsh text while using only 32,768 tokens and a much smaller memory footprint. However, it may perform poorly on other languages, because tokens not commonly used in Welsh were removed from the vocabulary. | |
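To make the idea of removing rarely used tokens concrete, here is a toy sketch of frequency-based vocabulary trimming. It is an illustration under stated assumptions, not the implementation from the trimming blog post: the function names `select_kept_ids` and `trim_embeddings` are hypothetical, ranking tokens by raw frequency in the mining corpus is an assumption, and the embeddings are plain Python lists rather than real model weights.

```python
from collections import Counter


def select_kept_ids(tokenized_texts, keep, special_ids=()):
    """Pick which original token ids survive trimming.

    Special ids are always kept; the remaining slots go to the most
    frequent ids in the mining corpus (ties broken by lower id).
    """
    counts = Counter(tid for text in tokenized_texts for tid in text)
    kept = list(dict.fromkeys(special_ids))  # de-duplicate, keep order
    seen = set(kept)
    for tid in sorted(counts, key=lambda t: (-counts[t], t)):
        if len(kept) >= keep:
            break
        if tid not in seen:
            kept.append(tid)
            seen.add(tid)
    return kept


def trim_embeddings(embeddings, kept_ids):
    """Keep only the embedding rows for `kept_ids` and build an id remapping."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    trimmed = [embeddings[old] for old in kept_ids]
    return trimmed, old_to_new


# Toy data (hypothetical): a 10-token vocabulary with 2-dimensional embeddings.
embeddings = [[float(i), float(i)] for i in range(10)]
corpus = [[5, 5, 5, 2], [2, 7, 5], [9]]

kept_ids = select_kept_ids(corpus, keep=3, special_ids=(0,))
trimmed, old_to_new = trim_embeddings(embeddings, kept_ids)
print(kept_ids)    # [0, 5, 2]
print(old_to_new)  # {0: 0, 5: 1, 2: 2}
```

Only the embedding rows for surviving tokens are kept, so the size reduction comes from the embedding table, and any token outside the kept set can no longer be represented directly. That is why other languages may suffer.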
| ## Model Statistics | |
| | Metric | Original | Trimmed | Reduction | | |
| |--------|----------|---------|-----------| | |
| | **Vocabulary size** | 256,000 tokens | 32,768 tokens | **87.20%** | | |
| | **Model size** | 140,493,696 params | 54,772,608 params | **61.01%** | | |
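The reduction percentages in the table can be checked directly from the counts it reports. The short calculation below uses only those four numbers. The last line is an inference, not something the card states: dividing the removed parameters by the removed tokens gives a whole number, which would be consistent with each dropped token accounting for one embedding row of that width.

```python
# Counts copied from the Model Statistics table.
orig_vocab, trimmed_vocab = 256_000, 32_768
orig_params, trimmed_params = 140_493_696, 54_772_608

# Relative reductions, in percent.
vocab_reduction = 100 * (orig_vocab - trimmed_vocab) / orig_vocab
param_reduction = 100 * (orig_params - trimmed_params) / orig_params

# Parameters removed per removed token (inferred, see the note above).
params_per_token = (orig_params - trimmed_params) / (orig_vocab - trimmed_vocab)

print(f"{vocab_reduction:.2f}%")  # 87.20%
print(f"{param_reduction:.2f}%")  # 61.01%
print(params_per_token)           # 384.0
```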
|  | |
| ## Mining Dataset Statistics | |
| - **Number of texts used for mining**: 200,000 texts | |
| - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming) | |
| ## Usage | |
| ```python | |
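| from transformers import AutoModel, AutoTokenizer | |
| model_name = "alphaedge-ai/mmBERT-small-cym-32768" | |
| # AutoModel loads the bare encoder (hidden states only, no masked-LM head) | |
| model = AutoModel.from_pretrained(model_name) | |
| # The tokenizer uses the trimmed 32,768-token vocabulary | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |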
| ``` | |
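Since the card's `pipeline_tag` is `fill-mask`, a masked-word prediction is a natural first test. The sketch below is a minimal example, not an official recipe: `mask_word` and `predict_masked` are hypothetical helper names, the Welsh sentence is an illustrative input, and the mask token is read from the tokenizer rather than hard-coded. Calling `predict_masked` downloads the model, so the call is left commented out.

```python
MODEL_NAME = "alphaedge-ai/mmBERT-small-cym-32768"


def mask_word(sentence, word, mask_token):
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def predict_masked(sentence, word, top_k=5):
    """Return (token, score) pairs the model proposes for the masked word."""
    # Imported lazily so mask_word can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("fill-mask", model=MODEL_NAME)
    masked = mask_word(sentence, word, pipe.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in pipe(masked, top_k=top_k)]


# Example call (downloads the model on first use):
# for token, score in predict_masked("Caerdydd yw prifddinas Cymru.", "Cymru"):
#     print(f"{token!r}: {score:.3f}")
```

The `fill-mask` pipeline loads the model with its masked-language-modeling head, whereas `AutoModel` returns only the encoder, which suits embedding extraction better than token prediction.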
| ## Citations | |
| #### mmBERT | |
| ``` | |
| @misc{marone2025mmbertmodernmultilingualencoder, | |
| title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning}, | |
| author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme}, | |
| year={2025}, | |
| eprint={2509.06888}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2509.06888}, | |
| } | |
| ``` | |
| #### Trimming blog post | |
| ``` | |
| @misc{hf_blogpost_trimming, | |
| title={Introduction to Trimming}, | |
| author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI}, | |
| year={2026}, | |
| url={https://huggingface.co/blog/lbourdois/introduction-to-trimming}, | |
| } | |
| ``` |