---
pipeline_tag: fill-mask
language: cym
license: mit
tags:
- trimmed
library_name: transformers
base_model: jhu-clsp/mmBERT-small
base_model_relation: quantized
datasets:
- lbourdois/fineweb-2-trimming
---
# mmBERT-small-cym-32768
This model is a **61.01% smaller** version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small), optimized for the **Welsh** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
With a vocabulary of only 32,768 tokens, this trimmed model should perform similarly to the original on Welsh text while having a much smaller memory footprint. However, it may not perform well on other languages, because tokens not commonly used in Welsh were removed from the vocabulary.
## Model Statistics
| Metric | Original | Trimmed | Reduction |
|--------|----------|---------|-----------|
| **Vocabulary size** | 256,000 tokens | 32,768 tokens | **87.20%** |
| **Model size** | 140,493,696 params | 54,772,608 params | **61.01%** |
![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mmBERT-small-32768.png)
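The reduction percentages in the table can be re-derived from the raw counts. The short check below uses only the four numbers reported above; the helper name `reduction_pct` is made up for this illustration. It also divides the removed parameters by the removed tokens, which comes out to exactly 384 parameters per token. That is consistent with each removed token costing a single embedding row of width 384, although this reading is an inference from the numbers rather than something the card states.

```python
# Figures copied from the Model Statistics table.
ORIGINAL_VOCAB = 256_000
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 140_493_696
TRIMMED_PARAMS = 54_772_608


def reduction_pct(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, in percent (2 decimals)."""
    return round(100 * (before - after) / before, 2)


removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB    # 223,232 tokens dropped
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS  # 85,721,088 parameters dropped

# Parameters saved per removed token; a zero remainder means the whole
# saving is an exact multiple of the number of removed tokens.
params_per_token, remainder = divmod(removed_params, removed_tokens)

print(reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB))    # 87.2
print(reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS))  # 61.01
print(params_per_token, remainder)                     # 384 0
```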
## Mining Dataset Statistics
- **Number of texts used for mining**: 200,000
- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
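To make the idea of "mining" concrete, here is a minimal sketch of frequency-based vocabulary selection on toy data. It assumes that trimming keeps the special tokens plus the most frequent token ids observed in the mining texts; the function `select_vocabulary`, the toy ids, and the choice of special ids are all hypothetical, and the actual procedure described in the trimming blog post may differ in its details.

```python
from collections import Counter


def select_vocabulary(tokenized_texts, target_size, special_ids):
    """Return an old-id -> new-id mapping for the tokens that are kept.

    Hypothetical illustration: special tokens are always kept, then the
    most frequent token ids fill the vocabulary up to `target_size`.
    """
    counts = Counter(tok for text in tokenized_texts for tok in text)

    kept = list(dict.fromkeys(special_ids))  # de-duplicate, keep order
    seen = set(kept)
    for tok, _ in counts.most_common():
        if len(kept) >= target_size:
            break
        if tok not in seen:
            kept.append(tok)
            seen.add(tok)

    # New ids are assigned in the order tokens were kept.
    return {old: new for new, old in enumerate(kept)}


# Toy "mining corpus": each inner list is one text, already tokenized to ids.
toy_corpus = [[5, 9, 9, 3], [9, 5, 7], [5, 9, 2]]
mapping = select_vocabulary(toy_corpus, target_size=4, special_ids=[0, 1])
print(mapping)  # {0: 0, 1: 1, 9: 2, 5: 3}
```

In a real setting the rows of the embedding matrix would then be sliced according to the kept ids, which is where the parameter savings come from.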
## Usage
```python
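from transformers import AutoModel, AutoTokenizer

model_name = "alphaedge-ai/mmBERT-small-cym-32768"

# AutoModel loads the bare encoder, without a masked-language-modeling head.
model = AutoModel.from_pretrained(model_name)
# The tokenizer shipped with the model uses the trimmed 32,768-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained(model_name)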
```
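Since the card's pipeline tag is `fill-mask`, a masked-token prediction example may also be useful. The sketch below relies on the standard `transformers` `pipeline("fill-mask", ...)` API; the helper names `insert_mask` and `top_predictions` and the `{mask}` placeholder convention are assumptions made for this illustration. The mask token is read from the tokenizer rather than hard-coded.

```python
MODEL_NAME = "alphaedge-ai/mmBERT-small-cym-32768"


def insert_mask(template: str, mask_token: str) -> str:
    """Replace the '{mask}' placeholder with the tokenizer's mask token."""
    return template.format(mask=mask_token)


def top_predictions(template: str, model_name: str = MODEL_NAME, top_k: int = 5):
    """Return (token, score) pairs for the masked position in `template`."""
    # Imported lazily so that insert_mask can be used without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    text = insert_mask(template, fill_mask.tokenizer.mask_token)
    return [(pred["token_str"], pred["score"]) for pred in fill_mask(text, top_k=top_k)]
```

For example, `top_predictions("{mask} yw prifddinas Cymru.")` asks the model to complete the Welsh sentence "{mask} is the capital of Wales." The first call downloads the model weights, and the predictions depend on the model, so no expected output is shown here.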
## Citations
#### mmBERT
```
@misc{marone2025mmbertmodernmultilingualencoder,
title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2509.06888},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.06888},
}
```
#### Trimming blog post
```
@misc{hf_blogpost_trimming,
title={Introduction to Trimming},
author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
year={2026},
url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
}
```