	---
	pipeline_tag: fill-mask
	language: cym
	license: mit
	tags:
	- trimmed
	library_name: transformers
	base_model: jhu-clsp/mmBERT-small
	base_model_relation: quantized
	datasets:
	- lbourdois/fineweb-2-trimming
	---

	# mmBERT-small-cym-32768
	This model is a trimmed version of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) with 61.01% fewer parameters, optimized for Welsh by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
	The trimmed model keeps only 32,768 tokens and has a much smaller memory footprint, and it should perform similarly to the original model on Welsh. It may not perform well on other languages, because tokens not commonly used in Welsh were removed from the vocabulary.

	## Model Statistics
	| Metric | Original | Trimmed | Reduction |
	|--------|----------|---------|-----------|
	| Vocabulary size | 256,000 tokens | 32,768 tokens | 87.20% |
	| Model size | 140,493,696 params | 54,772,608 params | 61.01% |

	![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/mmBERT-small-32768.png)
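
The reduction figures in the table above can be re-derived from the raw counts. The short check below is only a sketch: it recomputes both percentages and, assuming every removed parameter is a row of the token embedding matrix (an assumption, not something the card states), it also reports how many parameters each removed token accounted for.

```python
# Figures copied from the Model Statistics table.
ORIGINAL_VOCAB = 256_000
TRIMMED_VOCAB = 32_768
ORIGINAL_PARAMS = 140_493_696
TRIMMED_PARAMS = 54_772_608


def reduction_pct(original: int, trimmed: int) -> float:
    """Relative reduction, as a percentage of the original value."""
    return 100.0 * (original - trimmed) / original


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)

removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS

# Assumption: all removed parameters are embedding rows, so this ratio
# would be the embedding width of the model.
params_per_removed_token = removed_params / removed_tokens

print(f"Vocabulary reduction: {vocab_reduction:.2f}%")
print(f"Parameter reduction:  {param_reduction:.2f}%")
print(f"Parameters per removed token: {params_per_removed_token}")
```

The ratio comes out as a whole number, which is consistent with the size reduction coming entirely from dropped embedding rows.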

	## Mining Dataset Statistics
	- Number of texts used for mining: 200,000
	- Dataset: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
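
The exact mining procedure is described in the linked trimming blog post. As a rough illustration of the general idea only, the sketch below assumes that tokens are selected by how often they occur in the mining texts, with special tokens always kept. The function `mine_vocabulary`, the `always_keep` parameter, the `[MASK]` token name, and the whitespace tokenizer are all hypothetical stand-ins, not the method actually used for this model.

```python
from collections import Counter


def mine_vocabulary(texts, tokenize, vocab_size, always_keep=()):
    """Return up to `vocab_size` tokens: `always_keep` first, then the most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))

    kept = list(dict.fromkeys(always_keep))  # de-duplicate while preserving order
    kept_set = set(kept)
    for token, _ in counts.most_common():
        if len(kept) >= vocab_size:
            break
        if token not in kept_set:
            kept.append(token)
            kept_set.add(token)
    return kept


# Toy corpus; str.split stands in for the real subword tokenizer.
corpus = ["bore da", "bore da i chi", "nos da"]
vocab = mine_vocabulary(corpus, str.split, vocab_size=4, always_keep=["[MASK]"])
print(vocab)
```

In a real setting the kept tokens would then be used to slice the embedding matrix and rebuild the tokenizer, which this sketch does not attempt.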

	## Usage
	```python
	from transformers import AutoModel, AutoTokenizer

	# Load the trimmed encoder (without a task head) and its tokenizer from the Hub.
	model_name = "alphaedge-ai/mmBERT-small-cym-32768"
	model = AutoModel.from_pretrained(model_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	```
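
Once loaded, the encoder can be used for feature extraction. The sketch below shows one common way to turn token-level outputs into a single vector per text, using attention-mask-weighted mean pooling. The pooling strategy and the helper name `embed` are assumptions made for illustration; the model card does not prescribe how to pool. It requires `torch` and `transformers` to be installed and downloads the model on first call.

```python
def embed(texts, model_name="alphaedge-ai/mmBERT-small-cym-32768"):
    """Return one mean-pooled embedding per input text (hypothetical helper)."""
    # Local imports: the heavy dependencies are only needed when this is called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Zero out padding positions, then average over the real tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts
```

For example, `embed(["Bore da!", "Mae hi'n braf heddiw."])` should return a tensor with one row per input text.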

	## Citations

	#### mmBERT
	```
	@misc{marone2025mmbertmodernmultilingualencoder,
	title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
	author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
	year={2025},
	eprint={2509.06888},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2509.06888},
	}
	```

	#### Trimming blog post
	```
	@misc{hf_blogpost_trimming,
	title={Introduction to Trimming},
	author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
	year={2026},
	url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
	}
	```