---
language:
- en
- de
- fr
- es
- it
- pt
- nl
- pl
- ru
- uk
- cs
- ro
- hu
- sv
- da
- fi
- no
- el
- bg
- sk
- hr
- sr
- tr
license: mit
tags:
- text-to-speech
- tts
- speech-synthesis
- audio-generation
- european-languages
- diffusion
- autoregressive
pipeline_tag: text-to-speech
inference: false
model-index:
- name: kugelaudio-0-open
  results:
  - task:
      type: text-to-speech
    dataset:
      type: custom
      name: YODAS2
    metrics:
    - type: win-rate
      value: 78.0
      name: Human Preference vs ElevenLabs
---
# 🎙️ KugelAudio-0-Open
**Open-source text-to-speech for European languages**
7B parameter model powered by an AR + Diffusion architecture
License: MIT · Python 3.10+ · Hosted API
---
## Motivation
**Open-source text-to-speech models for European languages lag significantly behind their English counterparts.** While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages have been underserved by the open-source community.
KugelAudio aims to change this. Building on the excellent foundation laid by the [VibeVoice team at Microsoft](https://github.com/microsoft/VibeVoice), we've trained a model specifically focused on European language coverage, using approximately **200,000 hours** of highly pre-processed and enhanced speech data from the [YODAS2 dataset](https://huggingface.co/datasets/espnet/yodas).
## 🏆 Benchmark Results: Outperforming ElevenLabs
**KugelAudio ranked first in our human preference benchmark**, ahead of commercial systems including ElevenLabs. The result indicates that open-source models can now rival, and in this German-language test surpass, leading commercial TTS systems.
### Human Preference Benchmark (A/B Testing)
We conducted extensive A/B testing with **339 human evaluations** to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.
### German Language Evaluation
The evaluation specifically focused on **German language samples** with diverse emotional expressions and speaking styles:
* **Neutral Speech**: Standard conversational tones
* **Shouting**: High-intensity, elevated volume speech
* **Singing**: Melodic and rhythmic speech patterns
* **Drunken Voice**: Slurred and irregular speech characteristics
These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.
### OpenSkill Ranking Results
| Rank | Model | Score | Record | Win Rate |
|------|-------|-------|--------|----------|
| 🥇 1 | **KugelAudio** | **26** | 71W / 20L / 23T | **78.0%** |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
_Based on 339 evaluations, ranked with the OpenSkill Bayesian skill-rating system._
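
The Win Rate column can be reproduced from the Record column if ties are left out, that is, wins divided by wins plus losses. The sketch below recomputes it and cross-checks the evaluation count; `records` and `win_rate` are illustrative names and not part of the KugelAudio package.

```python
# Recompute the "Win Rate" column from the "Record" column of the table above.
# Assumption: win rate = wins / (wins + losses), so ties are excluded.

records = {
    # model: (wins, losses, ties)
    "KugelAudio": (71, 20, 23),
    "ElevenLabs Multi v2": (56, 34, 22),
    "ElevenLabs v3": (64, 34, 16),
    "Cartesia": (55, 38, 19),
    "VibeVoice": (30, 74, 8),
    "CosyVoice v3": (15, 91, 8),
}


def win_rate(wins: int, losses: int) -> float:
    """Return the win rate in percent over decided comparisons only."""
    return 100.0 * wins / (wins + losses)


for model, (wins, losses, ties) in records.items():
    print(f"{model}: {win_rate(wins, losses):.1f}%")

# Every A/B evaluation involves two models, so the per-model appearances
# add up to twice the number of evaluations.
total_appearances = sum(w + l + t for w, l, t in records.values())
print(total_appearances // 2)  # 339
```

The appearance count matching 339 evaluations is a useful sanity check that the records in the table are internally consistent.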
## Audio Samples
Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:
### German Voice Samples
| Sample | Description |
|--------|-------------|
| **Whispering** | Soft whispering voice |
| **Female Narrator** | Professional female reader voice |
| **Angry Voice** | Irritated and frustrated speech |
| **Radio Announcer** | Professional radio broadcast voice |
*All samples are generated using pre-encoded voice embeddings.*
### Training Details
- **Base Model**: [Microsoft VibeVoice](https://github.com/microsoft/VibeVoice)
- **Training Data**: ~200,000 hours from [YODAS2](https://huggingface.co/datasets/espnet/yodas)
- **Hardware**: 8x NVIDIA H100 GPUs
- **Training Duration**: 5 days
### Supported Languages
This model supports the following European languages:
| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|----------|------|------|----------|------|------|----------|------|------|
| English | en | 🇺🇸 | German | de | 🇩🇪 | French | fr | 🇫🇷 |
| Spanish | es | 🇪🇸 | Italian | it | 🇮🇹 | Portuguese | pt | 🇵🇹 |
| Dutch | nl | 🇳🇱 | Polish | pl | 🇵🇱 | Russian | ru | 🇷🇺 |
| Ukrainian | uk | 🇺🇦 | Czech | cs | 🇨🇿 | Romanian | ro | 🇷🇴 |
| Hungarian | hu | 🇭🇺 | Swedish | sv | 🇸🇪 | Danish | da | 🇩🇰 |
| Finnish | fi | 🇫🇮 | Norwegian | no | 🇳🇴 | Greek | el | 🇬🇷 |
| Bulgarian | bg | 🇧🇬 | Slovak | sk | 🇸🇰 | Croatian | hr | 🇭🇷 |
| Serbian | sr | 🇷🇸 | Turkish | tr | 🇹🇷 | | | |
> **📊 Language Coverage Disclaimer**: Quality varies significantly by language. Spanish, French, English, and German have the strongest representation in our training data (~200,000 hours in total from YODAS2). Other languages may show reduced quality, prosody, or vocabulary coverage, depending on how much data the training set contains for them.
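
When routing text to the model, it can help to check the language code up front and flag languages with weaker coverage. The following is a minimal sketch built only from the table and disclaimer above; `SUPPORTED_LANGUAGES`, `STRONGEST_COVERAGE`, and `coverage_tier` are hypothetical helpers, not part of the `kugelaudio_open` API.

```python
# Language codes copied from the "Supported Languages" table.
SUPPORTED_LANGUAGES = {
    "en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "uk", "cs", "ro",
    "hu", "sv", "da", "fi", "no", "el", "bg", "sk", "hr", "sr", "tr",
}

# Languages the disclaimer names as best represented in the training data.
STRONGEST_COVERAGE = {"es", "fr", "en", "de"}


def coverage_tier(code: str) -> str:
    """Classify a language code as 'strongest' or 'other' coverage.

    Raises ValueError for codes that are not in the supported list.
    """
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return "strongest" if normalized in STRONGEST_COVERAGE else "other"


print(coverage_tier("de"))
print(coverage_tier("hr"))
```

A caller could, for example, log a warning or request a manual listening check for anything in the "other" tier.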
### Model Specifications
| Property | Value |
| --------------------- | --------------------------------------------------------------------------- |
| **Parameters** | 7B |
| **Architecture** | AR + Diffusion (Qwen2.5-7B backbone) |
| **Base Model** | [Microsoft VibeVoice](https://github.com/microsoft/VibeVoice) |
| **Audio Sample Rate** | 24kHz |
| **Audio Format** | Mono, float32 |
| **VRAM Required** | \~19GB |
| **Training Hardware** | 8x NVIDIA H100 |
| **Training Duration** | 5 days |
| **Training Data** | \~200,000 hours from [YODAS2](https://huggingface.co/datasets/espnet/yodas) |
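
The audio format in the table fixes the size of raw output: 24,000 samples per second, one channel, four bytes per float32 sample. The sketch below works through that arithmetic; the helper names are illustrative, and the figures cover raw samples only, without any file header or compression.

```python
SAMPLE_RATE_HZ = 24_000   # "Audio Sample Rate" from the table
CHANNELS = 1              # mono
BYTES_PER_SAMPLE = 4      # float32


def raw_audio_bytes(duration_seconds: float) -> int:
    """Return the size in bytes of uncompressed audio of the given length."""
    num_samples = round(duration_seconds * SAMPLE_RATE_HZ)
    return num_samples * CHANNELS * BYTES_PER_SAMPLE


def duration_seconds(num_samples: int) -> float:
    """Return the playback length in seconds of a mono sample buffer."""
    return num_samples / SAMPLE_RATE_HZ


print(raw_audio_bytes(1))        # 96000 bytes per second
print(raw_audio_bytes(60))       # 5760000 bytes per minute
print(duration_seconds(48_000))  # 2.0 seconds
```

At roughly 5.8 MB per minute, long generations are cheap to hold in memory but worth encoding to a compressed format before storage or transfer.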
## Quick Start
### Installation
```bash
# Install with pip
pip install kugelaudio-open
# Or with uv (recommended)
uv pip install kugelaudio-open
```
### Basic Usage
```python
from kugelaudio_open import (
    KugelAudioForConditionalGenerationInference,
    KugelAudioProcessor,
)
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
    "kugelaudio/kugelaudio-0-open",
    torch_dtype=torch.bfloat16,
).to(device)
model.eval()
processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")

# Strip encoder weights to save VRAM (only decoders needed for inference)
model.model.strip_encoders()

# See available voices
print(processor.get_available_voices())  # ["default", "warm", "clear"]

# Generate speech with a specific voice
inputs = processor(text="Hallo Welt! Das ist KugelAudio.", voice="default", return_tensors="pt")
# Move tensor inputs to the model's device; leave non-tensor values unchanged
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
```
### Voices
KugelAudio provides pre-encoded voices that can be selected by name. The voices are stored as `.pt` files in the `voices/` folder and are automatically downloaded when needed.
```python
# List available voices
voices = processor.get_available_voices()
print(voices)  # ["default", "warm", "clear"]

# Generate with a specific voice
inputs = processor(text="Hallo, das ist eine warme Stimme!", voice="warm", return_tensors="pt")
# Move tensor inputs to the model's device; leave non-tensor values unchanged
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)
processor.save_audio(outputs.speech_outputs[0], "warm_voice_output.wav")
```
> **Note:** Voice cloning from raw audio is not supported in this open-source release. Only the pre-encoded voices listed in `voices/voices.json` are available.
### Generation Parameters
| Parameter | Default | Description |
| ---------------- | ------- | -------------------------------------------------------------------------- |
| `cfg_scale`      | 3.0     | Classifier-free guidance scale (1.0-10.0). Higher = more adherence to text |
| `max_new_tokens` | 2048    | Maximum number of tokens to generate                                       |
| `do_sample`      | False   | Whether to use sampling (vs greedy decoding)                               |
| `temperature`    | 1.0     | Sampling temperature (if `do_sample=True`)                                 |
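
For intuition about `cfg_scale`: classifier-free guidance in its common form extrapolates from an unconditional prediction toward the text-conditioned one. The sketch below shows that generic formula on plain Python lists. It is not taken from the KugelAudio code base, and `apply_cfg` and its arguments are hypothetical names.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], cfg_scale: float) -> List[float]:
    """Combine conditional and unconditional predictions.

    Generic classifier-free guidance:
        guided = uncond + cfg_scale * (cond - uncond)
    A scale of 1.0 returns the conditional prediction unchanged; larger
    values push further in the direction suggested by the conditioning.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy prediction given the text
uncond = [0.0, 1.0]  # toy prediction without the text

print(apply_cfg(cond, uncond, 1.0))  # [1.0, 2.0]
print(apply_cfg(cond, uncond, 3.0))  # [3.0, 4.0]
```

This is why higher values tend to follow the text more closely: the difference between the two predictions is amplified. Very high values can overshoot, which is a common reason to stay near the default.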
## Architecture
KugelAudio uses a hybrid **Autoregressive + Diffusion** architecture based on Microsoft's VibeVoice:
```
Text Input → Qwen2.5-7B Backbone → Diffusion Head → Acoustic Decoder → Audio Output
                      ↑
         Pre-encoded Voice Embedding
```
1. **Text Encoder**: Qwen2.5-7B language model encodes input text
2. **Diffusion Head**: Predicts speech latents using denoising diffusion (20 steps)
3. **Acoustic Decoder**: Hierarchical convolutional decoder converts latents to 24kHz audio
## Audio Watermarking
All audio generated by this model is automatically watermarked using Facebook's AudioSeal. The watermark is:
* **Imperceptible**: No audible difference in audio quality
* **Robust**: Survives compression, resampling, and editing
* **Detectable**: Can verify if audio was generated by KugelAudio
### Verify Watermark
```python
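from kugelaudio_open.watermark import AudioWatermark

# `audio` is assumed to hold a generated 24 kHz mono waveform,
# for example outputs.speech_outputs[0] from the Basic Usage example
watermark = AudioWatermark()
result = watermark.detect(audio, sample_rate=24000)
print(f"Watermark detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")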
```
## Intended Use
### ✅ Appropriate Uses
* **Accessibility**: Text-to-speech for visually impaired users
* **Content Creation**: Podcasts, videos, audiobooks, e-learning
* **Voice Assistants**: Chatbots and virtual assistants
* **Language Learning**: Pronunciation practice and language education
* **Creative Projects**: With proper consent and attribution
### ❌ Prohibited Uses
* Creating deepfakes or misleading content
* Impersonating individuals without explicit consent
* Fraud, deception, or scams
* Harassment or abuse
* Any illegal activities
## Limitations
* **VRAM Requirements**: Requires \~19GB VRAM for inference (less with `strip_encoders()`)
* **Speed**: Approximately 1.0x real-time on modern GPUs
* **Language Quality Variation**: Quality may vary across languages based on training data distribution
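
To check the speed figure on your own hardware, compare the length of the generated audio with the wall-clock time spent generating it. The sketch below is a minimal helper under that definition (audio seconds per compute second, so 1.0 means real time); `realtime_speed` is an illustrative name, not a function of the package.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate from the model specifications


def realtime_speed(num_samples: int, generation_seconds: float,
                   sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Return audio seconds produced per second of compute (1.0 = real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / generation_seconds


# 10 s of audio (240,000 samples) generated in 10 s of wall-clock time
print(realtime_speed(240_000, 10.0))  # 1.0

# The same audio generated in 5 s would be twice real time
print(realtime_speed(240_000, 5.0))   # 2.0
```

In practice, `generation_seconds` could be measured with `time.perf_counter()` around the `model.generate(...)` call, and `num_samples` taken from the length of the returned waveform.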
## Hosted API
For production use without managing infrastructure, use our hosted API at kugelaudio.com:
* ⚡ **Ultra-low latency**: <100ms end-to-end
* 🌍 **Global edge deployment**
* 🔧 **Zero setup required**
* 📈 **Auto-scaling**
```python
from kugelaudio import KugelAudio
client = KugelAudio(api_key="your_api_key")
audio = client.tts.generate(text="Hello from KugelAudio!", model="kugel-1-turbo")
audio.save("output.wav")
```
## Acknowledgments
This model would not have been possible without the contributions of many individuals and organizations:
* **Microsoft VibeVoice Team**: For the excellent foundation architecture that this model builds upon
* **YODAS2 Dataset**: For providing the large-scale multilingual speech data
* **Qwen Team**: For the powerful language model backbone
* **Facebook AudioSeal**: For the audio watermarking technology
### Special Thanks
* **Carlos Menke**: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
* **AI Service Center Berlin-Brandenburg (KI-Servicezentrum)**: For providing the GPU resources (8x H100) that made training this model possible
## Citation
```bibtex
@software{kugelaudio2026,
title = {KugelAudio: Open-Source Text-to-Speech for European Languages},
author = {Kratzenstein, Kajo and Menke, Carlos},
year = {2026},
institution = {Hasso-Plattner-Institut},
url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}
```
## License
This model is released under the MIT License.
## Authors
**Kajo Kratzenstein**
📧 [kajo@kugelaudio.com](mailto:kajo@kugelaudio.com)
🌐 [kugelaudio.com](https://kugelaudio.com)
**Carlos Menke**
---
**Funding Notice**
Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums für Forschung, Technologie und Raumfahrt unter dem Förderkennzeichen »KI-Servicezentrum Berlin-Brandenburg« 16IS22092 gefördert.
_This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "AI Service Center Berlin-Brandenburg" 16IS22092._