Saudi Firm Launches Community-Driven Arabic Text-to-Speech Rankings


Riyadh-based AI company Navid, a division of Watad, has introduced the Arabic TTS Arena, a new platform designed to evaluate Arabic text-to-speech (TTS) models based on human preference. This open, community-driven leaderboard lets native Arabic speakers compare AI-generated voices side by side and vote for the one that sounds more natural. The system ranks models with the Bradley-Terry rating model, a close relative of the Elo system used to rank chess players and the method behind the popular LMArena language model leaderboard, turning individual votes into statistically grounded scores.
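
For readers curious how pairwise votes become scores, the following is a minimal sketch of a Bradley-Terry fit in Python. It is not the Arena's code: the function name `fit_bradley_terry`, the `(winner, loser)` vote format, and the choice of a simple iterative update are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) vote pairs.

    Hypothetical helper for illustration only. Returns a dict mapping each
    model name to a strength; strengths are normalized to sum to 1.
    Assumes the two models in every vote are distinct.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    # Start from equal strengths and refine them iteratively.
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Probability that model a is preferred over model b under the fit."""
    return strength[a] / (strength[a] + strength[b])
```

Under this model, the chance that one voice is preferred over another depends only on the ratio of their strengths. In a toy case with two models where one wins three of four votes, the fitted strengths come out to roughly 0.75 and 0.25.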

Why This Matters: Arabic is spoken by more than 400 million people across more than 20 countries, yet high-quality Arabic TTS is a recent development. Traditional TTS evaluation relies on lab tests and algorithmic benchmarks, which often fail to capture what listeners actually prefer. The Arabic TTS Arena inverts that approach by prioritizing real-world listening experience. This matters especially for Arabic, a language with immense dialectal variation where “sounding natural” is highly subjective.

Key Features of the Arabic TTS Arena

The platform, hosted on Hugging Face, currently ranks 15 models, including both open-source and commercial systems:

  • Arabic F5-TTS
  • Arabic Spark TTS
  • Chatterbox
  • Fish Speech
  • Habibi TTS
  • Hamsa TTS
  • KaniTTS Arabic
  • Lahgtna
  • MOSS-TTS
  • OuteTTS
  • Silma TTS (small & large)
  • SpeechT5 Arabic
  • XTTS v2

The Arena’s design ensures unbiased voting: model identities are hidden until after each comparison, preventing pre-existing brand reputation from influencing results. Adding a new model is simple, requiring only a Python class implementation.
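
The blind-comparison flow and the "one Python class per model" idea can be pictured with a short sketch. Everything here is hypothetical: the `TTSModel` interface, the `synthesize` method, `PlaceholderModel`, and the helper functions are illustrative names, and the Arena's actual base class and voting code may look quite different.

```python
import random
from abc import ABC, abstractmethod


class TTSModel(ABC):
    """Hypothetical interface a new model might implement (an assumption)."""

    name: str

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return audio bytes for the given Arabic text."""


class PlaceholderModel(TTSModel):
    """Stand-in model that returns placeholder bytes instead of real audio."""

    def __init__(self, name: str):
        self.name = name

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def make_blind_pair(models, text, rng):
    """Pick two distinct models and label their clips only as 'A' and 'B'.

    Returns (clips, identities). The identities mapping is kept away from
    the voter until the vote has been cast.
    """
    first, second = rng.sample(models, 2)
    clips = {"A": first.synthesize(text), "B": second.synthesize(text)}
    identities = {"A": first.name, "B": second.name}
    return clips, identities


def record_vote(identities, choice):
    """Turn a vote for 'A' or 'B' into a (winner, loser) pair of model names."""
    loser_label = "B" if choice == "A" else "A"
    return identities[choice], identities[loser_label]
```

In this sketch the voter only ever sees the anonymous labels, and model names are attached after the choice is recorded, which is the property that keeps brand reputation out of the result.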

Beyond Sound Quality: The TTS Triangle

Navid’s research highlights the “TTS Triangle”—a framework arguing that effective speech synthesis must address three dimensions: what is said, who is saying it, and how it’s delivered. Most existing Arabic TTS models, they claim, only fully address one or two of these.

The team argues that reducing Arabic’s dialectal diversity to broad country-level labels (e.g., “Egyptian” or “Saudi”) is inadequate. Dialects vary drastically even within cities, making specific reference speaker identities more valuable than generic regional classifications.

Furthermore, they criticize emotion tags (like “[laugh]” or “[sad]”) as artificial. Human emotion permeates entire utterances, rather than appearing as isolated markers. Instead, they advocate for natural language delivery instructions—similar to how voice actors are directed.

Context: Saudi Arabia’s Growing AI Ambitions

This launch builds on previous work by Watad, the parent company of Navid. In March 2024, Watad released Mulhem, a Saudi Arabia-specific large language model trained entirely on domestic data. Mulhem outperformed comparable models in initial tests, demonstrating the Kingdom’s growing investment in localized AI development.

“For synthetic speech, a benchmark that reflects what sounds people actually prefer to hear could be fundamentally more useful than one that reflects what an algorithm thinks sounds correct.”

The Arabic TTS Arena represents a shift toward more human-centered AI evaluation—a trend likely to expand as language models become more sophisticated and localized.