A leaderboard score is the least informative number about an Arabic model. Here is why translated benchmarks mislead, where aggregate scores hide the failures that matter, and the evaluation protocol we recommend before anything ships.
Every Arabic model announcement arrives with a benchmark table, and every benchmark table tells the same comfortable story: the model is competitive. Then the model meets production traffic — dialectal, code-switched, culturally loaded — and support tickets tell a different story. The gap is rarely the model's fault alone. It is the evaluation's.
Much of the Arabic evaluation stack is machine-translated from English suites. Translation carries over English framing, references and idiom, producing "translationese" that no Arabic speaker would write — so the eval rewards models that are good at translated English rather than good at Arabic. Worse, culturally specific questions arrive intact: a model can ace translated trivia about American institutions while knowing little about the region it will serve.
Arabic is diglossic: formal writing happens in Modern Standard Arabic, but everyday speech and chat happen in dialect. An evaluation composed of clean MSA prompts leaves the model untested on the inputs it will actually receive. A single "Arabic" score with no dialect breakdown should be treated as an MSA score.
Averaging across tasks and varieties produces one flattering number. But a model that is strong in Egyptian and weak in Gulf is not "moderately good" — it is unusable for a Saudi deployment. Scores must be reported per variety and per capability, or they conceal exactly the risk you are trying to measure.
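To make the averaging problem concrete, here is a minimal sketch of per-variety, per-capability reporting. The scores, variety names and capability names are invented for illustration, not measurements from any real model, and the helper names `breakdown` and `worst_cell` are our own.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-item results: (variety, capability, score in [0, 1]).
# These numbers are made up purely to illustrate the effect.
results = [
    ("egyptian", "qa", 0.9),
    ("egyptian", "summarization", 0.8),
    ("gulf", "qa", 0.4),
    ("gulf", "summarization", 0.3),
    ("msa", "qa", 0.9),
    ("msa", "summarization", 0.9),
]


def breakdown(rows):
    """Group scores by (variety, capability) and return the mean of each cell."""
    cells = defaultdict(list)
    for variety, capability, score in rows:
        cells[(variety, capability)].append(score)
    return {key: mean(scores) for key, scores in cells.items()}


def worst_cell(rows):
    """Return the (variety, capability) cell with the lowest mean, and that mean."""
    table = breakdown(rows)
    key = min(table, key=table.get)
    return key, table[key]


# The single blended number that a leaderboard would show.
overall = mean(score for _, _, score in results)

if __name__ == "__main__":
    print(f"blended score: {overall:.2f}")
    for (variety, capability), score in sorted(breakdown(results).items()):
        print(f"{variety:10s} {capability:15s} {score:.2f}")
    print("weakest cell:", worst_cell(results))
```

With these toy numbers the blended score comes out at about 0.70, while the Gulf summarization cell sits at 0.30. Reporting the weakest cell next to the mean is one simple way to keep that risk visible.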
Arabic writing tolerates surface variation — hamza forms, taa marbuta, optional diacritics — that exact-match metrics punish arbitrarily unless normalization is handled deliberately. And Arabic text typically fragments into more tokens per word than English in common tokenizers, which quietly affects context budgets, latency and cost comparisons between models.
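Both points can be sketched in a few lines. The normalization below is one possible policy, not a standard: folding hamza-carrying alef forms to bare alef and taa marbuta to haa is lossy, so it is best applied only for scoring and reported alongside the raw exact-match figure. The function names are our own, and `tokenize` stands in for whatever tokenizer each model under comparison actually uses.

```python
import re
import unicodedata

# Diacritics (tashkeel) U+064B..U+0652, superscript alef U+0670, tatweel U+0640.
_STRIP = re.compile("[\u064B-\u0652\u0670\u0640]")

# One possible folding policy (an assumption, and a lossy one).
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
})


def normalize(text: str) -> str:
    """Apply Unicode NFKC, strip diacritics and tatweel, fold letter variants,
    and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = _STRIP.sub("", text)
    text = text.translate(_FOLD)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match computed on normalized text rather than raw strings."""
    return normalize(prediction) == normalize(reference)


def tokens_per_word(texts, tokenize) -> float:
    """Average tokens per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    """
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words if words else 0.0


if __name__ == "__main__":
    # Fully vocalized "Ahmad" with hamza vs. the bare, undiacritized spelling.
    vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
    bare = "\u0627\u062D\u0645\u062F"
    print(vocalized == bare, exact_match(vocalized, bare))
```

Running `tokens_per_word` over the same parallel sample with each candidate model's tokenizer gives a rough ratio for adjusting context-budget and cost comparisons. Whitespace word counts are themselves an approximation for Arabic, where clitics attach to the word.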
A trustworthy Arabic evaluation reports each of the following separately, per dialect and per domain — never as one blended number:
The mechanics matter as much as the dimensions. The evaluations we trust share five properties:
None of this is exotic — it is the same discipline frontier labs apply to English evaluation, applied with people who actually speak the varieties being measured. The expensive part is the human judgment. That is also the part that cannot be skipped: for a diglossic language, human raters are not a nice-to-have on top of automatic metrics; they are the metric.
We build independent, dialect-stratified benchmarks with native raters and calibrated rubrics — for model selection, fine-tune regression testing and pre-launch sign-off.