For the full technical specification, see our open methodology on GitHub.
A reproducible, openly licensed protocol for evaluating AI-powered Quranic recitation tools across five dimensions. Anyone (developers, researchers, educators, or users) can apply these tests to any tool.
This document defines a practical testing protocol that anyone can apply to assess how well a recitation tool performs its stated functions. It is not a certification or seal of approval. It draws on established research in Quranic speech recognition, including work on the Ar-DAD dataset and the knowledge-centric evaluation approach described in Hakami et al. (2025, arXiv:2510.12858).
The methodology assesses recitation tools across five dimensions, each scored on a 1–5 scale.
| Dimension | What It Measures |
|---|---|
| Phoneme Accuracy | Correct identification of individual Arabic phonemes, especially minimal pairs |
| Tajweed Rule Detection | Detecting Ghunnah, Idghaam, Ikhfaa, Qalqalah, Madd, and Laam rules |
| Timing & Prosody | Madd duration, waqf/ibtidaa, recitation speed classification |
| Feedback Quality | Specificity, actionability, and pedagogical accuracy of AI feedback |
| Demographic Robustness | Equal performance across accents, ages, and genders |
Phoneme Accuracy measures whether the tool can correctly distinguish individual Arabic sounds, particularly those that are phonetically close and commonly confused by learners. Test set: a minimum of 120 isolated phoneme pairs covering ص/س, ض/د, ط/ت, ظ/ذ, ع/أ, غ/خ, ق/ك, ث/س, and ذ/ز, with at least 3 speakers per pair: a native Arabic speaker, a non-native intermediate learner, and a beginner.
| Score | Criteria |
|---|---|
| 5 | 95%+ correct identification across all categories and speaker types |
| 4 | 85–94% accuracy; occasional errors on hardest pairs |
| 3 | 70–84% accuracy; reliable on emphatic pairs but weak on halqi distinctions |
| 2 | 50–69% accuracy; frequent confusion on 2+ phoneme categories |
| 1 | Below 50%; effectively guessing or using word-level ASR without phoneme analysis |
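
The percentage bands in the rubric above can be expressed as a simple lookup. The following is a minimal sketch, not part of the published methodology: the function name `phoneme_accuracy_score` is hypothetical, the bands are assumed to be contiguous (so 94.5% falls in band 4), and only the overall percentage is used. The qualitative notes in each row, such as which pair categories fail, are left to the reviewer.

```python
def phoneme_accuracy_score(correct: int, total: int) -> int:
    """Map a raw identification count to the 1-5 rubric band.

    Assumption: band boundaries are inclusive at the lower edge
    (95%, 85%, 70%, 50%) and contiguous between rows.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    # Integer comparison avoids floating-point rounding at the boundaries.
    if correct * 100 >= 95 * total:
        return 5
    if correct * 100 >= 85 * total:
        return 4
    if correct * 100 >= 70 * total:
        return 3
    if correct * 100 >= 50 * total:
        return 2
    return 1


# Illustrative call: 102 correct identifications out of 120 trials (85%).
example_score = phoneme_accuracy_score(102, 120)
```

Running the same function per speaker type (native, intermediate, beginner) and per pair category gives a quick view of where a tool falls short of the "all categories and speaker types" condition for a score of 5.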
Tajweed Rule Detection evaluates whether the tool can detect the application or misapplication of specific tajweed rules during connected recitation. Mandatory rules: Ghunnah, Idghaam (with and without ghunnah), Ikhfaa (before 15 letters), Qalqalah (قطبجد at sukoon/waqf), Madd (natural: 2 counts; connected: 4–5 counts; necessary: 6 counts), and Laam tafkheem/tarqeeq in "Allah".
| Score | Criteria |
|---|---|
| 5 | Detects and names all 6 mandatory rules with 90%+ accuracy |
| 4 | Detects 5/6 rules reliably; occasional misses on subtle rules |
| 3 | Detects 3–4 rules; typically catches ghunnah and qalqalah |
| 2 | Detects 1–2 rules; mostly word-level error detection |
| 1 | No tajweed-specific detection; tool only identifies word substitutions |
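
One way to apply the rubric above is to measure detection accuracy per rule and count how many rules clear a threshold. The sketch below rests on stated assumptions rather than on the methodology itself: the rule keys and the function name `tajweed_detection_score` are hypothetical, and "reliably" is assumed to mean the same 90% per-rule accuracy that the top band names.

```python
from typing import Mapping

# Hypothetical keys for the six mandatory rules listed in this section.
MANDATORY_RULES = ("ghunnah", "idghaam", "ikhfaa", "qalqalah", "madd", "laam")


def tajweed_detection_score(rule_accuracy: Mapping[str, float],
                            threshold: float = 0.90) -> int:
    """Score 1-5 from per-rule detection accuracy (fractions in 0..1).

    Assumption: a rule counts as detected when its accuracy is at or
    above `threshold`.
    """
    missing = set(MANDATORY_RULES) - set(rule_accuracy)
    if missing:
        raise ValueError(f"missing results for: {sorted(missing)}")

    detected = sum(1 for rule in MANDATORY_RULES
                   if rule_accuracy[rule] >= threshold)

    if detected == 6:
        return 5
    if detected == 5:
        return 4
    if detected >= 3:
        return 3
    if detected >= 1:
        return 2
    return 1
```

A tester who reads "reliably" differently can pass another `threshold` without changing the band logic.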
Timing & Prosody assesses the rhythmic and durational aspects of recitation: madd duration measurement, ghunnah duration, waqf/ibtidaa, and recitation speed classification (tahqeeq, tadweer, hadr). Tests use 10 verses containing madd of different lengths, each recorded at three speeds, including takes with deliberately shortened or extended madd.
| Score | Criteria |
|---|---|
| 5 | Measures duration within 0.5 count; adapts to speed; identifies waqf appropriateness |
| 3 | Detects major timing errors but lacks precision on borderline cases |
| 1 | No timing analysis; tool only evaluates at word/phoneme level |
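
The "within 0.5 count" criterion above compares the tool's reported madd length with a reference annotation, both expressed in counts. The following is a rough sketch under assumptions not stated in the methodology: the tool reports durations in seconds, the tester supplies a seconds-per-count value for each recording speed, and the helper names `to_counts` and `fraction_within_tolerance` are hypothetical.

```python
from typing import Iterable, Tuple


def to_counts(duration_s: float, seconds_per_count: float) -> float:
    """Convert a duration in seconds to counts for a given recitation speed."""
    if seconds_per_count <= 0:
        raise ValueError("seconds_per_count must be positive")
    return duration_s / seconds_per_count


def fraction_within_tolerance(pairs: Iterable[Tuple[float, float]],
                              tolerance: float = 0.5) -> float:
    """Share of (tool_counts, reference_counts) pairs that differ by at
    most `tolerance` counts."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no measurements supplied")
    hits = sum(1 for tool, ref in pairs if abs(tool - ref) <= tolerance)
    return hits / len(pairs)
```

Computing this fraction separately for each of the three recording speeds shows whether the tool adapts its notion of a count to the reciter's pace, which the top band requires.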
A tool's detection capability is only useful if translated into clear, actionable, pedagogically appropriate feedback. Criteria: Specificity (exact location, rule, error nature), Actionability (what to do differently), Pedagogical sensitivity (adapts to user level), Accuracy (correctly describes the issue).
| Score | Criteria |
|---|---|
| 5 | 90%+ feedback rated accurate and actionable by reviewing scholar; adapts to user level |
| 3 | Mostly accurate but often vague; user knows something is wrong but not what to fix |
| 1 | Generic feedback ("good"/"needs improvement") or frequently inaccurate |
AI speech systems tend to be biased toward the demographic composition of their training data. This dimension checks that a tool performs consistently for reciters from Arabic-dialect, South Asian, Southeast Asian, African, and Western backgrounds; for all genders; for children, adolescents, adults, and elderly speakers; and for proficiency levels from beginner to advanced.
| Score | Criteria |
|---|---|
| 5 | No demographic group drops more than 5% below overall average; tested across 6+ language backgrounds |
| 3 | 1–2 demographic groups show 10–15% accuracy drops; tested across 3–4 language backgrounds |
| 1 | Large accuracy gaps (>20%) across demographics, or only tested on a single demographic group |
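
The drops named in the rubric above can be computed from per-group accuracy figures. The sketch below makes two assumptions the methodology does not spell out: the overall average is the unweighted mean of the group accuracies, and a drop is measured in percentage points rather than as a relative change. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Mapping


def accuracy_drops(group_accuracy: Mapping[str, float]) -> Dict[str, float]:
    """Return, per group, how far its accuracy sits below the overall
    average (in percentage points; negative means above average)."""
    if not group_accuracy:
        raise ValueError("at least one demographic group is required")
    overall = mean(list(group_accuracy.values()))
    return {group: overall - acc for group, acc in group_accuracy.items()}


def groups_below(group_accuracy: Mapping[str, float],
                 limit_points: float) -> List[str]:
    """Names of groups whose drop exceeds `limit_points`, sorted by name."""
    drops = accuracy_drops(group_accuracy)
    return sorted(g for g, d in drops.items() if d > limit_points)
```

Calling `groups_below(results, 5)` lists the groups that would rule out a score of 5, and `groups_below(results, 20)` lists those that would place the tool in the lowest band. The number of language backgrounds covered still has to be checked separately.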
• Hakami et al. (2025). "A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation." arXiv:2510.12858.
• Atef et al. (2023). "Quran Recitation Recognition using End-to-End Deep Learning." arXiv:2305.07034.
• Al-Ayyoub et al. (2023). "Speech Recognition Models for Holy Quran Recitation." IJACSA, Vol. 14, No. 12.
• TajweedAI (2025). "A Hybrid ASR-Classifier for Real-Time Qalqalah Detection." NeurIPS 2025.
• Tarteel AI (2024–2025). ML Journey blog series.
License: Released under Creative Commons Attribution 4.0 (CC BY 4.0). Free to share, adapt, and build upon with appropriate credit: "QariAI Open Evaluation Methodology v1.0, qariai.app/open-methodology"