Open Evaluation Framework for AI Quran Recitation Tools

For the full technical specification, see our open methodology on GitHub.

📅 Published March 2026 🏛️ QariAI Research ⚖️ CC BY 4.0 — Free to use and cite 🔗 qariai.app/open-methodology

A reproducible, openly licensed protocol for evaluating AI-powered Quranic recitation tools across five dimensions. Anyone — developers, researchers, educators, or users — can apply these tests to any tool.

DISCLOSURE: QariAI is both the publisher and a subject of this evaluation methodology. We believe that being transparent about this dual role strengthens the methodology rather than undermining it. We invite independent researchers, scholars, and competing products to adopt, critique, and improve these criteria.

1. Purpose & Scope

This document defines a practical testing protocol that anyone can apply to assess how well a recitation tool performs its stated functions. It is not a certification or seal of approval. It draws on established research in Quranic speech recognition, including work on the Ar-DAD dataset and the knowledge-centric evaluation approach described in Hakami et al. (2025, arXiv:2510.12858).

What this document is

A testing protocol — specific, repeatable tests anyone can run against any tool
Transparent — published openly; QariAI's own results published alongside competitor results
Improvable — community contributions and critiques are welcomed

What this document is not

Not a certification — no product "passes" or "fails"; results are scored on specific dimensions
Not independent — QariAI authored this; independent adoption would strengthen it
Not a substitute for scholars — no automated evaluation replaces a qualified Qari with an ijazah

2. Evaluation Dimensions

The methodology assesses recitation tools across five dimensions, each scored on a 1–5 scale.

Dimension	What It Measures
Phoneme Accuracy	Correct identification of individual Arabic phonemes, especially minimal pairs
Tajweed Rule Detection	Detecting Ghunnah, Idghaam, Ikhfaa, Qalqalah, Madd, and Laam rules
Timing & Prosody	Madd duration, waqf/ibtidaa, recitation speed classification
Feedback Quality	Specificity, actionability, and pedagogical accuracy of AI feedback
Demographic Robustness	Equal performance across accents, ages, and genders

Dimension 1

Phoneme Accuracy

Measures whether the AI can correctly distinguish individual Arabic sounds, particularly those that are phonetically close and commonly confused by learners. Test set: a minimum of 120 isolated phoneme pairs covering the contrasts ص/س, ض/د, ط/ت, ظ/ذ, ع/أ, غ/خ, ق/ك, ث/س, and ذ/ز. Each pair is recorded by at least three speakers, covering native Arabic, non-native intermediate, and beginner levels.

Score	Criteria
5	95%+ correct identification across all categories and speaker types
4	85–94% accuracy; occasional errors on hardest pairs
3	70–84% accuracy; reliable on emphatic pairs but weak on halqi distinctions
2	50–69% accuracy; frequent confusion on 2+ phoneme categories
1	Below 50%; effectively guessing or using word-level ASR without phoneme analysis
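
The percentage bands above can be sketched as a small lookup. This is a minimal illustration, not part of the published protocol: the function name is hypothetical, it scores only the overall identification rate (the qualitative notes about categories and speaker types are not modeled), and it assumes a fractional rate such as 94.5% falls into the lower band, since the table lists whole percentages only.

```python
def phoneme_accuracy_score(correct: int, total: int) -> int:
    """Map a raw identification rate to the 1-5 band in the table above.

    Hypothetical helper: only the overall percentage is considered.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * correct / total
    # Lower bound of each band. Assumption: values between bands
    # (for example 94.5%) are assigned to the lower band.
    for lower_bound, score in ((95, 5), (85, 4), (70, 3), (50, 2)):
        if pct >= lower_bound:
            return score
    return 1  # below 50%
```

For example, 340 correct identifications out of 360 trials is about 94.4%, which this sketch places in band 4.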

Dimension 2

Tajweed Rule Detection

Evaluates whether the tool can detect the correct application or misapplication of specific tajweed rules during connected recitation. Mandatory rules: Ghunnah; Idghaam (with and without ghunnah); Ikhfaa (before its 15 letters); Qalqalah (the letters قطبجد at sukoon or waqf); Madd (natural: 2 counts, connected: 4–5 counts, necessary: 6 counts); and Laam tafkheem/tarqeeq in "Allah".

Score	Criteria
5	Detects and names all 6 mandatory rules with 90%+ accuracy
4	Detects 5/6 rules reliably; occasional misses on subtle rules
3	Detects 3–4 rules; typically catches ghunnah and qalqalah
2	Detects 1–2 rules; mostly word-level error detection
1	No tajweed-specific detection; tool only identifies word substitutions
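
One possible way to turn per-rule results into the score above is to count how many of the six mandatory rules are detected reliably. The sketch below is an assumption-laden illustration: the rubric does not define "reliably" for bands 2 to 4, so the code reuses the 90% threshold from band 5 for every band, and the rule keys and function name are hypothetical.

```python
MANDATORY_RULES = ("ghunnah", "idghaam", "ikhfaa", "qalqalah", "madd", "laam")


def tajweed_detection_score(rule_accuracy: dict, reliable_at: float = 0.90) -> int:
    """Score 1-5 from per-rule detection accuracy (fractions in 0.0-1.0).

    Assumption: a rule counts as "detected" when its accuracy is at
    least `reliable_at`; the source only states 90% for band 5.
    """
    missing = set(MANDATORY_RULES) - set(rule_accuracy)
    if missing:
        raise ValueError(f"missing results for: {sorted(missing)}")
    reliable = sum(1 for rule in MANDATORY_RULES if rule_accuracy[rule] >= reliable_at)
    if reliable == 6:
        return 5
    if reliable == 5:
        return 4
    if reliable >= 3:
        return 3
    if reliable >= 1:
        return 2
    return 1  # no tajweed-specific detection
```

A different threshold per band would be equally defensible; the parameter is exposed so evaluators can state the value they used.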

Dimension 3

Timing & Prosody

Assesses rhythmic and durational aspects: madd duration measurement, ghunnah duration, waqf/ibtidaa, and recitation speed classification (tahqeeq, tadweer, hadr). Tests use 10 verses containing madd of different lengths, each recorded at three speeds, including recordings in which the madd is deliberately shortened or extended.

Score	Criteria
5	Measures duration within 0.5 count; adapts to speed; identifies waqf appropriateness
3	Detects major timing errors but lacks precision on borderline cases
1	No timing analysis; tool only evaluates at word/phoneme level
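
The "within 0.5 count" criterion can be checked by converting a measured madd duration into counts. The sketch below assumes the evaluator has already estimated the length of one count for the recitation speed in question; that value (`seconds_per_count`) and both function names are hypothetical, and how the count length is calibrated is left open here.

```python
def madd_error_in_counts(measured_seconds: float,
                         expected_counts: float,
                         seconds_per_count: float) -> float:
    """Signed error of a madd in counts (positive means held too long).

    `seconds_per_count` is an assumed per-recitation calibration:
    slower recitation (tahqeeq) implies a longer count than hadr.
    """
    if seconds_per_count <= 0:
        raise ValueError("seconds_per_count must be positive")
    return measured_seconds / seconds_per_count - expected_counts


def within_tolerance(error_counts: float, tolerance_counts: float = 0.5) -> bool:
    """True when the error is inside the 0.5-count window from the table."""
    return abs(error_counts) <= tolerance_counts
```

For instance, with a count of 0.5 seconds, a connected madd expected at 4 counts but held for 1.0 second is 2 counts short and would fall outside the tolerance.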

Dimension 4

Feedback Quality

A tool's detection capability is only useful if translated into clear, actionable, pedagogically appropriate feedback. Criteria: Specificity (exact location, rule, error nature), Actionability (what to do differently), Pedagogical sensitivity (adapts to user level), Accuracy (correctly describes the issue).

Score	Criteria
5	90%+ feedback rated accurate and actionable by reviewing scholar; adapts to user level
3	Mostly accurate but often vague; user knows something is wrong but not what to fix
1	Generic feedback ("good"/"needs improvement") or frequently inaccurate

Dimension 5

Demographic Robustness

AI speech systems tend to be biased toward the demographic composition of their training data. This dimension tests whether a tool performs consistently across Arabic dialects and South Asian, Southeast Asian, African, and Western language backgrounds; all genders; children, adolescents, adults, and elderly speakers; and beginner to advanced proficiency.

Score	Criteria
5	No demographic group drops more than 5% below overall average; tested across 6+ language backgrounds
3	1–2 demographic groups show 10–15% accuracy drops; tested across 3–4 language backgrounds
1	Large accuracy gaps (>20%) across demographics, or only tested on a single demographic group
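
The per-group gap used in this table can be computed as follows. This is a sketch under two stated assumptions: "5% below overall average" is read as percentage points, and the overall average is taken as the unweighted mean of the group accuracies. Function and group names are hypothetical. Only the top band is checked, because the table does not define scores 2 and 4.

```python
def demographic_drops(group_accuracy: dict) -> dict:
    """Drop of each group below the overall average, in percentage points.

    Assumption: the overall average is the unweighted mean of the groups.
    Negative values mean the group is above average.
    """
    if not group_accuracy:
        raise ValueError("at least one demographic group is required")
    overall = sum(group_accuracy.values()) / len(group_accuracy)
    return {group: overall - acc for group, acc in group_accuracy.items()}


def meets_top_band(group_accuracy: dict, language_backgrounds: int) -> bool:
    """Check the score-5 row: worst drop of at most 5 points, 6+ backgrounds."""
    worst_drop = max(demographic_drops(group_accuracy).values())
    return worst_drop <= 5.0 and language_backgrounds >= 6
```

With group accuracies of 90, 88, and 80, the average is 86 and the weakest group sits 6 points below it, so the top band is not met regardless of how many language backgrounds were tested.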

3. Limitations

To our knowledge, no existing tool scores 5/5 across all dimensions.
Dataset bias is a real problem — most training datasets are dominated by adult male Arab voices.
Tajweed rule detection is significantly harder than word recognition.
Human teachers remain essential; AI tools cannot replace the spiritual and pedagogical dimensions of human instruction.

4. Academic References

• Hakami et al. (2025). "A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation." arXiv:2510.12858.

• Atef et al. (2023). "Quran Recitation Recognition using End-to-End Deep Learning." arXiv:2305.07034.

• Al-Ayyoub et al. (2023). "Speech Recognition Models for Holy Quran Recitation." IJACSA, Vol. 14, No. 12.

• TajweedAI (2025). "A Hybrid ASR-Classifier for Real-Time Qalqalah Detection." NeurIPS 2025.

• Tarteel AI (2024–2025). ML Journey blog series.

License: Released under Creative Commons Attribution 4.0 (CC BY 4.0). Free to share, adapt, and build upon with appropriate credit: "QariAI Open Evaluation Methodology v1.0, qariai.app/open-methodology"
