📄 Open Methodology · v1.0 · March 2026

Open Evaluation Framework for AI Quran Recitation Tools

For full technical specification, see our open methodology on GitHub.

📅 Published March 2026 · 🏛️ QariAI Research · ⚖️ CC BY 4.0 (free to use and cite) · 🔗 qariai.app/open-methodology

A reproducible, openly licensed protocol for evaluating AI-powered Quranic recitation tools across five dimensions. Anyone (developers, researchers, educators, or users) can apply these tests to any tool.

DISCLOSURE: QariAI is both the publisher and a subject of this evaluation methodology. We believe transparency about this dual role strengthens rather than undermines the methodology. We invite independent researchers, scholars, and competing products to adopt, critique, and improve these criteria.

1. Purpose & Scope

This document defines a practical testing protocol that anyone can apply to assess how well a recitation tool performs its stated functions. It is not a certification or seal of approval. It draws on established research in Quranic speech recognition, including work on the Ar-DAD dataset and the knowledge-centric evaluation approach described in Hakami et al. (2025, arXiv:2510.12858).

What this document is

What this document is not

2. Evaluation Dimensions

The methodology assesses recitation tools across five dimensions, each scored on a 1–5 scale.

| Dimension | What It Measures |
|---|---|
| Phoneme Accuracy | Correct identification of individual Arabic phonemes, especially minimal pairs |
| Tajweed Rule Detection | Detecting Ghunnah, Idghaam, Ikhfaa, Qalqalah, Madd, and Laam rules |
| Timing & Prosody | Madd duration, waqf/ibtidaa, recitation speed classification |
| Feedback Quality | Specificity, actionability, and pedagogical accuracy of AI feedback |
| Demographic Robustness | Equal performance across accents, ages, and genders |

Dimension 1

Phoneme Accuracy

Measures whether the AI can correctly distinguish individual Arabic sounds, particularly those that are phonetically close and commonly confused by learners. Test set: minimum 120 isolated phoneme pairs covering ص/س, ض/د, ط/ت, ظ/ذ, ع/أ, غ/خ, ق/ك, ث/س, ذ/ز. Minimum 3 speakers per pair: native Arabic speaker, non-native intermediate, beginner.

| Score | Criteria |
|---|---|
| 5 | 95%+ correct identification across all categories and speaker types |
| 4 | 85–94% accuracy; occasional errors on hardest pairs |
| 3 | 70–84% accuracy; reliable on emphatic pairs but weak on halqi distinctions |
| 2 | 50–69% accuracy; frequent confusion on 2+ phoneme categories |
| 1 | Below 50%; effectively guessing or using word-level ASR without phoneme analysis |
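
The percentage bands above can be applied mechanically once each test trial has been marked correct or incorrect. The following is a minimal sketch rather than part of the protocol: the function names, the `(pair_category, speaker_type, is_correct)` trial format, and the treatment of each band's lower bound as a threshold (so 94.5% counts as a 4) are assumptions made for illustration.

```python
from collections import defaultdict

# Lower bounds (percent) for each score, taken from the rubric above.
# Assumption: fractional values between bands fall into the lower band.
SCORE_BANDS = [(95.0, 5), (85.0, 4), (70.0, 3), (50.0, 2)]


def accuracy_to_score(accuracy_pct):
    """Map an accuracy percentage to the 1-5 rubric score."""
    for lower_bound, score in SCORE_BANDS:
        if accuracy_pct >= lower_bound:
            return score
    return 1


def phoneme_accuracy_report(trials):
    """Return accuracy percentages overall, per pair category, and per speaker type.

    trials: iterable of (pair_category, speaker_type, is_correct) tuples
    (a hypothetical format, not prescribed by the methodology).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, attempted]
    for category, speaker, is_correct in trials:
        for key in ("overall", ("category", category), ("speaker", speaker)):
            totals[key][0] += int(is_correct)
            totals[key][1] += 1
    return {key: 100.0 * correct / attempted
            for key, (correct, attempted) in totals.items()}
```

Because a score of 5 requires 95%+ "across all categories and speaker types", an evaluator would likely inspect the per-category and per-speaker entries of the report before awarding it, not only the `"overall"` figure.
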
Dimension 2

Tajweed Rule Detection

Evaluates whether the tool can detect application or misapplication of specific tajweed rules during connected recitation. Mandatory rules: Ghunnah, Idghaam (with/without ghunnah), Ikhfaa (before 15 letters), Qalqalah (ق ط ب ج د at sukoon/waqf), Madd (natural 2, connected 4–5, necessary 6 counts), Laam tafkheem/tarqeeq in "Allah".

| Score | Criteria |
|---|---|
| 5 | Detects and names all 6 mandatory rules with 90%+ accuracy |
| 4 | Detects 5/6 rules reliably; occasional misses on subtle rules |
| 3 | Detects 3–4 rules; typically catches ghunnah and qalqalah |
| 2 | Detects 1–2 rules; mostly word-level error detection |
| 1 | No tajweed-specific detection; tool only identifies word substitutions |
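
One way to turn per-rule results into this score is to count how many mandatory rules the tool detects reliably. The sketch below is illustrative only. The rubric states a 90% threshold just for score 5, so reusing 90% as the meaning of "reliably" for every band is an assumption, as are the rule keys and the function name.

```python
# Short keys for the six mandatory rules listed above (hypothetical identifiers).
MANDATORY_RULES = ("ghunnah", "idghaam", "ikhfaa", "qalqalah", "madd", "laam")


def tajweed_detection_score(rule_accuracy, reliable_at=90.0):
    """Map per-rule detection accuracy (percent) to the 1-5 rubric score.

    Assumption: a rule counts as detected when its accuracy is at least
    `reliable_at`; rules missing from `rule_accuracy` count as undetected.
    """
    detected = sum(
        1 for rule in MANDATORY_RULES
        if rule_accuracy.get(rule, 0.0) >= reliable_at
    )
    if detected == 6:
        return 5
    if detected == 5:
        return 4
    if detected >= 3:
        return 3
    if detected >= 1:
        return 2
    return 1
```

The qualitative parts of the rubric (whether the tool names the rule, or only flags word-level errors) still need a human judgment that this count does not capture.
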
Dimension 3

Timing & Prosody

Assesses rhythmic and durational aspects: madd duration measurement, ghunnah duration, waqf/ibtidaa, and recitation speed classification (tahqeeq, tadweer, hadr). Tests use 10 verses with madd of different lengths, recorded at three speeds, with deliberately shortened/extended madd.

| Score | Criteria |
|---|---|
| 5 | Measures duration within 0.5 count; adapts to speed; identifies waqf appropriateness |
| 3 | Detects major timing errors but lacks precision on borderline cases |
| 1 | No timing analysis; tool only evaluates at word/phoneme level |
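
Checking the "within 0.5 count" criterion requires expressing durations in counts rather than seconds, and the length of a count depends on recitation speed. The sketch below shows one possible approach under stated assumptions: it estimates the count length for a recording from the reciter's natural (2-count) madd samples, which is a choice made here for illustration and not something the methodology prescribes. All function names are hypothetical.

```python
from statistics import median

TOLERANCE_COUNTS = 0.5  # score-5 criterion from the rubric above


def estimate_count_seconds(natural_madd_seconds):
    """Estimate the duration of one count for a given recording.

    Assumption: one count equals half the median natural (2-count) madd.
    """
    return median(natural_madd_seconds) / 2.0


def seconds_to_counts(duration_seconds, count_seconds):
    """Convert a measured duration into counts at the reciter's speed."""
    return duration_seconds / count_seconds


def share_within_tolerance(tool_counts, reference_counts,
                           tolerance=TOLERANCE_COUNTS):
    """Fraction of madd measurements within `tolerance` counts of the reference."""
    if len(tool_counts) != len(reference_counts) or not tool_counts:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        abs(tool - ref) <= tolerance
        for tool, ref in zip(tool_counts, reference_counts)
    )
    return hits / len(tool_counts)
```

Estimating the count length separately for each of the three recorded speeds is what lets the same tolerance apply to tahqeeq, tadweer, and hadr recordings.
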
Dimension 4

Feedback Quality

A tool's detection capability is only useful if translated into clear, actionable, pedagogically appropriate feedback. Criteria: Specificity (exact location, rule, error nature), Actionability (what to do differently), Pedagogical sensitivity (adapts to user level), Accuracy (correctly describes the issue).

| Score | Criteria |
|---|---|
| 5 | 90%+ feedback rated accurate and actionable by reviewing scholar; adapts to user level |
| 3 | Mostly accurate but often vague; user knows something is wrong but not what to fix |
| 1 | Generic feedback ("good"/"needs improvement") or frequently inaccurate |
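
The 90% figure in the top band is a share of feedback items that the reviewing scholar rated both accurate and actionable. A minimal sketch of that tally follows; the `FeedbackRating` record and the function names are hypothetical, and whether the tool "adapts to user level" is passed in as a separate judgment because the rubric gives no numeric test for it.

```python
from dataclasses import dataclass


@dataclass
class FeedbackRating:
    """One feedback item as rated by the reviewing scholar (hypothetical record)."""
    accurate: bool
    actionable: bool


def accurate_and_actionable_pct(ratings):
    """Percentage of feedback items rated both accurate and actionable."""
    if not ratings:
        raise ValueError("no ratings supplied")
    good = sum(1 for r in ratings if r.accurate and r.actionable)
    return 100.0 * good / len(ratings)


def meets_top_feedback_band(ratings, adapts_to_user_level):
    """Check the two score-5 conditions stated in the rubric above."""
    return accurate_and_actionable_pct(ratings) >= 90.0 and adapts_to_user_level
```
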
Dimension 5

Demographic Robustness

AI speech systems are biased toward the demographic composition of their training data. This dimension ensures global accessibility across Arabic dialects, South Asian, Southeast Asian, African, and Western backgrounds; all genders; children, adolescents, adults, and elderly speakers; beginner to advanced proficiency.

| Score | Criteria |
|---|---|
| 5 | No demographic group drops more than 5% below overall average; tested across 6+ language backgrounds |
| 3 | 1–2 demographic groups show 10–15% accuracy drops; tested across 3–4 language backgrounds |
| 1 | Large accuracy gaps (>20%) across demographics, or only tested on a single demographic group |
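
The drop of each demographic group below the overall average can be computed directly from per-group accuracy. The sketch below rests on two assumptions that the rubric leaves open: the percentages are read as percentage points, and the overall average defaults to the unweighted mean of the group accuracies unless a pooled figure is supplied. The function names are hypothetical.

```python
from statistics import mean


def group_drops(group_accuracy, overall=None):
    """Percentage-point drop of each group below the overall average.

    Groups at or above the average get 0.0. Assumption: when `overall`
    is not given, it is the unweighted mean of the group accuracies.
    """
    if overall is None:
        overall = mean(group_accuracy.values())
    return {group: max(0.0, overall - accuracy)
            for group, accuracy in group_accuracy.items()}


def meets_top_robustness_band(group_accuracy, language_backgrounds, overall=None):
    """Check the score-5 conditions: no drop above 5 points, 6+ backgrounds tested."""
    drops = group_drops(group_accuracy, overall)
    return max(drops.values()) <= 5.0 and language_backgrounds >= 6
```

Reporting the full dictionary of drops, not only the pass/fail result, shows which groups a tool underserves.
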

3. Limitations

4. Academic References

• Hakami et al. (2025). "A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation." arXiv:2510.12858.

• Atef et al. (2023). "Quran Recitation Recognition using End-to-End Deep Learning." arXiv:2305.07034.

• Al-Ayyoub et al. (2023). "Speech Recognition Models for Holy Quran Recitation." IJACSA, Vol. 14, No. 12.

• TajweedAI (2025). "A Hybrid ASR-Classifier for Real-Time Qalqalah Detection." NeurIPS 2025.

• Tarteel AI (2024–2025). ML Journey blog series.

License: Released under Creative Commons Attribution 4.0 (CC BY 4.0). Free to share, adapt, and build upon with appropriate credit: "QariAI Open Evaluation Methodology v1.0, qariai.app/open-methodology"
