# External Data Use Matrix for Tasreef

This matrix lists the external Arabic resources that are most useful for Tasreef and how to use each one.

## Quick Summary

| Resource | Best Role | Recommended Use |
|---|---|---|
| [UD Arabic-PADT](https://universaldependencies.org/treebanks/ar_padt/index.html) | Syntactic training | Main external syntax source |
| [UD Arabic-PUD](https://universaldependencies.org/treebanks/ar_pud/index.html) | Syntax test set | Hold-out comparison only |
| [QuranMorph](https://arxiv.org/abs/2506.18148) | Morphology / POS | Token-level morphology and closed-class reliability |
| [Quranic Arabic Corpus](https://corpus.quran.com/) | Classical Arabic syntax | Manual correctness checks and classical-style tests |
| [AraSEG](https://arxiv.org/abs/2606.08025) | Segmentation robustness | Sentence boundary / preprocessing stress tests |

## Detailed Matrix

| Resource | Size / Scope | Annotation Type | License / Access | Best Tasreef Layer | Fetch / Use Decision | Notes |
|---|---|---|---|---|---|---|
| [UD Arabic-PADT](https://universaldependencies.org/treebanks/ar_padt/index.html) | 7,664 sentences, 282,384 tokens; newswire | Manual morphosyntax converted to UD | CC BY-NC-SA 3.0 | Syntax | **Fetch and use for training** | Strongest external syntax source; useful for benchmarking Tasreef's syntactic head assignment and dependency-like behavior |
| [UD Arabic-PUD](https://universaldependencies.org/treebanks/ar_pud/index.html) | 1,000 sentences; news + wiki | Manually annotated UPOS/features/relations, converted to UD | CC BY-SA 3.0 | Syntax | **Fetch and hold out for testing** | Best used as a clean external test set; do not train on it, so the comparison stays strict |
| [QuranMorph](https://arxiv.org/abs/2506.18148) | 77,429 tokens | Manual lemma + POS by 3 expert linguists | Open on SinaLab | Morphology / POS | **Fetch and use for morphology checks** | Useful for closed-class items, fine-grained POS, and lemma reliability |
| [Quranic Arabic Corpus](https://corpus.quran.com/) | 77,430 words | Morphology, dependency syntax, semantic ontology | Public website; corpus-specific terms apply | Morphology + classical syntax | **Use selectively for testing and validation** | Best for comparing classical/grammar-style analysis and word-by-word structure |
| [AraSEG](https://arxiv.org/abs/2606.08025) | 8 genres; sentence segmentation corpus | Sentence segmentation under varied punctuation | Code/data/models reported as public in paper | Preprocessing / segmentation | **Fetch if available; use for preprocessing robustness** | Relevant because Tasreef should handle weak punctuation and segmentation noise before analysis |

## What to Fetch First

1. **UD Arabic-PADT**
   - Use as the primary external syntax source.
   - Train or pretrain syntactic components against it.

2. **QuranMorph**
   - Use to stabilize morphology and POS for classical and closed-class forms.
   - Good for checking ambiguous words and fine-grained tags.

3. **UD Arabic-PUD**
   - Use only as a held-out external syntax test.
   - Keeping it out of training keeps the evaluation honest.

4. **Quranic Arabic Corpus**
   - Use selectively for classical sentence tests and manual comparison.
   - Best for verifying grammar-style analyses rather than broad-domain statistics.

5. **AraSEG**
   - Use if you want a preprocessing benchmark for sentence segmentation.
   - Especially useful if you later expose raw-text input in the UI.
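
The two UD treebanks in the list above are distributed in the CoNLL-U format (ten tab-separated columns per token, blank line between sentences). The following is a minimal sketch of a reader that keeps only the form, UPOS, head, and relation of each word. The function name `read_conllu` and the `SAMPLE` sentence are illustrative assumptions; the sample is hand-written and is not taken from PADT or PUD.

```python
from typing import Iterable, Iterator, List, Tuple

# (form, upos, head, deprel) for one syntactic word
Token = Tuple[str, str, int, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of (form, upos, head, deprel) tuples."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as sent_id or text
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip lines that are not well-formed token rows
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            # Multiword-token ranges (1-2) and empty nodes (3.1) are not
            # part of the basic dependency tree, so they are skipped here.
            continue
        head = int(cols[6]) if cols[6].isdigit() else -1
        sentence.append((cols[1], cols[3], head, cols[7]))
    if sentence:
        yield sentence


# Hypothetical hand-written sample, not a real treebank sentence.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = وقال الرجل",
    "1-2\tوقال\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tو\tو\tCCONJ\t_\t_\t2\tcc\t_\t_",
    "2\tقال\tقال\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tالرجل\tرجل\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "",
])

for sent in read_conllu(SAMPLE.splitlines()):
    for form, upos, head, deprel in sent:
        print(form, upos, head, deprel)
```

Skipping the multiword-token range line matters for Arabic, where clitics such as the conjunction in the sample are split into separate syntactic words.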

## Layer Mapping

| Tasreef Layer | Best External Sources |
|---|---|
| Morphology | QuranMorph, Quranic Arabic Corpus |
| Syntax | UD Arabic-PADT, UD Arabic-PUD, Quranic Arabic Corpus |
| Pragmatics | Mostly internal/book-derived; external corpora are weak here |
| Structural | Mostly internal/book-derived; external corpora are weak here |
| Preprocessing / segmentation | AraSEG |

## Recommended Policy

- Use **UD Arabic-PADT** for syntax transfer and benchmark comparison.
- Use **QuranMorph** to harden morphology and closed-class handling.
- Use **UD Arabic-PUD** as an external hold-out set.
- Use **Quranic Arabic Corpus** for classical Arabic validation and manual spot checks.
- Use **AraSEG** for segmentation robustness if it is accessible under usable terms.
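
One way to make the policy above enforceable is to record it as data and check training configurations against it. This is a sketch under assumptions: the `Resource` class, the role labels (`"train"`, `"validate"`, `"holdout"`), and the helper names are hypothetical simplifications of the bullets above, not part of Tasreef.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resource:
    name: str
    layer: str
    role: str  # assumed labels: "train", "validate", or "holdout"


# Roles paraphrase the policy bullets; the labels themselves are assumptions.
POLICY = [
    Resource("UD Arabic-PADT", "syntax", "train"),
    Resource("QuranMorph", "morphology", "validate"),
    Resource("UD Arabic-PUD", "syntax", "holdout"),
    Resource("Quranic Arabic Corpus", "morphology+syntax", "validate"),
    Resource("AraSEG", "segmentation", "validate"),
]


def names_with_role(role: str) -> List[str]:
    """Return the names of all resources assigned the given role."""
    return [r.name for r in POLICY if r.role == role]


def check_training_set(names: List[str]) -> None:
    """Raise ValueError if any requested resource is not marked for training."""
    allowed = set(names_with_role("train"))
    blocked = [n for n in names if n not in allowed]
    if blocked:
        raise ValueError(f"not allowed in training: {blocked}")


check_training_set(["UD Arabic-PADT"])  # passes silently
print(names_with_role("holdout"))
```

Calling `check_training_set` with `"UD Arabic-PUD"` in the list raises a `ValueError`, which gives the hold-out rule a mechanical check instead of relying on convention.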

## What Not to Do

- Do not treat external corpora as a replacement for the book reference layer.
- Do not mix held-out test resources into training.
- Do not expect external corpora to cover the pragmatic or structural layer directly.
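
The rule against mixing held-out data into training can also be checked at the sentence level. The sketch below flags held-out sentences whose text also appears in the training data after light normalization. The function names and example sentences are hypothetical, and exact-match comparison after normalization is an assumed, deliberately simple criterion; it will not catch near-duplicates.

```python
import unicodedata
from typing import Iterable, Set

TATWEEL = "\u0640"  # Arabic elongation character


def normalize(sentence: str) -> str:
    """Drop combining marks (including Arabic short-vowel diacritics) and
    tatweel, then collapse runs of whitespace to single spaces."""
    no_marks = "".join(ch for ch in sentence if not unicodedata.combining(ch))
    return " ".join(no_marks.replace(TATWEEL, "").split())


def leaked(train: Iterable[str], held_out: Iterable[str]) -> Set[str]:
    """Return held-out sentences whose normalized text occurs in train."""
    train_keys = {normalize(s) for s in train}
    return {s for s in held_out if normalize(s) in train_keys}


# Hypothetical example sentences, not drawn from any of the listed corpora.
train_sentences = ["ذَهَبَ الولد", "كتب الدرس"]
held_out_sentences = ["ذهب  الولد", "قرأ الكتاب"]

overlap = leaked(train_sentences, held_out_sentences)
print(len(overlap))  # number of held-out sentences also seen in training
```

Normalizing before comparison matters because the same sentence can appear vocalized in one corpus and unvocalized in another.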

## Paper Framing

The external resources strengthen Tasreef in the layers where the book-derived system is most vulnerable when transferred to new text:

- syntax generalization
- morphology and closed-class stability
- tokenization / segmentation robustness

But the book remains the source of truth for:

- the sentence-position template
- pragmatic placement
- structural layer assignment
