# Tasreef Paper Evidence List

Date: 2026-06-19

This file collects the concrete evidence to cite under each paper section.

## 1. Introduction Evidence

- System type: hybrid Arabic functional-syntax analyzer
- Core claim: rule-based reference layer + weakly supervised baselines
- User-facing output: category, slot, semantic, syntactic, pragmatic, structural labels + Arabic summary

Source docs:
- `functional_syntax/docs/paper_prep_master_notes.md`
- `functional_syntax/docs/paper_publication_summary.md`
- `functional_syntax/docs/paper_paragraph_starters.md`

## 2. Related Work Evidence

Use these as the main comparison anchors:

- Arabic morphosyntactic tagging with pretrained language models
- adversarial multitask morphological modeling
- self-training for Arabic sequence labeling
- Arabic morphosyntactic tagging and dependency parsing with LLMs
- QuranMorph
- UD Arabic-PADT
- UD Arabic-PUD
- syntax-aware semantic role labeling

Source docs:
- `functional_syntax/docs/related_research_map.md`

## 3. Book Basis / Functional Scheme Evidence

- The book-derived analyzer matches the grouped book corpus exactly.
- The book coverage is strong for:
  - fronting
  - exception
  - coordination
  - interrogation
  - causative
  - core clause families
- The main remaining book-granularity points are:
  - f7 severed adjective
  - f8 severed coordination
  - f9 embedded / attached clauses
  - f10 external field / `الرَّبَض`
  - f16 pragmatic implication
- Most concrete chapter examples for these areas are already promoted; the remaining gap is theory granularity, not missing coverage.

Source docs:
- `functional_syntax/docs/book_coverage_matrix.md`
- `functional_syntax/docs/book_chapter_reread_notes.md`

## 4. System Architecture Evidence

- Rule-based reference analyzer is the source of truth for structure.
- Learned baselines are trained on analyzer-derived weak labels.
- UI includes Arabic summaries, certainty labels, slot legend, and batch analysis.

Source docs:
- `functional_syntax/docs/paper_publication_summary.md`
- `functional_syntax/docs/paper_outline_bullets.md`
- `functional_syntax/docs/paper_paragraph_starters.md`

## 5. Data Construction Evidence

### Generated corpus

- Total sentences: 9,200
- Final-category rate: 0.885
- Keep / augmentation: 8,401
- Review / cleanup: 399
- Discard: 400
- Patterns: 23
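
The triage counts above can be sanity-checked before they are quoted in the paper. The following is a minimal sketch, not part of the project code; the variable names are illustrative, and the numbers are copied from the list above. It checks only that the three buckets add up to the total and prints each bucket's share. It does not attempt to reproduce the final-category rate of 0.885, which is a separate measure.

```python
# Counts copied from the "Generated corpus" list; names are illustrative.
total_sentences = 9_200
buckets = {"keep": 8_401, "review": 399, "discard": 400}

# The three triage buckets should partition the corpus.
assert sum(buckets.values()) == total_sentences

# Share of the corpus in each bucket, rounded to four places.
shares = {name: round(count / total_sentences, 4) for name, count in buckets.items()}
for name, share in shares.items():
    print(f"{name}: {share}")
```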

### Training-ready export

- Input rows: 8,401
- Accepted rows: 8,142
- Rejected rows: 259
- Acceptance rate: 0.969
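
The acceptance rate follows directly from the row counts. A short check, with illustrative variable names and the numbers taken from the list above:

```python
# Row counts copied from the "Training-ready export" list.
input_rows = 8_401
accepted_rows = 8_142
rejected_rows = 259

# Accepted and rejected rows should account for every input row.
assert accepted_rows + rejected_rows == input_rows

acceptance_rate = accepted_rows / input_rows
print(f"{acceptance_rate:.3f}")  # 0.969
```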

### Train/dev/test split

- Train sentences: 6,514
- Dev sentences: 814
- Test sentences: 814
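
The split sizes sum to the 8,142 accepted rows and work out to roughly 80/10/10. A small check, again with illustrative names:

```python
# Split sizes copied from the list above; accepted_rows comes from the export counts.
accepted_rows = 8_142
split = {"train": 6_514, "dev": 814, "test": 814}

# Every accepted row should land in exactly one split.
assert sum(split.values()) == accepted_rows

proportions = {name: n / accepted_rows for name, n in split.items()}
for name, p in proportions.items():
    print(f"{name}: {p:.3f}")  # train: 0.800, dev: 0.100, test: 0.100
```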

Source docs:
- `functional_syntax/docs/generated_reference_eval.md`
- `functional_syntax/docs/generated_reference_training.md`
- `functional_syntax/docs/generated_reference_workflow.md`
- `functional_syntax/docs/generated_reference_split.md`

## 6. Evaluation Evidence

### Reference layer

- Grouped book corpus: exact
- Frozen 179-sentence unseen bank: exact category match

### Sentence-level baseline

- Train accuracy: 0.9999
- Dev accuracy: 0.9226
- Test accuracy: 0.9155
- Test macro F1: 0.916
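
For readers of the paper, it may help to state how macro F1 differs from accuracy. The sketch below uses the standard definition (the unweighted mean of per-class F1) on toy data. It is not the project's evaluation script: the class names `a` and `b` are placeholders, and averaging over the union of gold and predicted labels is an assumption about the convention used.

```python
from collections import Counter
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with placeholder classes, not project data.
toy_gold = ["a", "a", "b", "b"]
toy_pred = ["a", "b", "b", "b"]
toy_accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
toy_macro_f1 = macro_f1(toy_gold, toy_pred)
print(toy_accuracy, round(toy_macro_f1, 4))
```

Because each class counts equally, macro F1 staying close to accuracy suggests that performance is not being carried by a few frequent categories.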

### Token-level baseline

- Train semantic accuracy: 0.9971
- Dev semantic accuracy: 0.9192
- Test semantic accuracy: 0.9214
- Test syntactic accuracy: 0.9208
- Test pragmatic accuracy: 0.9469
- Test joint sequence accuracy: 0.7738
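
The gap between the per-layer accuracies and the joint sequence accuracy is worth explaining in the paper. The sketch below assumes that joint sequence accuracy counts a sentence as correct only when every token is correct on all three layers; that definition is an assumption and should be confirmed against `generated_reference_training.md`. The layer names mirror the list above, and the labels `X`, `Y`, `Z` are placeholders.

```python
from typing import Dict, List

Token = Dict[str, str]       # one label per layer
Sentence = List[Token]
LAYERS = ("semantic", "syntactic", "pragmatic")


def layer_accuracy(gold: List[Sentence], pred: List[Sentence], layer: str) -> float:
    """Token-level accuracy for a single layer."""
    correct = total = 0
    for g_sent, p_sent in zip(gold, pred):
        for g_tok, p_tok in zip(g_sent, p_sent):
            total += 1
            correct += g_tok[layer] == p_tok[layer]
    return correct / total


def joint_sequence_accuracy(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Fraction of sentences with every token correct on every layer (assumed definition)."""
    hits = 0
    for g_sent, p_sent in zip(gold, pred):
        if all(g[layer] == p[layer] for g, p in zip(g_sent, p_sent) for layer in LAYERS):
            hits += 1
    return hits / len(gold)


def tok(sem: str, syn: str, prag: str) -> Token:
    return {"semantic": sem, "syntactic": syn, "pragmatic": prag}


# Toy data: two sentences of two tokens; one semantic error in the second sentence.
toy_gold = [[tok("X", "Y", "Z"), tok("X", "Y", "Z")], [tok("X", "Y", "Z"), tok("X", "Y", "Z")]]
toy_pred = [[tok("X", "Y", "Z"), tok("X", "Y", "Z")], [tok("Q", "Y", "Z"), tok("X", "Y", "Z")]]

print(layer_accuracy(toy_gold, toy_pred, "semantic"))   # 0.75
print(joint_sequence_accuracy(toy_gold, toy_pred))      # 0.5
```

Under this definition a single wrong label on any token fails the whole sentence, so a joint score well below the per-layer scores is expected rather than a sign of inconsistency.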

### User stress test

- 100 scenarios tested
- Good: 100
- Suspicious: 0
- Bad: 0

Source docs:
- `functional_syntax/docs/paper_publication_summary.md`
- `functional_syntax/docs/generated_reference_training.md`
- `functional_syntax/docs/user_scenario_test_suite_100_eval.md`

## 7. Error Analysis Evidence

Main patterns found in the generated audit:

- helper / fallback hits remain concentrated in a few families
- the large, difficult families were:
  - negation variants
  - restriction variants
  - passive
  - wh / choice questions
- most remaining errors were due to:
  - precedence
  - normalization
  - theory-granularity distinctions

Source docs:
- `functional_syntax/docs/generated_reference_eval.md`
- `functional_syntax/docs/book_chapter_reread_notes.md`

## 8. Discussion / Limitations Evidence

- f7 / f8 / f9 / f10 / f16 are the main theory-granularity areas
- the pragmatic layer is best described as a proxy, not full discourse modeling
- the token model is weakly supervised, not gold supervised
- the runtime structural layer is practical, not a complete formalization of every book distinction

Source docs:
- `functional_syntax/docs/book_chapter_reread_notes.md`
- `functional_syntax/docs/book_coverage_matrix.md`
- `functional_syntax/docs/paper_prep_master_notes.md`

## 9. External Data Evidence

Potential comparison / extension corpora:

- UD Arabic-PADT
- UD Arabic-PUD
- QuranMorph
- Quranic Arabic Corpus
- AraSEG

Source docs:
- `functional_syntax/docs/external_data_use_matrix.md`
- `functional_syntax/docs/external_data_download_plan.md`
- `functional_syntax/docs/external_data_fetch_checklist.md`
- `functional_syntax/docs/external_syntax_corpus.md`

## 10. Suggested Paper Claims

Use these claims as worded; each is safe to state exactly as written:

- Tasreef is a hybrid Arabic functional-syntax system.
- The reference layer is book-grounded.
- The learned baselines are trained on weak labels.
- The main remaining gaps are theory granularity and robustness, not the absence of a core analyzer.

Source docs:
- `functional_syntax/docs/paper_prep_master_notes.md`
- `functional_syntax/docs/paper_outline_bullets.md`
- `functional_syntax/docs/paper_paragraph_starters.md`