# Functional Syntax Publication Bundle

This document captures the current paper-ready state of the project.

## Reference Layer

- The book-derived functional syntax analyzer is stable on the grouped book corpus.
- The final evaluation on the grouped book corpus is exact.
- The weak queue is empty.
- The live unseen bank is frozen at 179 sentences, and category match on it is exact.

## Generated Corpus

- Generated reference corpus size: 9,200 sentences
- Final-category rate: 0.885
- Keep / augmentation: 8,401 sentences
- Review / cleanup: 399 sentences
- Discard: 400 sentences
- Train-ready export: `functional_syntax/data/generated_reference_train_ready.tsv`
- Train/dev/test split:
  - `functional_syntax/data/generated_reference_train.tsv`
  - `functional_syntax/data/generated_reference_dev.tsv`
  - `functional_syntax/data/generated_reference_test.tsv`
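
The three triage buckets above should account for the whole generated corpus. The following is a minimal sketch of that sanity check, using only the counts listed in this section; the names `buckets` and `bucket_shares` are illustrative and are not part of the project code.

```python
# Counts copied from the "Generated Corpus" list; names are hypothetical.
TOTAL = 9_200
buckets = {"keep": 8_401, "review": 399, "discard": 400}


def bucket_shares(counts, total):
    """Return each bucket's share of the corpus, after checking the counts add up."""
    if sum(counts.values()) != total:
        raise ValueError("bucket counts do not sum to the corpus size")
    return {name: n / total for name, n in counts.items()}


shares = bucket_shares(buckets, TOTAL)
for name, share in shares.items():
    print(f"{name:8s} {share:.4f}")
```

The counts sum to 9,200, and the keep bucket covers roughly 91.3% of the corpus. The final-category rate of 0.885 is a separate figure, and this sketch does not attempt to reproduce it.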

## Sentence-Level Baseline

- Model: `functional_syntax/models/generated_reference_sentence_classifier.pt`
- Train accuracy: 0.9999
- Dev accuracy: 0.9226
- Test accuracy: 0.9155
- Test macro F1: 0.916
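
For readers comparing the two test metrics, here is a minimal sketch of how accuracy and macro F1 are conventionally computed for a single-label classifier. It assumes the standard definitions (macro F1 as the unweighted mean of per-class F1); the project's own evaluation script is not shown in this document, and the labels `A`, `B`, `C` are toy placeholders rather than real sentence categories.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with hypothetical labels, for illustration only.
gold = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "C", "A"]
print(f"accuracy={accuracy(gold, pred):.4f} macro_f1={macro_f1(gold, pred):.4f}")
```

Because macro F1 weights every class equally, it can diverge from accuracy when classes are imbalanced; the near-identical values reported above (0.9155 and 0.916) show no such divergence on the test split.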

## Token-Level Baseline

- Token corpus:
  - `functional_syntax/data/generated_reference_token_train.tsv`
  - `functional_syntax/data/generated_reference_token_dev.tsv`
  - `functional_syntax/data/generated_reference_token_test.tsv`
- Model: `functional_syntax/models/generated_reference_token_bilstm_crf.pt`
- Train semantic accuracy: 0.9971
- Dev semantic accuracy: 0.9192
- Test semantic accuracy: 0.9214
- Test syntactic accuracy: 0.9208
- Test pragmatic accuracy: 0.9469
- Test joint sequence accuracy: 0.7738
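
The gap between the per-layer accuracies and the joint sequence accuracy is easier to read with the metrics written out. The sketch below assumes that per-layer accuracy is counted over tokens and that joint sequence accuracy counts a sentence as correct only when every token is right on all three layers. This document does not define the metrics, so both definitions are assumptions, and the tag values are toy placeholders.

```python
# Assumed layout: a sentence is a list of (semantic, syntactic, pragmatic) tuples.
LAYERS = ("semantic", "syntactic", "pragmatic")


def layer_token_accuracy(gold, pred, layer):
    """Token-level accuracy for one label layer, pooled over all sentences."""
    idx = LAYERS.index(layer)
    correct = total = 0
    for g_sent, p_sent in zip(gold, pred):
        for g_tok, p_tok in zip(g_sent, p_sent):
            correct += g_tok[idx] == p_tok[idx]
            total += 1
    return correct / total


def joint_sequence_accuracy(gold, pred):
    """Share of sentences where every token matches on every layer."""
    exact = sum(g_sent == p_sent for g_sent, p_sent in zip(gold, pred))
    return exact / len(gold)


# Toy data with hypothetical tags: one semantic error in the second sentence.
gold = [
    [("s1", "x1", "p1"), ("s2", "x2", "p1")],
    [("s1", "x1", "p2"), ("s3", "x2", "p1"), ("s2", "x3", "p1")],
]
pred = [
    [("s1", "x1", "p1"), ("s2", "x2", "p1")],
    [("s1", "x1", "p2"), ("s2", "x2", "p1"), ("s2", "x3", "p1")],
]
for layer in LAYERS:
    print(layer, layer_token_accuracy(gold, pred, layer))
print("joint", joint_sequence_accuracy(gold, pred))
```

Under these definitions a single wrong tag fails the whole sentence, so a joint score well below the per-layer token scores is expected rather than contradictory.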

## Narrative for Paper

The current paper story is:

1. A book-grounded reference analyzer provides stable 4-layer outputs.
2. The generated corpus is validated and split into training-ready data.
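3. A sentence-family baseline shows the weak labels are learnable (test accuracy 0.9155, macro F1 0.916).
4. A token-level BiLSTM-CRF baseline learns semantic, syntactic, and pragmatic labels with test accuracies of 0.9214, 0.9208, and 0.9469 respectively; joint sequence accuracy is lower, at 0.7738.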
5. Structural placement remains rule-derived from the book template.

## Publication Summary

- `functional_syntax/docs/paper_prep_master_notes.md`
- `functional_syntax/docs/paper_outline_bullets.md`
- `functional_syntax/docs/paper_paragraph_starters.md`
- `functional_syntax/docs/paper_evidence_list.md`
- `functional_syntax/docs/analyzer_one_page_note.md`
- `functional_syntax/docs/wazan_tashkeel_clue_table.md`
- `functional_syntax/docs/paper_publication_summary.md`
- `functional_syntax/data/paper_publication_summary.json`
- `functional_syntax/docs/paper_methods_results_draft.md`
- `functional_syntax/docs/related_research_map.md`
- `functional_syntax/docs/external_data_use_matrix.md`
- `functional_syntax/docs/external_data_download_plan.md`
- `functional_syntax/docs/external_data_fetch_checklist.md`
- `functional_syntax/docs/external_data_import_guide.md`
- `functional_syntax/docs/external_syntax_corpus.md`
- `functional_syntax/docs/paper_publication_package.zip`

## Recommended Claim Scope

- Safe claim: a hybrid Arabic functional syntax system combining rule-based structure with learned semantic/syntactic/pragmatic labeling.
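- Avoid claiming the token model is gold-supervised; it is trained on analyzer-derived weak labels, so its reported accuracies are best read as agreement with the analyzer rather than with gold annotation.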
- Keep the book corpus as the reference layer and the generated corpus as augmentation.
