Restarting the project.

Fixing unit tests. Continuing SyaParser
2026-04-12 09:40:04 +02:00
parent 3be854d34c
commit 078d8e5df6
15 changed files with 2351 additions and 1290 deletions
# SyaConceptsParser
## Purpose
`SyaConceptsParser` parses **sequences of concepts with parameters** (variables).
It complements `SimpleConceptsParser`, which only handles parameter-less concepts.
Examples of recognized concepts:
- `a plus b` → matches `1 plus 2`, `x plus y`, etc.
- `if a then b end` → matches `if x > 0 then print x end`
- `a long named concept b` → matches `1 long named concept 2`
The primary goal is **concept composition**: `1 plus 2 times 3`, where `times` must
be evaluated before `plus`. This precedence problem is what the Shunting Yard
Algorithm solves.
---
## The Shunting Yard Algorithm (SYA)
Dijkstra's algorithm (1961) converts an infix expression (`1 + 2 * 3`) into
**Reverse Polish Notation** (RPN: `1 2 3 * +`), respecting operator precedence.
### Principle
Two structures: an **operator stack** and an **output queue**.
```
Input: 1 + 2 * 3
Token │ Action                Stack    Output
──────┼───────────────────────────────────────────────
1     │ → output queue        []       [1]
+     │ → stack               [+]      [1]
2     │ → output queue        [+]      [1, 2]
*     │ prec(*) > prec(+)
      │ → stack (no pop)      [+, *]   [1, 2]
3     │ → output queue        [+, *]   [1, 2, 3]
end   │ flush stack           []       [1, 2, 3, *, +]
RPN: 1 2 3 * + ≡ 1 + (2 * 3)
```
### Pop rule
When pushing operator `op`, first pop any stack-top operator `top` where:
`precedence(top) >= precedence(op)`
This ensures higher-precedence operators are evaluated first.
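As a reference point, the classic algorithm with this pop rule fits in a few lines. This is a generic sketch with an illustrative operator set, not Sheerka code:

```python
# Minimal shunting yard sketch. The operator set and precedences below are
# illustrative only; Sheerka's concepts replace these atomic operators.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    """Convert an infix token list to RPN using the pop rule above."""
    stack, output = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop every stacked operator with precedence >= the incoming one.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operand: straight to the output queue
    while stack:
        output.append(stack.pop())  # end of input: flush the stack
    return output
```

Running it on the trace above, `to_rpn(["1", "+", "2", "*", "3"])` reproduces the RPN column: `["1", "2", "3", "*", "+"]`.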
---
## Sheerka's Adaptation
The original SYA works on **atomic tokens** (digits, `+`, `*`).
Sheerka adapts it for **concepts** that:
1. **Are identified by multiple tokens** — a concept like `if a then b end` has
several keywords (`if`, `then`, `end`) interleaved with parameters.
The original SYA identifies an operator with a single token.
2. **Can have N parameters** — a binary operator has exactly 2 operands.
A Sheerka concept can have 0, 1, 2 or more parameters.
3. **Parameters can themselves be concepts** — in `1 plus 2 times 3`, the parameter
`b` of `plus` is the result of the `times` concept. This recursion is handled
by the nested workflow structure.
### SYA ↔ Sheerka mapping
| Original SYA | Sheerka |
|---|---|
| Operator (`+`, `*`) | `ConceptToRecognize` (concept with parameters) |
| Operand (number, variable) | `UnrecognizedToken` or `ConceptToken` |
| Operator stack | `state_context.stack` |
| Output queue | `state_context.parameters` |
| Precedence rule | `InitConceptParsing.must_pop()` |
| RPN result | `MetadataToken` in `state_context.result` |
### Structural differences
**Multi-token recognition** — where SYA reads a single token to identify `*`,
Sheerka must read `long named concept` (3 tokens) to identify concept
`a long named concept b`. The `ReadConcept` state handles this sequential reading.
**The `expected` structure** — concept `if a then b end` is decomposed into segments:
```
[("if ", 0),    (" then ", 1),    (" end", 1)]
 ──────────     ─────────────     ───────────
 keyword        keyword           keyword
 0 params       1 param           1 param
 before         before            before
```
Each segment states how many parameters precede it and which tokens to consume
to validate it.
**Precedence not yet implemented** — `must_pop()` always returns `False`.
Concept composition with precedence rules is the next implementation step.
---
## Architecture
### Two interdependent workflows
```mermaid
graph TD
A[#tokens_wkf] -->|concept keyword found - fork| B[#concept_wkf]
A -->|token not a concept keyword - buffered, loop| A
B -->|concept fully parsed| A
A -->|EOF| C[end]
```
The parser always starts in `#tokens_wkf`. Tokens that do not match any concept
keyword are accumulated in a buffer and the loop continues. Whenever a token
matching the first keyword of a known concept is detected, a **fork** is created
and sent into `#concept_wkf` — the main path keeps looping independently. Once the
concept is recognized, the fork returns to `#tokens_wkf` to continue reading.
---
## Workflow `#tokens_wkf`
```mermaid
stateDiagram-v2
[*] --> start
start --> prepare_read_tokens
prepare_read_tokens --> read_tokens
read_tokens --> read_tokens : no concept found (loop)
read_tokens --> eof : EOF
read_tokens --> concepts_found : concept keyword detected (fork)
eof --> end : ManageUnrecognized
concepts_found --> concept_wkf : ManageUnrecognized → #concept_wkf
end --> [*]
```
**`PrepareReadTokens`**: initializes the buffer and records `buffer_start_pos`.
**`ReadTokens`**: reads one token, calls `get_metadata_from_first_token`. If a concept
can start at this token → **fork** with a cloned context where `concept_to_recognize`
is set. The main path continues scanning.
**`ManageUnrecognized("concepts found")`**: processes the buffer accumulated before
the keyword (via `SimpleConceptsParser`). Unrecognized tokens become
`UnrecognizedToken` and are added to `parameters`.
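The buffer-and-fork loop described above can be sketched as a plain generator. This is a hypothetical simplification: the real parser clones a full `StateMachineContext` and drives the clone through workflow states rather than yielding tuples.

```python
def scan(tokens, first_keywords):
    """Yield (buffer_before_keyword, keyword_pos) at each fork point,
    then (remaining_buffer, None) at EOF.  Illustrative only."""
    buffer = []
    for pos, tok in enumerate(tokens):
        if tok in first_keywords:
            yield list(buffer), pos   # fork: a clone enters #concept_wkf here
        buffer.append(tok)            # the main path keeps buffering
    yield list(buffer), None          # EOF: buffer goes to ManageUnrecognized
```

For `"1 plus 2"` with `{"plus"}` as the known first keywords, the first yield is the fork at position 2 with buffer `["1", " "]`, and the final yield carries the whole token list for EOF handling.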
---
## Workflow `#concept_wkf`
```mermaid
stateDiagram-v2
[*] --> start
start --> init_concept_parsing
init_concept_parsing --> manage_parameters
manage_parameters --> read_concept
read_concept --> read_parameters : more segments
read_concept --> finalize_concept : all segments done
read_concept --> token_mismatch : token mismatch
read_concept --> error_eof : unexpected EOF
read_parameters --> manage_parameters : loop
read_parameters --> finalize_concept : EOF
finalize_concept --> tokens_wkf : #tokens_wkf
token_mismatch --> end
error_eof --> end
end --> [*]
```
**`InitConceptParsing`**:
- Verifies the number of already-collected parameters is sufficient
- Removes the first token of segment 0 (already consumed by `ReadTokens`)
- Applies SYA: pushes the concept onto the stack
**`ReadConcept`**: reads the fixed tokens of the current segment one by one.
If all match → `pop(0)` the segment and continue.
**`ReadParameters`**: reads ONE token into the buffer. Returns to
`ManageUnrecognized` which tries to recognize it via `SimpleConceptsParser`.
**`FinalizeConceptParsing`**:
- Pops the concept from the stack
- Computes `start` (from the first parameter) and `end` (current position)
- Creates `MetadataToken(concept.metadata, start, end, resolution_method, "sya")`
- Clears stack and parameters
- Returns to `#tokens_wkf`
---
## Step-by-step example — `"1 plus 2"`
Concept: `a plus b` (variables `a`, `b`).
**Tokens:**
```
pos :  0    1     2      3    4    5
tok : "1"  " "  "plus"  " "  "2"  EOF
```
**`expected` for this concept:**
```
[([" ", "plus", " "], 1), ([], 1)]
segment 0 → 1 param before, read " plus "
segment 1 → 1 param before, read nothing (concept ends with a param)
```
**Execution trace:**
```
PrepareReadTokens → buffer_start_pos = 0
ReadTokens "1" → no concept, buffer = ["1"]
ReadTokens " " → no concept, buffer = ["1", " "]
ReadTokens "plus" → concept "a plus b" found!
┌── FORK ────────────────────────────────────────────────────────────┐
│ clone: buffer=["1"," "], pos=2, concept_to_recognize=CTR(a_plus_b) │
└────────────────────────────────────────────────────────────────────┘
ManageUnrecognized("concepts found")
buffer = ["1"," "] → SimpleConceptsParser → not found
parameters = [UnrecognizedToken("1 ", start=0, end=1)]
buffer_start_pos = 3
→ #concept_wkf
InitConceptParsing
expected[0] = ([" ","plus"," "], 1)
need 1 param → have 1 ✓
strip leading WS → ["plus"," "]
pop "plus" (already consumed) → [" "]
SYA: stack = [CTR(a_plus_b)]
ManageUnrecognized("manage parameters"): buffer empty → nothing
ReadConcept: reads [" "] → pos 3 = " " ✓
expected.pop(0) → remaining = [([], 1)]
→ "read parameters"
ReadParameters: reads "2" at pos 4
buffer = ["2"]
→ "manage parameters"
ManageUnrecognized("manage parameters")
buffer = ["2"] → not a concept
parameters = [UT("1 ", 0, 1), UT("2", 3, 3)]
buffer_start_pos = 5
ReadConcept: expected = [([], 1)], reads 0 tokens
expected.pop(0) → empty → "finalize concept"
FinalizeConceptParsing
concept = stack.pop() = CTR(a_plus_b)
start = parameters[0].start = 0
end = parser_input.pos = 4
result.append(MetadataToken(metadata, 0, 4, "key", "sya"))
→ #tokens_wkf
ReadTokens → EOF → ManageUnrecognized("eof") → end
```
**Result:**
```
MultipleChoices([
[MetadataToken(id="1001", start=0, end=4, resolution_method="key", parser="sya")]
])
```
---
## Example — sequence `"1 plus 2 3 plus 7"`
Same concept `a plus b`. The parser recognizes two concepts in one pass.
```
pos :  0    1     2      3    4    5    6    7     8      9   10   11
tok : "1"  " "  "plus"  " "  "2"  " "  "3"  " "  "plus"  " "  "7"  EOF
```
After `FinalizeConceptParsing` for the first concept (pos=4), `#tokens_wkf` restarts:
```
PrepareReadTokens → buffer_start_pos = 5
ReadTokens " " → buffer = [" "]
ReadTokens "3" → buffer = [" ","3"]
ReadTokens " " → buffer = [" ","3"," "]
ReadTokens "plus" → fork
ManageUnrecognized → UT(" 3 ", start=5, end=7), buffer_start_pos=9
...
FinalizeConceptParsing
start = 5, end = 10
result.append(MetadataToken(1001, 5, 10, "key", "sya"))
```
**Final result (one path, two concepts):**
```
MultipleChoices([
[
MetadataToken(1001, start=0, end=4, parser="sya"),
MetadataToken(1001, start=5, end=10, parser="sya"),
]
])
```
---
## Future example — composition `"1 plus 2 times 3"`
> **Note:** this example requires implementing `must_pop()`.
> Currently `must_pop()` always returns `False`.
Concepts: `a plus b` (low precedence), `a times b` (high precedence).
**Expected behavior after implementation:**
```
Expression: 1 plus 2 times 3
SYA with precedence times > plus:
Token "1" → parameters = [1] stack = []
Token "plus" → stack = [plus] parameters = [1]
Token "2" → parameters = [1, 2] stack = [plus]
Token "times" → prec(times) > prec(plus) → no pop
stack = [plus, times] parameters = [1, 2]
Token "3" → parameters = [1, 2, 3] stack = [plus, times]
Finalize:
pop "times" → MetadataToken(times, params=[2, 3])
pop "plus" → MetadataToken(plus, params=[1, times_result])
```
**What `must_pop()` must implement:**
```python
def must_pop(self, current_concept, top_of_stack_concept):
return precedence(top_of_stack_concept) >= precedence(current_concept)
```
Without this rule, both concepts are processed left-to-right with equal precedence,
yielding `(1 plus 2) times 3` instead of `1 plus (2 times 3)`.
---
## The `expected` structure in detail
For concept `if a then b end` (key `"if __var__0 then __var__1 end"`):
```
_get_expected_tokens("if __var__0 then __var__1 end")
→ [
(["if", " "], 0), # read "if " before 1st param
([" ", "then", " "], 1), # read " then " before 2nd param
([" ", "end"], 1), # read " end" — 1 param before, no trailing param
]
```
During parsing, `expected` is **modified in place**:
- `InitConceptParsing` removes the first token of segment 0 (already read by `ReadTokens`)
- `ReadConcept` consumes the tokens of the current segment then calls `pop(0)`
- When `expected` is empty → `FinalizeConceptParsing`
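A plausible reconstruction of the splitting step, matching the segment shapes shown above. This is hypothetical: the real `_get_expected_tokens` in Sheerka may tokenize differently.

```python
import re

def get_expected_tokens(key):
    """Split a concept key into (tokens, params_before) segments.
    Hypothetical reconstruction of Sheerka's _get_expected_tokens."""
    def tokenize(s):
        # Keywords and single spaces become separate tokens: "if " -> ["if", " "]
        return re.findall(r" |[^ ]+", s)

    parts = re.split(r"__var__\d+", key)   # keyword runs between variables
    segments = []
    if parts[0]:                           # key starts with a keyword run
        segments.append((tokenize(parts[0]), 0))
    for part in parts[1:]:                 # every later run follows one param
        segments.append((tokenize(part), 1))
    return segments
```

Note the trailing empty part produced by `re.split` when the key ends with a variable: it becomes the final `([], 1)` segment, exactly as in the `a plus b` example earlier.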
---
## Key data structures
### `StateMachineContext`
```
StateMachineContext
├── parser_input ParserInput token stream + cursor
├── other_parsers [SimpleConceptsParser]
├── buffer list[Token] tokens pending classification
├── buffer_start_pos int start position of the current buffer
├── concept_to_recognize ConceptToRecognize | None
├── stack list[CTR] SYA — operator stack
├── parameters list[UT|CT] SYA — output queue
├── result list[MetadataToken]
└── errors list
```
### `MetadataToken` (output)
```
MetadataToken
├── metadata ConceptMetadata (id, name, key, variables, ...)
├── start int position of the first token of the expression
├── end int position of the last token
├── resolution_method "key" | "name" | "id"
└── parser "sya"
```
### Token positions in `"1 plus 2"`:
```
"1 plus 2"

tok : "1"  " "  "plus"  " "  "2"
pos :  0    1     2      3    4

MetadataToken: start=0, end=4
```
---
## Differences vs `SimpleConceptsParser`
| | `SimpleConceptsParser` | `SyaConceptsParser` |
|---|---|---|
| Target concepts | No parameters | With parameters |
| `concept_wkf` states | 2 | 8 |
| `result` contents | `MetadataToken` + `UnrecognizedToken` | `MetadataToken` only |
| Parameters | N/A | Collected in `parameters` list |
| Parser tag | `"simple"` | `"sya"` |
| SYA | No | Yes (precedence to implement) |
---
## Error handling
| Error | Cause | State reached |
|---|---|---|
| `UnexpectedToken` | Read token ≠ expected concept token | `TokenMismatch` → `end` |
| `UnexpectedEof` | Input ends before concept is complete | `ErrorEof` → `end` |
| `NotEnoughParameters` | Too few params before a segment | Exception raised |
Errors are collected from **all paths** and forwarded to `error_sink` in `parse()`.
A path with errors is excluded from `_select_best_paths`.
---
## Known limitations and proposed improvements
The current implementation correctly handles simple cases (single-token parameters,
non-nested concepts). The following issues must be addressed before enabling
precedence and real concept composition.
### 1. Parameters are limited to a single token
`ReadParameters` reads ONE token, then immediately calls `ManageUnrecognized`, which
returns to `ReadConcept` to match the next keyword segment. Multi-token parameters
therefore fail. For `if hello world then foo end` with parameter `a = "hello world"`:
```
ReadParameters reads "hello"
ManageUnrecognized → UT("hello") → ReadConcept tries to match " then "
ReadConcept reads " " ✓ then "world" ≠ "then" → MISMATCH
```
**Proposed fix:** `ReadParameters` should accumulate tokens until it detects the
start of the next keyword segment (lookahead on `expected[0][0]`), then hand the
full buffer to `ManageUnrecognized` for parsing in one pass.
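The proposed lookahead can be sketched as follows. The helper name and signature are illustrative, not existing Sheerka code:

```python
def read_parameter(tokens, pos, next_segment):
    """Accumulate parameter tokens until the first keyword of the next
    expected segment appears.  next_segment is the expected[0] pair.
    Hypothetical sketch of the proposed fix."""
    keywords = [t for t in next_segment[0] if t.strip()]  # skip space tokens
    stop = keywords[0] if keywords else None              # lookahead target
    buffer = []
    while pos < len(tokens) and tokens[pos] != stop:
        buffer.append(tokens[pos])
        pos += 1
    # The caller hands the whole buffer to ManageUnrecognized in one pass.
    return buffer, pos
```

For the failing example above, the parameter tokens `hello`, ` `, `world`, ` ` are all accumulated before `then` is seen, so `ReadConcept` can then match the keyword segment cleanly.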
---
### 2. Flat `parameters` list with no arity tracking
When `FinalizeConceptParsing` runs, `parameters` is a flat list. There is no
information about how many parameters belong to each concept on the stack. Once
`must_pop` is active and multiple concepts are stacked, `FinalizeConceptParsing`
cannot reconstruct the correct nesting.
Example: `1 plus 2 times 3` with `stack = [plus, times]` and
`parameters = [UT("1"), UT("2"), UT("3")]`. Without arity information there is no
way to determine that `times` consumes the last two parameters and `plus` consumes
the first one and the result of `times`.
The arity of each concept (`nb_variables`) is available in `expected` at push time
but is lost once `expected` is consumed during parsing.
**Proposed fix:** record the arity of each concept when it is pushed onto the stack
(in `apply_shunting_yard_algorithm`). `FinalizeConceptParsing` then pops the correct
number of parameters for each concept, from innermost to outermost, building
intermediate `MetadataToken` objects that are re-injected into `parameters` as
`ConceptToken` before processing the next concept on the stack.
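Under that assumption (arity recorded alongside each stacked concept), finalization could look like this sketch, using plain tuples in place of Sheerka's real classes:

```python
def finalize(stack, parameters):
    """stack: [(concept_name, arity), ...] with the innermost concept last.
    parameters: flat operand list in reading order.
    Hypothetical sketch of arity-aware finalization."""
    while stack:
        name, arity = stack.pop()            # innermost concept first
        args = parameters[-arity:] if arity else []
        if arity:
            del parameters[-arity:]          # it consumes the trailing params
        parameters.append((name, args))      # re-inject result as a parameter
    return parameters
```

For `1 plus 2 times 3`, `finalize([("plus", 2), ("times", 2)], ["1", "2", "3"])` first folds `times` over `["2", "3"]`, then `plus` over `"1"` and the `times` result, yielding the nesting `plus(1, times(2, 3))`.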
---
### 3. Type mismatch in `ManageUnrecognized` for recognized parameters
When `SimpleConceptsParser` recognizes a token sequence, `ManageUnrecognized`
creates:
```python
state_context.parameters.append(
ConceptToken(res.items[0], buffer_start_pos, parser_input.pos - 1)
)
```
`res.items[0]` is a `list[MetadataToken]` (one complete path from
`SimpleConceptsParser`), but `ConceptToken.concept` is typed as `Concept`. Any
downstream code that uses this `ConceptToken` will receive a list where it expects a
`Concept` instance.
**Proposed fix:** define a dedicated container for a recognized parameter (e.g.
`ParsedParameterToken`) that wraps a `list[MetadataToken]` with start/end positions,
or flatten the result to a single `MetadataToken` when `res.items[0]` contains
exactly one token.
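The proposed container could be as small as this. `ParsedParameterToken` does not exist in Sheerka yet; the field names are suggestions:

```python
from dataclasses import dataclass

@dataclass
class ParsedParameterToken:
    """Hypothetical container for a recognized parameter: one complete
    sub-path of MetadataToken-like items plus its span in the input."""
    tokens: list   # list[MetadataToken] — one path from the sub-parser
    start: int     # position of the first token of the parameter
    end: int       # position of the last token of the parameter
```

`ManageUnrecognized` would then append `ParsedParameterToken(res.items[0], buffer_start_pos, parser_input.pos - 1)` instead of mistyping the list as a `Concept`.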
---
### 4. Variable-to-parameter mapping not applied
`FinalizeConceptParsing` creates a `MetadataToken` without populating the concept's
variables. `parameters = [UT("1 "), UT("2")]` maps positionally to
`variables = [("a", NotInit), ("b", NotInit)]`, but this mapping is never applied.
The produced `MetadataToken` is therefore incomplete: a downstream evaluator has no
way to retrieve parameter values from the token alone.
**Proposed fix:** in `FinalizeConceptParsing`, zip `parameters` with
`concept.metadata.variables` and store the result in the `MetadataToken`'s metadata,
or pass it as a dedicated field.
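The zip step could be sketched as below. The helper is hypothetical; it only assumes the `[(name, initial_value), ...]` shape of `variables` described above:

```python
def bind_variables(variables, parameters):
    """variables: [(name, initial_value), ...] from the concept metadata.
    parameters: positional values collected during parsing.
    Hypothetical sketch of the proposed positional mapping."""
    if len(variables) != len(parameters):
        raise ValueError("arity mismatch between variables and parameters")
    return {name: value for (name, _), value in zip(variables, parameters)}
```

For `a plus b` parsed from `"1 plus 2"`, this binds `{"a": UT("1 "), "b": UT("2")}`, which `FinalizeConceptParsing` could store on the `MetadataToken` for downstream evaluation.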
---
### 5. `SyaConceptsParser` absent from `other_parsers`
`other_parsers = [SimpleConceptsParser()]`. A parameter can be a simple
(parameter-less) concept, but never a composite concept with parameters. True
composition — where a parameter is itself a SYA-parsed concept — is structurally
impossible with the current design.
**Proposed fix:** add `SyaConceptsParser` to `other_parsers`. A guard is required
to prevent infinite recursion: the nested instance should exclude the concept
currently being recognized from its search space.
---
### Priority order
| # | Issue | Blocking |
|---|---|---|
| 1 | Multi-token parameters | Practical usability |
| 2 | `ConceptToken` type mismatch | Correctness |
| 3 | Variable-to-parameter mapping | Evaluation pipeline |
| 4 | Arity not tracked on the stack | `must_pop` / precedence |
| 5 | `SyaConceptsParser` absent from `other_parsers` | Real composition |
The arity-tracking and variable-mapping issues are interdependent with `must_pop`: implementing
them independently (before activating precedence) is still valuable and lays the correct foundation.