# SyaConceptsParser

## Purpose

`SyaConceptsParser` parses **sequences of concepts with parameters** (variables).
It complements `SimpleConceptsParser`, which only handles parameter-less concepts.

Examples of recognized concepts:
- `a plus b` → matches `1 plus 2`, `x plus y`, etc.
- `if a then b end` → matches `if x > 0 then print x end`
- `a long named concept b` → matches `1 long named concept 2`

The primary goal is **concept composition**: `1 plus 2 times 3`, where `times` must
be evaluated before `plus`. This precedence problem is what the Shunting Yard
Algorithm solves.

---

## The Shunting Yard Algorithm (SYA)

Dijkstra's algorithm (1961) converts an infix expression (`1 + 2 * 3`) into
**Reverse Polish Notation** (RPN: `1 2 3 * +`), respecting operator precedence.

### Principle

Two structures: an **operator stack** and an **output queue**.

```
Input: 1 + 2 * 3

Token    │ Action               Stack     Output
─────────┼──────────────────────────────────────────────
1        │ → output queue       []        [1]
+        │ → stack              [+]       [1]
2        │ → output queue       [+]       [1, 2]
*        │ prec(*) > prec(+)    [+, *]    [1, 2]
         │ → stack (no pop)
3        │ → output queue       [+, *]    [1, 2, 3]
end      │ flush stack          []        [1, 2, 3, *, +]

RPN: 1 2 3 * + ≡ 1 + (2 * 3)
```

### Pop rule

When pushing operator `op`, first pop any stack-top operator `top` where:
`precedence(top) >= precedence(op)`

This ensures higher-precedence operators are evaluated first.

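
The trace above can be reproduced with a minimal, generic implementation of the algorithm (plain Python, unrelated to Sheerka's classes):

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    """Convert an infix token list to Reverse Polish Notation."""
    stack, output = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop rule: pop while the stack top has higher or equal precedence.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operand → straight to the output queue
    while stack:
        output.append(stack.pop())  # end of input: flush the stack
    return output

print(to_rpn(["1", "+", "2", "*", "3"]))  # → ['1', '2', '3', '*', '+']
```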
---

## Sheerka's Adaptation

The original SYA works on **atomic tokens** (digits, `+`, `*`).
Sheerka adapts it for **concepts** that:

1. **Are identified by multiple tokens** — a concept like `if a then b end` has
   several keywords (`if`, `then`, `end`) interleaved with parameters.
   The original SYA identifies an operator with a single token.

2. **Can have N parameters** — a binary operator has exactly 2 operands.
   A Sheerka concept can have 0, 1, 2 or more parameters.

3. **Parameters can themselves be concepts** — in `1 plus 2 times 3`, the parameter
   `b` of `plus` is the result of the `times` concept. This recursion is handled
   by the nested workflow structure.

### SYA ↔ Sheerka mapping

| Original SYA | Sheerka |
|---|---|
| Operator (`+`, `*`) | `ConceptToRecognize` (concept with parameters) |
| Operand (number, variable) | `UnrecognizedToken` or `ConceptToken` |
| Operator stack | `state_context.stack` |
| Output queue | `state_context.parameters` |
| Precedence rule | `InitConceptParsing.must_pop()` |
| RPN result | `MetadataToken` in `state_context.result` |

### Structural differences

**Multi-token recognition** — where SYA reads a single token to identify `*`,
Sheerka must read `long named concept` (3 tokens) to identify concept
`a long named concept b`. The `ReadConcept` state handles this sequential reading.

**The `expected` structure** — concept `if a then b end` is decomposed into segments:

```
[("if ", 0), (" then ", 1), (" end", 1)]
 ─────────   ───────────    ──────────
 keyword     keyword        keyword
 0 params    1 param        1 param
 before      before         before
```

Each segment states how many parameters precede it and which tokens to consume
to validate it.

**Precedence not yet implemented** — `must_pop()` always returns `False`.
Concept composition with precedence rules is the next implementation step.

---

## Architecture

### Two interdependent workflows

```mermaid
graph TD
    A[#tokens_wkf] -->|concept keyword found - fork| B[#concept_wkf]
    A -->|token not a concept keyword - buffered, loop| A
    B -->|concept fully parsed| A
    A -->|EOF| C[end]
```

The parser always starts in `#tokens_wkf`. Tokens that do not match any concept
keyword are accumulated in a buffer and the loop continues. Whenever a token
matching the first keyword of a known concept is detected, a **fork** is created
and sent into `#concept_wkf` — the main path keeps looping independently. Once the
concept is recognized, the fork returns to `#tokens_wkf` to continue reading.

---

## Workflow `#tokens_wkf`

```mermaid
stateDiagram-v2
    [*] --> start
    start --> prepare_read_tokens
    prepare_read_tokens --> read_tokens
    read_tokens --> read_tokens : no concept found (loop)
    read_tokens --> eof : EOF
    read_tokens --> concepts_found : concept keyword detected (fork)
    eof --> end : ManageUnrecognized
    concepts_found --> concept_wkf : ManageUnrecognized → #concept_wkf
    end --> [*]
```

**`PrepareReadTokens`**: initializes the buffer and records `buffer_start_pos`.

**`ReadTokens`**: reads one token, calls `get_metadata_from_first_token`. If a concept
can start at this token → **fork** with a cloned context where `concept_to_recognize`
is set. The main path continues scanning.

**`ManageUnrecognized("concepts found")`**: processes the buffer accumulated before
the keyword (via `SimpleConceptsParser`). Unrecognized tokens become
`UnrecognizedToken` and are added to `parameters`.

---

## Workflow `#concept_wkf`

```mermaid
stateDiagram-v2
    [*] --> start
    start --> init_concept_parsing
    init_concept_parsing --> manage_parameters

    manage_parameters --> read_concept

    read_concept --> read_parameters : more segments
    read_concept --> finalize_concept : all segments done
    read_concept --> token_mismatch : token mismatch
    read_concept --> error_eof : unexpected EOF

    read_parameters --> manage_parameters : loop
    read_parameters --> finalize_concept : EOF

    finalize_concept --> tokens_wkf : #tokens_wkf
    token_mismatch --> end
    error_eof --> end
    end --> [*]
```

**`InitConceptParsing`**:
- Verifies the number of already-collected parameters is sufficient
- Removes the first token of segment 0 (already consumed by `ReadTokens`)
- Applies SYA: pushes the concept onto the stack

**`ReadConcept`**: reads the fixed tokens of the current segment one by one.
If all match → `pop(0)` the segment and continue.

**`ReadParameters`**: reads ONE token into the buffer, then returns to
`ManageUnrecognized`, which tries to recognize it via `SimpleConceptsParser`.

**`FinalizeConceptParsing`**:
- Pops the concept from the stack
- Computes `start` (from the first parameter) and `end` (current position)
- Creates `MetadataToken(concept.metadata, start, end, resolution_method, "sya")`
- Clears stack and parameters
- Returns to `#tokens_wkf`

---

## Step-by-step example — `"1 plus 2"`

Concept: `a plus b` (variables `a`, `b`).

**Tokens:**
```
pos : 0    1    2      3    4    5
tok : "1"  " "  "plus" " "  "2"  EOF
```

**`expected` for this concept:**
```
[([" ", "plus", " "], 1), ([], 1)]
 segment 0 → 1 param before, read " plus "
 segment 1 → 1 param before, read nothing (concept ends with a param)
```

**Execution trace:**

```
PrepareReadTokens → buffer_start_pos = 0

ReadTokens "1"    → no concept, buffer = ["1"]
ReadTokens " "    → no concept, buffer = ["1", " "]
ReadTokens "plus" → concept "a plus b" found!

┌── FORK ─────────────────────────────────────────────────────┐
│ clone: buffer=["1"," "], pos=2, concept_to_recognize=CTR(+) │
└─────────────────────────────────────────────────────────────┘

ManageUnrecognized("concepts found")
    buffer = ["1"," "] → SimpleConceptsParser → not found
    parameters = [UnrecognizedToken("1 ", start=0, end=1)]
    buffer_start_pos = 3
    → #concept_wkf

InitConceptParsing
    expected[0] = ([" ","plus"," "], 1)
    need 1 param → have 1 ✓
    strip leading WS → ["plus"," "]
    pop "plus" (already consumed) → [" "]
    SYA: stack = [CTR(a_plus_b)]

ManageUnrecognized("manage parameters"): buffer empty → nothing

ReadConcept: reads [" "] → pos 3 = " " ✓
    expected.pop(0) → remaining = [([], 1)]
    → "read parameters"

ReadParameters: reads "2" at pos 4
    buffer = ["2"]
    → "manage parameters"

ManageUnrecognized("manage parameters")
    buffer = ["2"] → not a concept
    parameters = [UT("1 ", 0, 1), UT("2", 3, 3)]
    buffer_start_pos = 5

ReadConcept: expected = [([], 1)], reads 0 tokens
    expected.pop(0) → empty → "finalize concept"

FinalizeConceptParsing
    concept = stack.pop() = CTR(a_plus_b)
    start = parameters[0].start = 0
    end = parser_input.pos = 4
    result.append(MetadataToken(metadata, 0, 4, "key", "sya"))
    → #tokens_wkf

ReadTokens → EOF → ManageUnrecognized("eof") → end
```

**Result:**
```
MultipleChoices([
    [MetadataToken(id="1001", start=0, end=4, resolution_method="key", parser="sya")]
])
```

---

## Example — sequence `"1 plus 2 3 plus 7"`

Same concept `a plus b`. The parser recognizes two concepts in one pass.

```
pos : 0    1    2      3    4    5    6    7    8      9    10   11
tok : "1"  " "  "plus" " "  "2"  " "  "3"  " "  "plus" " "  "7"  EOF
```

After `FinalizeConceptParsing` for the first concept (pos=4), `#tokens_wkf` restarts:

```
PrepareReadTokens → buffer_start_pos = 5
ReadTokens " "    → buffer = [" "]
ReadTokens "3"    → buffer = [" ","3"]
ReadTokens " "    → buffer = [" ","3"," "]
ReadTokens "plus" → fork

ManageUnrecognized → UT(" 3 ", start=5, end=7), buffer_start_pos=9
...
FinalizeConceptParsing
    start = 5, end = 10
    result.append(MetadataToken(1001, 5, 10, "key", "sya"))
```

**Final result (one path, two concepts):**
```
MultipleChoices([
    [
        MetadataToken(1001, start=0, end=4, parser="sya"),
        MetadataToken(1001, start=5, end=10, parser="sya"),
    ]
])
```

---

## Future example — composition `"1 plus 2 times 3"`

> **Note:** this example requires implementing `must_pop()`.
> Currently `must_pop()` always returns `False`.

Concepts: `a plus b` (low precedence), `a times b` (high precedence).

**Expected behavior after implementation:**

```
Expression: 1 plus 2 times 3

SYA with precedence times > plus:

Token "1"     → parameters = [1]        stack = []
Token "plus"  → stack = [plus]          parameters = [1]
Token "2"     → parameters = [1, 2]     stack = [plus]
Token "times" → prec(times) > prec(plus) → no pop
                stack = [plus, times]   parameters = [1, 2]
Token "3"     → parameters = [1, 2, 3]  stack = [plus, times]

Finalize:
    pop "times" → MetadataToken(times, params=[2, 3])
    pop "plus"  → MetadataToken(plus, params=[1, times_result])
```

**What `must_pop()` must implement:**
```python
def must_pop(self, current_concept, top_of_stack_concept):
    return precedence(top_of_stack_concept) >= precedence(current_concept)
```

Without this rule, both concepts are processed left-to-right with equal precedence,
yielding `(1 plus 2) times 3` instead of `1 plus (2 times 3)`.

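
Once `must_pop()` makes real precedence decisions, the scan-and-finalize sequence above can be sketched as a standalone function (illustrative only: `parse_with_precedence` and its nested-tuple output are not Sheerka structures):

```python
def parse_with_precedence(tokens, precedence):
    """Shunting yard building a nested structure instead of flat RPN."""
    stack, out = [], []

    def reduce_top():
        # Pop one concept and replace its two parameters by the nested result.
        op = stack.pop()
        right, left = out.pop(), out.pop()
        out.append((op, left, right))

    for tok in tokens:
        if tok in precedence:
            # must_pop: reduce while the stack top binds at least as tightly.
            while stack and precedence[stack[-1]] >= precedence[tok]:
                reduce_top()
            stack.append(tok)
        else:
            out.append(tok)
    while stack:
        reduce_top()  # finalize: innermost concept first
    return out[0]

tree = parse_with_precedence(["1", "plus", "2", "times", "3"],
                             {"plus": 1, "times": 2})
print(tree)  # → ('plus', '1', ('times', '2', '3'))
```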
---

## The `expected` structure in detail

For concept `if a then b end` (key `"if __var__0 then __var__1 end"`):

```
_get_expected_tokens("if __var__0 then __var__1 end")

→ [
    (["if", " "], 0),         # read "if " before 1st param
    ([" ", "then", " "], 1),  # read " then " before 2nd param
    ([" ", "end"], 1),        # read " end" — 1 param before, no trailing param
  ]
```

During parsing, `expected` is **modified in place**:
- `InitConceptParsing` removes the first token of segment 0 (already read by `ReadTokens`)
- `ReadConcept` consumes the tokens of the current segment then calls `pop(0)`
- When `expected` is empty → `FinalizeConceptParsing`

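
A possible implementation of this decomposition (a sketch written for this document; the real `_get_expected_tokens` may differ) is:

```python
import re

def get_expected_tokens(key):
    """Split a concept key into (keyword tokens, nb of params before) segments.

    Sketch only: tokenization is simplified to words, single spaces and
    __var__N placeholders.
    """
    segments, current, params_before = [], [], 0
    for tok in re.findall(r"__var__\d+|\s|\w+", key):
        if re.fullmatch(r"__var__\d+", tok):
            if current:  # close the keyword segment accumulated so far
                segments.append((current, params_before))
                current, params_before = [], 0
            params_before += 1
        else:
            current.append(tok)
    segments.append((current, params_before))  # trailing segment (may be empty)
    return segments

print(get_expected_tokens("if __var__0 then __var__1 end"))
# → [(['if', ' '], 0), ([' ', 'then', ' '], 1), ([' ', 'end'], 1)]
```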
---

## Key data structures

### `StateMachineContext`

```
StateMachineContext
├── parser_input          ParserInput          token stream + cursor
├── other_parsers         [SimpleConceptsParser]
├── buffer                list[Token]          tokens pending classification
├── buffer_start_pos      int                  start position of the current buffer
├── concept_to_recognize  ConceptToRecognize | None
├── stack                 list[CTR]            SYA — operator stack
├── parameters            list[UT|CT]          SYA — output queue
├── result                list[MetadataToken]
└── errors                list
```

### `MetadataToken` (output)

```
MetadataToken
├── metadata           ConceptMetadata    (id, name, key, variables, ...)
├── start              int                position of the first token of the expression
├── end                int                position of the last token
├── resolution_method  "key" | "name" | "id"
└── parser             "sya"
```

### Token positions in `"1 plus 2"`

```
"1 plus 2"

 tok : 1  _  plus  _  2
 pos : 0  1  2     3  4

MetadataToken: start=0, end=4
```

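
For orientation, the two structures can be sketched as Python dataclasses (field types are assumptions inferred from the trees above, not Sheerka's actual definitions):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class MetadataToken:
    metadata: Any             # ConceptMetadata (id, name, key, variables, ...)
    start: int                # position of the first token of the expression
    end: int                  # position of the last token
    resolution_method: str    # "key" | "name" | "id"
    parser: str = "sya"

@dataclass
class StateMachineContext:
    parser_input: Any = None                        # token stream + cursor
    other_parsers: list = field(default_factory=list)
    buffer: list = field(default_factory=list)      # tokens pending classification
    buffer_start_pos: int = 0
    concept_to_recognize: Optional[Any] = None
    stack: list = field(default_factory=list)       # SYA operator stack
    parameters: list = field(default_factory=list)  # SYA output queue
    result: list = field(default_factory=list)
    errors: list = field(default_factory=list)
```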
---

## Differences vs `SimpleConceptsParser`

| | `SimpleConceptsParser` | `SyaConceptsParser` |
|---|---|---|
| Target concepts | No parameters | With parameters |
| `concept_wkf` states | 2 | 8 |
| `result` contents | `MetadataToken` + `UnrecognizedToken` | `MetadataToken` only |
| Parameters | N/A | Collected in `parameters` list |
| Parser tag | `"simple"` | `"sya"` |
| SYA | No | Yes (precedence to implement) |

---

## Error handling

| Error | Cause | State reached |
|---|---|---|
| `UnexpectedToken` | Read token ≠ expected concept token | `TokenMismatch` → `end` |
| `UnexpectedEof` | Input ends before concept is complete | `ErrorEof` → `end` |
| `NotEnoughParameters` | Too few params before a segment | Exception raised |

Errors are collected from **all paths** and forwarded to `error_sink` in `parse()`.
A path with errors is excluded from `_select_best_paths`.

---

## Known limitations and proposed improvements

The current implementation correctly handles simple cases (single-token parameters,
non-nested concepts). The following issues must be addressed before enabling
precedence and real concept composition.

### 1. Parameters are limited to a single token

`ReadParameters` reads ONE token, then immediately calls `ManageUnrecognized`, which
returns to `ReadConcept` to match the next keyword segment. Multi-token parameters
therefore fail. For `if hello world then foo end` with parameter `a = "hello world"`:

```
ReadParameters reads "hello"
ManageUnrecognized → UT("hello") → ReadConcept tries to match " then "
ReadConcept reads " " ✓ then "world" ≠ "then" → MISMATCH
```

**Proposed fix:** `ReadParameters` should accumulate tokens until it detects the
start of the next keyword segment (lookahead on `expected[0][0]`), then hand the
full buffer to `ManageUnrecognized` for parsing in one pass.

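
The proposed lookahead can be sketched as follows (hypothetical helper, not current Sheerka code; note that a parameter containing the keyword sequence itself would still stop early):

```python
def read_parameter_tokens(tokens, pos, next_keyword):
    """Accumulate parameter tokens until the next keyword segment starts."""
    buffer = []
    while pos < len(tokens):
        # Lookahead on the next keyword segment (expected[0][0]).
        if tokens[pos:pos + len(next_keyword)] == next_keyword:
            break
        buffer.append(tokens[pos])
        pos += 1
    return buffer, pos

tokens = ["if", " ", "hello", " ", "world", " ", "then", " ", "foo", " ", "end"]
# Parameter a = "hello world": accumulate until " then " matches.
buf, pos = read_parameter_tokens(tokens, 2, [" ", "then", " "])
print(buf, pos)  # → ['hello', ' ', 'world'] 5
```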
---

### 2. Flat `parameters` list with no arity tracking

When `FinalizeConceptParsing` runs, `parameters` is a flat list. There is no
information about how many parameters belong to each concept on the stack. Once
`must_pop` is active and multiple concepts are stacked, `FinalizeConceptParsing`
cannot reconstruct the correct nesting.

Example: `1 plus 2 times 3` with `stack = [plus, times]` and
`parameters = [UT("1"), UT("2"), UT("3")]`. Without arity information there is no
way to determine that `times` consumes the last two parameters and `plus` consumes
the first one and the result of `times`.

The arity of each concept (`nb_variables`) is available in `expected` at push time
but is lost once `expected` is consumed during parsing.

**Proposed fix:** record the arity of each concept when it is pushed onto the stack
(in `apply_shunting_yard_algorithm`). `FinalizeConceptParsing` then pops the correct
number of parameters for each concept, from innermost to outermost, building
intermediate `MetadataToken` objects that are re-injected into `parameters` as
`ConceptToken` before processing the next concept on the stack.

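
This fix can be sketched like so (illustrative: tuples stand in for the stack entries with recorded arity, and for the re-injected `ConceptToken`):

```python
def finalize(stack, parameters):
    """Pop each concept with its recorded arity, innermost first,
    re-injecting the intermediate result into the parameter list."""
    while stack:
        concept, arity = stack.pop()
        args = parameters[len(parameters) - arity:]
        del parameters[len(parameters) - arity:]
        parameters.append((concept, args))
    return parameters

# 1 plus 2 times 3, after scanning: two stacked concepts, flat parameters.
result = finalize([("plus", 2), ("times", 2)], ["1", "2", "3"])
print(result)  # → [('plus', ['1', ('times', ['2', '3'])])]
```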
---

### 3. Type mismatch in `ManageUnrecognized` for recognized parameters

When `SimpleConceptsParser` recognizes a token sequence, `ManageUnrecognized`
creates:

```python
state_context.parameters.append(
    ConceptToken(res.items[0], buffer_start_pos, parser_input.pos - 1)
)
```

`res.items[0]` is a `list[MetadataToken]` (one complete path from
`SimpleConceptsParser`), but `ConceptToken.concept` is typed as `Concept`. Any
downstream code that uses this `ConceptToken` will receive a list where it expects a
`Concept` instance.

**Proposed fix:** define a dedicated container for a recognized parameter (e.g.
`ParsedParameterToken`) that wraps a `list[MetadataToken]` with start/end positions,
or flatten the result to a single `MetadataToken` when `res.items[0]` contains
exactly one token.

---

### 4. Variable-to-parameter mapping not applied

`FinalizeConceptParsing` creates a `MetadataToken` without populating the concept's
variables. `parameters = [UT("1 "), UT("2")]` maps positionally to
`variables = [("a", NotInit), ("b", NotInit)]`, but this mapping is never applied.
The produced `MetadataToken` is therefore incomplete: a downstream evaluator has no
way to retrieve parameter values from the token alone.

**Proposed fix:** in `FinalizeConceptParsing`, zip `parameters` with
`concept.metadata.variables` and store the result in the `MetadataToken`'s metadata,
or pass it as a dedicated field.

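
The mapping itself is a positional zip (sketch; `bind_variables` and the plain-string parameters are illustrative, not Sheerka code):

```python
def bind_variables(variable_names, parameters):
    """Positionally bind the concept's variables to the collected parameters."""
    return dict(zip(variable_names, parameters))

print(bind_variables(["a", "b"], ["1 ", "2"]))  # → {'a': '1 ', 'b': '2'}
```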
---

### 5. `SyaConceptsParser` absent from `other_parsers`

`other_parsers = [SimpleConceptsParser()]`. A parameter can be a simple
(parameter-less) concept, but never a composite concept with parameters. True
composition — where a parameter is itself a SYA-parsed concept — is structurally
impossible with the current design.

**Proposed fix:** add `SyaConceptsParser` to `other_parsers`. A guard is required
to prevent infinite recursion: the nested instance should exclude the concept
currently being recognized from its search space.

---

### Priority order

| Priority | Issue | Blocking |
|---|---|---|
| 1 | Multi-token parameters (issue 1) | Practical usability |
| 2 | `ConceptToken` type mismatch (issue 3) | Correctness |
| 3 | Variable-to-parameter mapping (issue 4) | Evaluation pipeline |
| 4 | Arity not tracked on the stack (issue 2) | `must_pop` / precedence |
| 5 | `SyaConceptsParser` absent from `other_parsers` (issue 5) | Real composition |

The variable-mapping and arity issues are interdependent with `must_pop`, but
implementing them independently, before activating precedence, is still valuable
and lays the correct foundation.