# SyaConceptsParser

## Purpose

`SyaConceptsParser` parses **sequences of concepts with parameters** (variables). It complements `SimpleConceptsParser`, which only handles parameter-less concepts.

Examples of recognized concepts:

- `a plus b` → matches `1 plus 2`, `x plus y`, etc.
- `if a then b end` → matches `if x > 0 then print x end`
- `a long named concept b` → matches `1 long named concept 2`

The primary goal is **concept composition**: `1 plus 2 times 3`, where `times` must be evaluated before `plus`. This precedence problem is what the Shunting Yard Algorithm solves.

---

## The Shunting Yard Algorithm (SYA)

Dijkstra's algorithm (1961) converts an infix expression (`1 + 2 * 3`) into **Reverse Polish Notation** (RPN: `1 2 3 * +`), respecting operator precedence.

### Principle

Two structures: an **operator stack** and an **output queue**.

```
Input: 1 + 2 * 3

         ┌──────────────────────────────────────────────┐
Token    │ Action               Stack     Output        │
─────────┼──────────────────────────────────────────────┤
1        │ → output queue       []        [1]           │
+        │ → stack              [+]       [1]           │
2        │ → output queue       [+]       [1, 2]        │
*        │ prec(*) > prec(+)    [+, *]    [1, 2]        │
         │   → stack (no pop)                           │
3        │ → output queue       [+, *]    [1, 2, 3]     │
end      │ flush stack          []        [1,2,3,*,+]   │
         └──────────────────────────────────────────────┘

RPN: 1 2 3 * + ≡ 1 + (2 * 3)
```

### Pop rule

When pushing operator `op`, first pop any stack-top operator `top` where:

`precedence(top) >= precedence(op)`

This ensures higher-precedence operators are evaluated first (and, because of the `>=`, makes equal-precedence operators left-associative).

---

## Sheerka's Adaptation

The original SYA works on **atomic tokens** (digits, `+`, `*`). Sheerka adapts it for **concepts** that:

1. **Are identified by multiple tokens** — a concept like `if a then b end` has several keywords (`if`, `then`, `end`) interleaved with parameters. The original SYA identifies an operator with a single token.
2. **Can have N parameters** — a binary operator has exactly 2 operands. A Sheerka concept can have 0, 1, 2 or more parameters.
3. **Parameters can themselves be concepts** — in `1 plus 2 times 3`, the parameter `b` of `plus` is the result of the `times` concept. This recursion is handled by the nested workflow structure.

### SYA ↔ Sheerka mapping

| Original SYA | Sheerka |
|---|---|
| Operator (`+`, `*`) | `ConceptToRecognize` (concept with parameters) |
| Operand (number, variable) | `UnrecognizedToken` or `ConceptToken` |
| Operator stack | `state_context.stack` |
| Output queue | `state_context.parameters` |
| Precedence rule | `InitConceptParsing.must_pop()` |
| RPN result | `MetadataToken` in `state_context.result` |

### Structural differences

**Multi-token recognition** — where SYA reads a single token to identify `*`, Sheerka must read `long named concept` (3 tokens) to identify concept `a long named concept b`. The `ReadConcept` state handles this sequential reading.

**The `expected` structure** — concept `if a then b end` is decomposed into segments:

```
[(["if", " "], 0), ([" ", "then", " "], 1), ([" ", "end"], 1)]
 ────────────────  ────────────────────────  ─────────────────
 keyword "if "     keyword " then "          keyword " end"
 0 params before   1 param before            1 param before
```

Each segment states how many parameters precede it and which tokens to consume to validate it.

**Precedence not yet implemented** — `must_pop()` always returns `False`. Concept composition with precedence rules is the next implementation step.
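For reference, here is the textbook algorithm the sections above describe, as a minimal sketch for binary, left-associative operators. The `PRECEDENCE` table and `to_rpn` function are illustrative names, not part of Sheerka:

```python
# Textbook shunting yard for binary, left-associative operators.
# Illustrative sketch only; PRECEDENCE and to_rpn are not Sheerka names.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens: list[str]) -> list[str]:
    output: list[str] = []   # output queue
    stack: list[str] = []    # operator stack
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop rule: pop while the stack top binds at least as tightly.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)        # operand goes straight to the output
    while stack:
        output.append(stack.pop())    # flush remaining operators
    return output

assert to_rpn(["1", "+", "2", "*", "3"]) == ["1", "2", "3", "*", "+"]
```

In the mapping above, Sheerka's `must_pop()` corresponds to the `PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]` test; the rest of the sketch maps onto the stack and output queue of the trace table.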
---

## Architecture

### Two interdependent workflows

```mermaid
graph TD
    A[#tokens_wkf] -->|concept keyword found - fork| B[#concept_wkf]
    A -->|token not a concept keyword - buffered, loop| A
    B -->|concept fully parsed| A
    A -->|EOF| C[end]
```

The parser always starts in `#tokens_wkf`. Tokens that do not match any concept keyword are accumulated in a buffer and the loop continues. Whenever a token matching the first keyword of a known concept is detected, a **fork** is created and sent into `#concept_wkf` — the main path keeps looping independently. Once the concept is recognized, the fork returns to `#tokens_wkf` to continue reading.

---

## Workflow `#tokens_wkf`

```mermaid
stateDiagram-v2
    state "end" as end_state
    [*] --> start
    start --> prepare_read_tokens
    prepare_read_tokens --> read_tokens
    read_tokens --> read_tokens : no concept found (loop)
    read_tokens --> eof : EOF
    read_tokens --> concepts_found : concept keyword detected (fork)
    eof --> end_state : ManageUnrecognized
    concepts_found --> concept_wkf : ManageUnrecognized → #concept_wkf
    end_state --> [*]
```

**`PrepareReadTokens`**: initializes the buffer and records `buffer_start_pos`.

**`ReadTokens`**: reads one token, calls `get_metadata_from_first_token`. If a concept can start at this token → **fork** with a cloned context where `concept_to_recognize` is set. The main path continues scanning.

**`ManageUnrecognized("concepts found")`**: processes the buffer accumulated before the keyword (via `SimpleConceptsParser`). Unrecognized tokens become `UnrecognizedToken` and are added to `parameters`.

---

## Workflow `#concept_wkf`

```mermaid
stateDiagram-v2
    state "end" as end_state
    [*] --> start
    start --> init_concept_parsing
    init_concept_parsing --> manage_parameters
    manage_parameters --> read_concept
    read_concept --> read_parameters : more segments
    read_concept --> finalize_concept : all segments done
    read_concept --> token_mismatch : token mismatch
    read_concept --> error_eof : unexpected EOF
    read_parameters --> manage_parameters : loop
    read_parameters --> finalize_concept : EOF
    finalize_concept --> tokens_wkf : #tokens_wkf
    token_mismatch --> end_state
    error_eof --> end_state
    end_state --> [*]
```

**`InitConceptParsing`**:
- Verifies the number of already-collected parameters is sufficient
- Removes the first token of segment 0 (already consumed by `ReadTokens`)
- Applies SYA: pushes the concept onto the stack

**`ReadConcept`**: reads the fixed tokens of the current segment one by one. If all match → `pop(0)` the segment and continue.

**`ReadParameters`**: reads ONE token into the buffer. Returns to `ManageUnrecognized`, which tries to recognize it via `SimpleConceptsParser`.

**`FinalizeConceptParsing`**:
- Pops the concept from the stack
- Computes `start` (from the first parameter) and `end` (current position)
- Creates `MetadataToken(concept.metadata, start, end, resolution_method, "sya")`
- Clears stack and parameters
- Returns to `#tokens_wkf`

---

## Step-by-step example — `"1 plus 2"`

Concept: `a plus b` (variables `a`, `b`).

**Tokens:**

```
pos : 0    1    2       3    4    5
tok : "1"  " "  "plus"  " "  "2"  EOF
```

**`expected` for this concept:**

```
[([" ", "plus", " "], 1), ([], 1)]

segment 0 → 1 param before, read " plus "
segment 1 → 1 param before, read nothing (concept ends with a param)
```

**Execution trace:**

```
PrepareReadTokens  → buffer_start_pos = 0
ReadTokens "1"     → no concept, buffer = ["1"]
ReadTokens " "     → no concept, buffer = ["1", " "]
ReadTokens "plus"  → concept "a plus b" found!
┌── FORK ────────────────────────────────────────────────────────────┐
│ clone: buffer=["1"," "], pos=2, concept_to_recognize=CTR(a_plus_b) │
└────────────────────────────────────────────────────────────────────┘

ManageUnrecognized("concepts found")
    buffer = ["1"," "] → SimpleConceptsParser → not found
    parameters = [UnrecognizedToken("1 ", start=0, end=1)]
    buffer_start_pos = 3
    → #concept_wkf

InitConceptParsing
    expected[0] = ([" ","plus"," "], 1)
    need 1 param → have 1 ✓
    strip leading WS → ["plus"," "]
    pop "plus" (already consumed) → [" "]
    SYA: stack = [CTR(a_plus_b)]

ManageUnrecognized("manage parameters"): buffer empty → nothing

ReadConcept: reads [" "] → pos 3 = " " ✓
    expected.pop(0) → remaining = [([], 1)]
    → "read parameters"

ReadParameters: reads "2" at pos 4
    buffer = ["2"] → "manage parameters"

ManageUnrecognized("manage parameters")
    buffer = ["2"] → not a concept
    parameters = [UT("1 ", 0, 1), UT("2", 3, 3)]
    buffer_start_pos = 5

ReadConcept: expected = [([], 1)], reads 0 tokens
    expected.pop(0) → empty → "finalize concept"

FinalizeConceptParsing
    concept = stack.pop() = CTR(a_plus_b)
    start = parameters[0].start = 0
    end = parser_input.pos = 4
    result.append(MetadataToken(metadata, 0, 4, "key", "sya"))
    → #tokens_wkf

ReadTokens → EOF → ManageUnrecognized("eof") → end
```

**Result:**

```
MultipleChoices([
    [MetadataToken(id="1001", start=0, end=4, resolution_method="key", parser="sya")]
])
```

---

## Example — sequence `"1 plus 2 3 plus 7"`

Same concept `a plus b`. The parser recognizes two concepts in one pass.

```
pos : 0    1    2       3    4    5    6    7    8       9    10   11
tok : "1"  " "  "plus"  " "  "2"  " "  "3"  " "  "plus"  " "  "7"  EOF
```

After `FinalizeConceptParsing` for the first concept (pos=4), `#tokens_wkf` restarts:

```
PrepareReadTokens → buffer_start_pos = 5
ReadTokens " "    → buffer = [" "]
ReadTokens "3"    → buffer = [" ","3"]
ReadTokens " "    → buffer = [" ","3"," "]
ReadTokens "plus" → fork
ManageUnrecognized → UT(" 3 ", start=5, end=7), buffer_start_pos=9
...
FinalizeConceptParsing
    start = 5, end = 10
    result.append(MetadataToken(1001, 5, 10, "key", "sya"))
```

**Final result (one path, two concepts):**

```
MultipleChoices([
    [
        MetadataToken(1001, start=0, end=4, parser="sya"),
        MetadataToken(1001, start=5, end=10, parser="sya"),
    ]
])
```

---

## Future example — composition `"1 plus 2 times 3"`

> **Note:** this example requires implementing `must_pop()`.
> Currently `must_pop()` always returns `False`.

Concepts: `a plus b` (low precedence), `a times b` (high precedence).

**Expected behavior after implementation:**

```
Expression: 1 plus 2 times 3

SYA with precedence times > plus:

Token "1"     → parameters = [1]        stack = []
Token "plus"  → stack = [plus]          parameters = [1]
Token "2"     → parameters = [1, 2]     stack = [plus]
Token "times" → prec(times) > prec(plus) → no pop
                stack = [plus, times]   parameters = [1, 2]
Token "3"     → parameters = [1, 2, 3]  stack = [plus, times]

Finalize:
    pop "times" → MetadataToken(times, params=[2, 3])
    pop "plus"  → MetadataToken(plus,  params=[1, times_result])
```

**What `must_pop()` must implement:**

```python
def must_pop(self, current_concept, top_of_stack_concept):
    # Pop while the stack top binds at least as tightly (left associativity).
    return precedence(top_of_stack_concept) >= precedence(current_concept)
```

Without this rule, both concepts are treated with equal precedence and grouped left to right, yielding `(1 plus 2) times 3` instead of `1 plus (2 times 3)`.
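To make the `Finalize` step above concrete, here is a self-contained sketch of the arity-aware fold it implies. Plain strings and tuples stand in for Sheerka's token types, and the explicit `(concept, arity)` pairs anticipate the arity tracking discussed in the limitations below; none of this is the current API:

```python
# Sketch of the finalize fold once precedence and arity tracking exist.
# Plain tuples stand in for CTR/MetadataToken; not the current Sheerka API.
stack = [("plus", 2), ("times", 2)]   # (concept, arity), as pushed by the SYA
parameters = ["1", "2", "3"]          # output queue after reading token "3"

while stack:
    concept, arity = stack.pop()        # innermost concept first
    args = parameters[-arity:]          # it consumes the last `arity` params
    del parameters[-arity:]
    parameters.append((concept, args))  # re-inject the result as a parameter

print(parameters)  # [('plus', ['1', ('times', ['2', '3'])])] ≡ 1 plus (2 times 3)
```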
---

## The `expected` structure in detail

For concept `if a then b end` (key `"if __var__0 then __var__1 end"`):

```
_get_expected_tokens("if __var__0 then __var__1 end")
→ [
    (["if", " "], 0),         # read "if " before 1st param
    ([" ", "then", " "], 1),  # read " then " before 2nd param
    ([" ", "end"], 1),        # read " end" — 1 param before, no trailing param
]
```

During parsing, `expected` is **modified in place**:

- `InitConceptParsing` removes the first token of segment 0 (already read by `ReadTokens`)
- `ReadConcept` consumes the tokens of the current segment then calls `pop(0)`
- When `expected` is empty → `FinalizeConceptParsing`

---

## Key data structures

### `StateMachineContext`

```
StateMachineContext
├── parser_input          ParserInput          token stream + cursor
├── other_parsers         [SimpleConceptsParser]
├── buffer                list[Token]          tokens pending classification
├── buffer_start_pos      int                  start position of the current buffer
├── concept_to_recognize  ConceptToRecognize | None
├── stack                 list[CTR]            SYA — operator stack
├── parameters            list[UT|CT]          SYA — output queue
├── result                list[MetadataToken]
└── errors                list
```

### `MetadataToken` (output)

```
MetadataToken
├── metadata           ConceptMetadata (id, name, key, variables, ...)
├── start              int — position of the first token of the expression
├── end                int — position of the last token
├── resolution_method  "key" | "name" | "id"
└── parser             "sya"
```

### Token positions in `"1 plus 2"`

```
pos : 0   1   2      3   4
      │   │   │      │   │
      1   _   plus   _   2

MetadataToken: start=0, end=4
```

---

## Differences vs `SimpleConceptsParser`

| | `SimpleConceptsParser` | `SyaConceptsParser` |
|---|---|---|
| Target concepts | No parameters | With parameters |
| `concept_wkf` states | 2 | 8 |
| `result` contents | `MetadataToken` + `UnrecognizedToken` | `MetadataToken` only |
| Parameters | N/A | Collected in `parameters` list |
| Parser tag | `"simple"` | `"sya"` |
| SYA | No | Yes (precedence to implement) |

---

## Error handling

| Error | Cause | State reached |
|---|---|---|
| `UnexpectedToken` | Read token ≠ expected concept token | `TokenMismatch` → `end` |
| `UnexpectedEof` | Input ends before concept is complete | `ErrorEof` → `end` |
| `NotEnoughParameters` | Too few params before a segment | Exception raised |

Errors are collected from **all paths** and forwarded to `error_sink` in `parse()`. A path with errors is excluded from `_select_best_paths`.

---

## Known limitations and proposed improvements

The current implementation correctly handles simple cases (single-token parameters, non-nested concepts). The following issues must be addressed before enabling precedence and real concept composition.

### 1. Parameters are limited to a single token

`ReadParameters` reads ONE token, then immediately calls `ManageUnrecognized`, which returns to `ReadConcept` to match the next keyword segment. Multi-token parameters therefore fail.

For `if hello world then foo end` with parameter `a = "hello world"`:

```
ReadParameters reads "hello"
ManageUnrecognized → UT("hello")
→ ReadConcept tries to match " then "
ReadConcept reads " " ✓ then "world" ≠ "then" → MISMATCH
```

**Proposed fix:** `ReadParameters` should accumulate tokens until it detects the start of the next keyword segment (lookahead on `expected[0][0]`), then hand the full buffer to `ManageUnrecognized` for parsing in one pass.
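A minimal sketch of that lookahead, where `peek_many`, `read` and `eof` are hypothetical stand-ins for the real `ParserInput` API:

```python
# Hypothetical sketch — peek_many/read/eof stand in for the real ParserInput
# API, and `expected` is the segment list described above.
def read_parameter_tokens(parser_input, expected, buffer):
    keywords = expected[0][0] if expected else []  # e.g. [" ", "then", " "]
    while not parser_input.eof():
        # Stop as soon as the upcoming tokens spell the next keyword segment.
        if keywords and parser_input.peek_many(len(keywords)) == keywords:
            break
        buffer.append(parser_input.read())
    # The whole multi-token parameter is now buffered; ManageUnrecognized
    # can then process it in a single pass.
```

For `if hello world then foo end`, the loop buffers `["hello", " ", "world"]` and stops with the cursor on the `" then "` segment, which `ReadConcept` can then match unchanged.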
---

### 2. Flat `parameters` list with no arity tracking

When `FinalizeConceptParsing` runs, `parameters` is a flat list: there is no record of how many parameters belong to each concept on the stack. Once `must_pop` is active and multiple concepts are stacked, `FinalizeConceptParsing` cannot reconstruct the correct nesting.

Example: `1 plus 2 times 3` with `stack = [plus, times]` and `parameters = [UT("1"), UT("2"), UT("3")]`. Without arity information there is no way to determine that `times` consumes the last two parameters and `plus` consumes the first one and the result of `times`. The arity of each concept (`nb_variables`) is available in `expected` at push time but is lost once `expected` is consumed during parsing.

**Proposed fix:** record the arity of each concept when it is pushed onto the stack (in `apply_shunting_yard_algorithm`). `FinalizeConceptParsing` then pops the correct number of parameters for each concept, from innermost to outermost, building intermediate `MetadataToken` objects that are re-injected into `parameters` as `ConceptToken` before processing the next concept on the stack (the fold sketched after the composition example above).

---

### 3. Type mismatch in `ManageUnrecognized` for recognized parameters

When `SimpleConceptsParser` recognizes a token sequence, `ManageUnrecognized` creates:

```python
state_context.parameters.append(
    ConceptToken(res.items[0], buffer_start_pos, parser_input.pos - 1)
)
```

`res.items[0]` is a `list[MetadataToken]` (one complete path from `SimpleConceptsParser`), but `ConceptToken.concept` is typed as `Concept`. Any downstream code that uses this `ConceptToken` will receive a list where it expects a `Concept` instance.

**Proposed fix:** define a dedicated container for a recognized parameter (e.g. `ParsedParameterToken`) that wraps a `list[MetadataToken]` with start/end positions, or flatten the result to a single `MetadataToken` when `res.items[0]` contains exactly one token.

---

### 4. Variable-to-parameter mapping not applied

`FinalizeConceptParsing` creates a `MetadataToken` without populating the concept's variables. `parameters = [UT("1 "), UT("2")]` maps positionally to `variables = [("a", NotInit), ("b", NotInit)]`, but this mapping is never applied. The produced `MetadataToken` is therefore incomplete: a downstream evaluator has no way to retrieve parameter values from the token alone.

**Proposed fix:** in `FinalizeConceptParsing`, zip `parameters` with `concept.metadata.variables` and store the result in the `MetadataToken`'s metadata, or pass it as a dedicated field (see the sketch at the end of this document).

---

### 5. `SyaConceptsParser` absent from `other_parsers`

`other_parsers = [SimpleConceptsParser()]`. A parameter can be a simple (parameter-less) concept, but never a composite concept with parameters. True composition — where a parameter is itself a SYA-parsed concept — is structurally impossible with the current design.

**Proposed fix:** add `SyaConceptsParser` to `other_parsers`. A guard is required to prevent infinite recursion: the nested instance should exclude the concept currently being recognized from its search space.

---

### Priority order

| Priority | Issue | Blocks |
|---|---|---|
| 1 | Multi-token parameters (limitation 1) | Practical usability |
| 2 | `ConceptToken` type mismatch (limitation 3) | Correctness |
| 3 | Variable-to-parameter mapping (limitation 4) | Evaluation pipeline |
| 4 | Arity not tracked on the stack (limitation 2) | `must_pop` / precedence |
| 5 | `SyaConceptsParser` absent from `other_parsers` (limitation 5) | Real composition |

Variable-to-parameter mapping and arity tracking (limitations 4 and 2) are interdependent with `must_pop`: implementing them independently, before activating precedence, is still valuable and lays the correct foundation.
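As a closing illustration of the variable-to-parameter mapping (limitation 4), a minimal sketch of the positional zip between a concept's variables and the collected parameters; the tuple shapes below are assumptions, not the actual `ConceptMetadata` layout:

```python
# Sketch of the positional variable binding from limitation 4. The shapes
# below are assumptions, not the actual ConceptMetadata/parameters layout.
variables = [("a", None), ("b", None)]   # concept `a plus b`, values not set
parameters = ["1 ", "2"]                 # values collected during parsing

bindings = {name: value for (name, _), value in zip(variables, parameters)}
print(bindings)  # {'a': '1 ', 'b': '2'}
```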