Sheerka/docs/FEAT-021-sya-concepts-parser.md
kodjo 078d8e5df6 Restarting the project.
Fixing unit tests. Continuing SyaParser
2026-04-12 09:40:04 +02:00


SyaConceptsParser

Purpose

SyaConceptsParser parses sequences of concepts with parameters (variables). It complements SimpleConceptsParser, which only handles parameter-less concepts.

Examples of recognized concepts:

  • a plus b → matches 1 plus 2, x plus y, etc.
  • if a then b end → matches if x > 0 then print x end
  • a long named concept b → matches 1 long named concept 2

The primary goal is concept composition: 1 plus 2 times 3, where times must be evaluated before plus. This precedence problem is what the Shunting Yard Algorithm solves.


The Shunting Yard Algorithm (SYA)

Dijkstra's algorithm (1961) converts an infix expression (1 + 2 * 3) into Reverse Polish Notation (RPN: 1 2 3 * +), respecting operator precedence.

Principle

Two structures: an operator stack and an output queue.

Input: 1  +  2  *  3

Token │ Action                   Stack    Output
──────┼──────────────────────────────────────────────
1     │ → output queue           []       [1]
+     │ → stack                  [+]      [1]
2     │ → output queue           [+]      [1, 2]
*     │ prec(*) > prec(+),       [+, *]   [1, 2]
      │ → stack (no pop)
3     │ → output queue           [+, *]   [1, 2, 3]
end   │ flush stack              []       [1, 2, 3, *, +]

RPN: 1 2 3 * +  ≡  1 + (2 * 3)

Pop rule

When pushing operator op, first pop any stack-top operator top where: precedence(top) >= precedence(op)

This ensures higher-precedence operators are evaluated first.
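The pop rule can be sketched in a few lines of Python. This is a minimal sketch with a hard-coded precedence table for the classic operator case; Sheerka generalizes each piece to multi-token concepts.

```python
# Minimal shunting-yard sketch (assumption: binary operators, one token each).
PRECEDENCE = {"+": 1, "*": 2}

def to_rpn(tokens):
    stack, output = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop rule: pop while the stack top has precedence >= the new operator.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operands go straight to the output queue
    while stack:                # end of input: flush remaining operators
        output.append(stack.pop())
    return output

# to_rpn(["1", "+", "2", "*", "3"]) → ["1", "2", "3", "*", "+"]
```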


Sheerka's Adaptation

The original SYA works on atomic tokens (digits, +, *). Sheerka adapts it for concepts that:

  1. Are identified by multiple tokens — a concept like if a then b end has several keywords (if, then, end) interleaved with parameters. The original SYA identifies an operator with a single token.

  2. Can have N parameters — a binary operator has exactly 2 operands. A Sheerka concept can have 0, 1, 2 or more parameters.

  3. Parameters can themselves be concepts — in 1 plus 2 times 3, the parameter b of plus is the result of the times concept. This recursion is handled by the nested workflow structure.

SYA ↔ Sheerka mapping

Original SYA                 Sheerka
──────────────────────────────────────────────────────────────
Operator (+, *)              ConceptToRecognize (concept with parameters)
Operand (number, variable)   UnrecognizedToken or ConceptToken
Operator stack               state_context.stack
Output queue                 state_context.parameters
Precedence rule              InitConceptParsing.must_pop()
RPN result                   MetadataToken in state_context.result

Structural differences

Multi-token recognition — where SYA reads a single token to identify *, Sheerka must read long named concept (3 tokens) to identify concept a long named concept b. The ReadConcept state handles this sequential reading.

The expected structure — concept if a then b end is decomposed into segments:

[("if ", 0), (" then ", 1), (" end", 1)]
 ─────────    ──────────     ──────────
 keyword       keyword        keyword
 0 params      1 param        1 param
 before        before         before

Each segment states how many parameters precede it and which tokens to consume to validate it.
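As an illustration, here is a hypothetical checker driven by such a segment list. It uses the token-list form of the segments (as in the expected examples later in this document), assumes single-token parameters as in the current implementation, and omits whitespace tokens for brevity.

```python
# Hypothetical sketch: each segment consumes `params_before` parameter tokens,
# then its fixed keyword tokens, in order.
def matches(tokens, segments):
    pos = 0
    for fixed, params_before in segments:
        pos += params_before              # skip the parameter tokens before this segment
        for expected in fixed:
            if pos >= len(tokens) or tokens[pos] != expected:
                return False              # keyword token mismatch or premature EOF
            pos += 1                      # consume one matching keyword token
    return pos == len(tokens)

# Segments for "if a then b end", whitespace tokens omitted:
segments = [(["if"], 0), (["then"], 1), (["end"], 1)]
# matches(["if", "x", "then", "y", "end"], segments) → True
```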

Precedence not yet implemented: must_pop() always returns False. Concept composition with precedence rules is the next implementation step.


Architecture

Two interdependent workflows

graph TD
    A[#tokens_wkf] -->|concept keyword found - fork| B[#concept_wkf]
    A -->|token not a concept keyword - buffered, loop| A
    B -->|concept fully parsed| A
    A -->|EOF| C[end]

The parser always starts in #tokens_wkf. Tokens that do not match any concept keyword are accumulated in a buffer and the loop continues. Whenever a token matching the first keyword of a known concept is detected, a fork is created and sent into #concept_wkf — the main path keeps looping independently. Once the concept is recognized, the fork returns to #tokens_wkf to continue reading.


Workflow #tokens_wkf

stateDiagram-v2
    [*] --> start
    start --> prepare_read_tokens
    prepare_read_tokens --> read_tokens
    read_tokens --> read_tokens : no concept found (loop)
    read_tokens --> eof : EOF
    read_tokens --> concepts_found : concept keyword detected (fork)
    eof --> end : ManageUnrecognized
    concepts_found --> concept_wkf : ManageUnrecognized → #concept_wkf
    end --> [*]

PrepareReadTokens: initializes the buffer and records buffer_start_pos.

ReadTokens: reads one token, calls get_metadata_from_first_token. If a concept can start at this token → fork with a cloned context where concept_to_recognize is set. The main path continues scanning.

ManageUnrecognized("concepts found"): processes the buffer accumulated before the keyword (via SimpleConceptsParser). Unrecognized tokens become UnrecognizedToken and are added to parameters.


Workflow #concept_wkf

stateDiagram-v2
    [*] --> start
    start --> init_concept_parsing
    init_concept_parsing --> manage_parameters

    manage_parameters --> read_concept

    read_concept --> read_parameters : more segments
    read_concept --> finalize_concept : all segments done
    read_concept --> token_mismatch : token mismatch
    read_concept --> error_eof : unexpected EOF

    read_parameters --> manage_parameters : loop
    read_parameters --> finalize_concept : EOF

    finalize_concept --> tokens_wkf : #tokens_wkf
    token_mismatch --> end
    error_eof --> end
    end --> [*]

InitConceptParsing:

  • Verifies the number of already-collected parameters is sufficient
  • Removes the first token of segment 0 (already consumed by ReadTokens)
  • Applies SYA: pushes the concept onto the stack

ReadConcept: reads the fixed tokens of the current segment one by one. If all match → pop(0) the segment and continue.

ReadParameters: reads ONE token into the buffer. Returns to ManageUnrecognized which tries to recognize it via SimpleConceptsParser.

FinalizeConceptParsing:

  • Pops the concept from the stack
  • Computes start (from the first parameter) and end (current position)
  • Creates MetadataToken(concept.metadata, start, end, resolution_method, "sya")
  • Clears stack and parameters
  • Returns to #tokens_wkf

Step-by-step example — "1 plus 2"

Concept: a plus b (variables a, b).

Tokens:

pos :  0    1      2       3    4    5
tok : "1"  " "  "plus"   " "  "2"  EOF

expected for this concept:

[([" ", "plus", " "], 1), ([], 1)]
   segment 0 → 1 param before, read " plus "
   segment 1 → 1 param before, read nothing (concept ends with a param)

Execution trace:

PrepareReadTokens  → buffer_start_pos = 0

ReadTokens "1"    → no concept, buffer = ["1"]
ReadTokens " "    → no concept, buffer = ["1", " "]
ReadTokens "plus" → concept "a plus b" found!

  ┌── FORK ─────────────────────────────────────────────────────┐
  │ clone: buffer=["1"," "], pos=2, concept_to_recognize=CTR(+) │
  └─────────────────────────────────────────────────────────────┘

ManageUnrecognized("concepts found")
  buffer = ["1"," "] → SimpleConceptsParser → not found
  parameters = [UnrecognizedToken("1 ", start=0, end=1)]
  buffer_start_pos = 3
  → #concept_wkf

InitConceptParsing
  expected[0] = ([" ","plus"," "], 1)
  need 1 param → have 1 ✓
  strip leading WS → ["plus"," "]
  pop "plus" (already consumed) → [" "]
  SYA: stack = [CTR(a_plus_b)]

ManageUnrecognized("manage parameters"): buffer empty → nothing

ReadConcept: reads [" "] → pos 3 = " " ✓
  expected.pop(0) → remaining = [([], 1)]
  → "read parameters"

ReadParameters: reads "2" at pos 4
  buffer = ["2"]
  → "manage parameters"

ManageUnrecognized("manage parameters")
  buffer = ["2"] → not a concept
  parameters = [UT("1 ", 0, 1), UT("2", 3, 3)]
  buffer_start_pos = 5

ReadConcept: expected = [([], 1)], reads 0 tokens
  expected.pop(0) → empty → "finalize concept"

FinalizeConceptParsing
  concept = stack.pop() = CTR(a_plus_b)
  start   = parameters[0].start = 0
  end     = parser_input.pos    = 4
  result.append(MetadataToken(metadata, 0, 4, "key", "sya"))
  → #tokens_wkf

ReadTokens → EOF → ManageUnrecognized("eof") → end

Result:

MultipleChoices([
    [MetadataToken(id="1001", start=0, end=4, resolution_method="key", parser="sya")]
])

Example — sequence "1 plus 2 3 plus 7"

Same concept a plus b. The parser recognizes two concepts in one pass.

pos :  0    1      2       3    4    5    6    7      8       9    10   11
tok : "1"  " "  "plus"   " "  "2"  " "  "3"  " "  "plus"   " "  "7"  EOF

After FinalizeConceptParsing for the first concept (pos=4), #tokens_wkf restarts:

PrepareReadTokens   → buffer_start_pos = 5
ReadTokens " "      → buffer = [" "]
ReadTokens "3"      → buffer = [" ","3"]
ReadTokens " "      → buffer = [" ","3"," "]
ReadTokens "plus"   → fork

ManageUnrecognized  → UT(" 3 ", start=5, end=7), buffer_start_pos=9
...
FinalizeConceptParsing
  start = 5, end = 10
  result.append(MetadataToken(1001, 5, 10, "key", "sya"))

Final result (one path, two concepts):

MultipleChoices([
    [
        MetadataToken(1001, start=0,  end=4,  parser="sya"),
        MetadataToken(1001, start=5,  end=10, parser="sya"),
    ]
])

Future example — composition "1 plus 2 times 3"

Note: this example requires implementing must_pop(). Currently must_pop() always returns False.

Concepts: a plus b (low precedence), a times b (high precedence).

Expected behavior after implementation:

Expression:  1  plus  2  times  3

SYA with precedence times > plus:

Token "1"      → parameters = [1]            stack = []
Token "plus"   → stack = [plus]              parameters = [1]
Token "2"      → parameters = [1, 2]         stack = [plus]
Token "times"  → prec(times) > prec(plus) → no pop
                  stack = [plus, times]       parameters = [1, 2]
Token "3"      → parameters = [1, 2, 3]      stack = [plus, times]

Finalize:
  pop "times" → MetadataToken(times, params=[2, 3])
  pop "plus"  → MetadataToken(plus,  params=[1, times_result])

What must_pop() must implement:

def must_pop(self, current_concept, top_of_stack_concept):
    return precedence(top_of_stack_concept) >= precedence(current_concept)

Without this rule, both concepts are processed left-to-right with equal precedence, yielding (1 plus 2) times 3 instead of 1 plus (2 times 3).


The expected structure in detail

For concept if a then b end (key "if __var__0 then __var__1 end"):

_get_expected_tokens("if __var__0 then __var__1 end")

→ [
    (["if", " "],        0),   # read "if " before 1st param
    ([" ", "then", " "], 1),   # read " then " before 2nd param
    ([" ", "end"],       1),   # read " end" — 1 param before, no trailing param
  ]

During parsing, expected is modified in place:

  • InitConceptParsing removes the first token of segment 0 (already read by ReadTokens)
  • ReadConcept consumes the tokens of the current segment then calls pop(0)
  • When expected is empty → FinalizeConceptParsing
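A hypothetical reconstruction of _get_expected_tokens, with the behavior inferred from the examples in this document (not the project's actual code): split the key on __var__N placeholders, tokenize each keyword chunk into words and single spaces, and record how many parameters precede each chunk.

```python
import re

# Hypothetical sketch of _get_expected_tokens, inferred from the examples above.
def get_expected_tokens(key):
    chunks = re.split(r"__var__\d+", key)     # keyword text between placeholders
    segments = []
    for i, chunk in enumerate(chunks):
        if i == 0 and chunk == "":
            continue                          # key starts with a parameter: no leading segment
        tokens = re.findall(r" |\S+", chunk)  # words and individual space tokens
        # Exactly one placeholder separates adjacent chunks, so every segment
        # after the first has 1 parameter before it; a trailing empty chunk
        # becomes ([], 1) for a concept that ends with a parameter.
        segments.append((tokens, 0 if i == 0 else 1))
    return segments
```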

Key data structures

StateMachineContext

StateMachineContext
├── parser_input         ParserInput       token stream + cursor
├── other_parsers        [SimpleConceptsParser]
├── buffer               list[Token]       tokens pending classification
├── buffer_start_pos     int               start position of the current buffer
├── concept_to_recognize ConceptToRecognize | None
├── stack                list[CTR]         SYA — operator stack
├── parameters           list[UT|CT]       SYA — output queue
├── result               list[MetadataToken]
└── errors               list
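For illustration only, the same structure as a Python dataclass. Field types are simplified (Any stands in for the project's Token, ConceptToRecognize, and MetadataToken classes); this is not the project's actual definition.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Illustrative mirror of the tree above; types simplified on purpose.
@dataclass
class StateMachineContext:
    parser_input: Any                                   # token stream + cursor
    other_parsers: list = field(default_factory=list)   # [SimpleConceptsParser]
    buffer: list = field(default_factory=list)          # tokens pending classification
    buffer_start_pos: int = 0                           # start position of the buffer
    concept_to_recognize: Optional[Any] = None
    stack: list = field(default_factory=list)           # SYA operator stack
    parameters: list = field(default_factory=list)      # SYA output queue
    result: list = field(default_factory=list)          # MetadataToken output
    errors: list = field(default_factory=list)
```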

MetadataToken (output)

MetadataToken
├── metadata          ConceptMetadata   (id, name, key, variables, ...)
├── start             int               position of the first token of the expression
├── end               int               position of the last token
├── resolution_method "key" | "name" | "id"
└── parser            "sya"

Token positions in "1 plus 2":

"1 plus 2"
 0 1  2   3 4
 │ │  │   │ │
 1 _  plus _ 2

MetadataToken: start=0, end=4

Differences vs SimpleConceptsParser

                    SimpleConceptsParser               SyaConceptsParser
Target concepts     No parameters                      With parameters
concept_wkf states  2                                  8
result contents     MetadataToken + UnrecognizedToken  MetadataToken only
Parameters          N/A                                Collected in parameters list
Parser tag          "simple"                           "sya"
SYA                 No                                 Yes (precedence to implement)

Error handling

Error                Cause                                   State reached
UnexpectedToken      Read token ≠ expected concept token     TokenMismatch → end
UnexpectedEof        Input ends before concept is complete   ErrorEof → end
NotEnoughParameters  Too few params before a segment         Exception raised

Errors are collected from all paths and forwarded to error_sink in parse(). A path with errors is excluded from _select_best_paths.


Known limitations and proposed improvements

The current implementation correctly handles simple cases (single-token parameters, non-nested concepts). The following issues must be addressed before enabling precedence and real concept composition.

1. Parameters are limited to a single token

ReadParameters reads ONE token, then immediately calls ManageUnrecognized, which returns to ReadConcept to match the next keyword segment. Multi-token parameters therefore fail. For if hello world then foo end with parameter a = "hello world":

ReadParameters reads "hello"
ManageUnrecognized → UT("hello") → ReadConcept tries to match " then "
ReadConcept reads " " ✓  then "world" ≠ "then" → MISMATCH

Proposed fix: ReadParameters should accumulate tokens until it detects the start of the next keyword segment (lookahead on expected[0][0]), then hand the full buffer to ManageUnrecognized for parsing in one pass.
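The lookahead can be sketched as follows (hypothetical function and names, not existing code; whitespace-level tokens as in the examples above):

```python
# Sketch of the proposed fix: accumulate parameter tokens until the upcoming
# tokens match the fixed tokens of the next keyword segment (expected[0][0]).
def read_parameter(tokens, pos, next_segment_tokens):
    """Return (parameter_tokens, new_pos), stopping before next_segment_tokens."""
    buffer = []
    n = len(next_segment_tokens)
    while pos < len(tokens):
        # Lookahead: does the next keyword segment start here?
        if n and tokens[pos:pos + n] == next_segment_tokens:
            break
        buffer.append(tokens[pos])
        pos += 1
    return buffer, pos

tokens = ["if", " ", "hello", " ", "world", " ", "then", " ", "foo", " ", "end"]
# After consuming ["if", " "], accumulate until " then " is seen at pos 5:
# read_parameter(tokens, 2, [" ", "then", " "]) → (["hello", " ", "world"], 5)
```

A parameter that literally contains the keyword sequence would stop the scan early; such ambiguity would presumably be handled the way the parser already handles alternatives, by forking.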


2. Flat parameters list with no arity tracking

When FinalizeConceptParsing runs, parameters is a flat list. There is no information about how many parameters belong to each concept on the stack. Once must_pop is active and multiple concepts are stacked, FinalizeConceptParsing cannot reconstruct the correct nesting.

Example: 1 plus 2 times 3 with stack = [plus, times] and parameters = [UT("1"), UT("2"), UT("3")]. Without arity information there is no way to determine that times consumes the last two parameters and plus consumes the first one and the result of times.

The arity of each concept (nb_variables) is available in expected at push time but is lost once expected is consumed during parsing.

Proposed fix: record the arity of each concept when it is pushed onto the stack (in apply_shunting_yard_algorithm). FinalizeConceptParsing then pops the correct number of parameters for each concept, from innermost to outermost, building intermediate MetadataToken objects that are re-injected into parameters as ConceptToken before processing the next concept on the stack.
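The proposed fix can be sketched with plain tuples standing in for CTR and MetadataToken (hypothetical shapes, not the project's types):

```python
# Sketch of arity tracking: push (concept, arity) instead of the bare concept,
# then fold the stack innermost-first, re-injecting each result as a parameter.
def finalize(stack, parameters):
    while stack:
        concept, arity = stack.pop()        # innermost concept first; assumes arity >= 1
        args = parameters[-arity:]          # it consumes the last `arity` parameters
        del parameters[-arity:]
        parameters.append((concept, args))  # re-inject the result as a parameter
    return parameters[0]

stack = [("plus", 2), ("times", 2)]
parameters = ["1", "2", "3"]
# finalize(stack, parameters) → ("plus", ["1", ("times", ["2", "3"])])
```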


3. Type mismatch in ManageUnrecognized for recognized parameters

When SimpleConceptsParser recognizes a token sequence, ManageUnrecognized creates:

state_context.parameters.append(
    ConceptToken(res.items[0], buffer_start_pos, parser_input.pos - 1)
)

res.items[0] is a list[MetadataToken] (one complete path from SimpleConceptsParser), but ConceptToken.concept is typed as Concept. Any downstream code that uses this ConceptToken will receive a list where it expects a Concept instance.

Proposed fix: define a dedicated container for a recognized parameter (e.g. ParsedParameterToken) that wraps a list[MetadataToken] with start/end positions, or flatten the result to a single MetadataToken when res.items[0] contains exactly one token.
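A minimal sketch of the proposed container (hypothetical class, not existing code):

```python
from dataclasses import dataclass

# Sketch of ParsedParameterToken: keeps the list-of-MetadataToken path with
# its span instead of stuffing a list into ConceptToken.concept.
@dataclass
class ParsedParameterToken:
    tokens: list          # the path returned by SimpleConceptsParser (res.items[0])
    start: int
    end: int

    def single(self):
        """Flatten to the lone token when the path contains exactly one entry."""
        return self.tokens[0] if len(self.tokens) == 1 else self
```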


4. Variable-to-parameter mapping not applied

FinalizeConceptParsing creates a MetadataToken without populating the concept's variables. parameters = [UT("1 "), UT("2")] maps positionally to variables = [("a", NotInit), ("b", NotInit)], but this mapping is never applied. The produced MetadataToken is therefore incomplete: a downstream evaluator has no way to retrieve parameter values from the token alone.

Proposed fix: in FinalizeConceptParsing, zip parameters with concept.metadata.variables and store the result in the MetadataToken's metadata, or pass it as a dedicated field.
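The binding step can be sketched like this (hypothetical helper; plain tuples and strings stand in for the real variable and token types):

```python
# Sketch of the proposed fix: zip the positional parameters with the
# concept's declared variable names before emitting the MetadataToken.
def bind_variables(variables, parameters):
    names = [name for name, _init in variables]
    assert len(names) == len(parameters), "arity mismatch"
    return dict(zip(names, parameters))

variables = [("a", None), ("b", None)]     # None stands in for NotInit
parameters = ["1 ", "2"]                   # UT("1 "), UT("2") in the real parser
# bind_variables(variables, parameters) → {"a": "1 ", "b": "2"}
```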


5. SyaConceptsParser absent from other_parsers

other_parsers = [SimpleConceptsParser()]. A parameter can be a simple (parameter-less) concept, but never a composite concept with parameters. True composition, where a parameter is itself a SYA-parsed concept, is structurally impossible with the current design.

Proposed fix: add SyaConceptsParser to other_parsers. A guard is required to prevent infinite recursion: the nested instance should exclude the concept currently being recognized from its search space.


Priority order

#  Issue                                        Blocking
1  Multi-token parameters                       Practical usability
2  ConceptToken type mismatch                   Correctness
3  Variable-to-parameter mapping                Evaluation pipeline
4  Arity not tracked on the stack               must_pop / precedence
5  SyaConceptsParser absent from other_parsers  Real composition

Priority items 3 and 4 are interdependent with must_pop: implementing them independently (before activating precedence) is still valuable and lays the correct foundation.