# SyaConceptsParser
## Purpose
SyaConceptsParser parses sequences of concepts with parameters (variables).
It complements SimpleConceptsParser, which only handles parameter-less concepts.
Examples of recognized concepts:
- `a plus b` → matches `1 plus 2`, `x plus y`, etc.
- `if a then b end` → matches `if x > 0 then print x end`
- `a long named concept b` → matches `1 long named concept 2`
The primary goal is concept composition: 1 plus 2 times 3, where times must
be evaluated before plus. This precedence problem is what the Shunting Yard
Algorithm solves.
## The Shunting Yard Algorithm (SYA)
Dijkstra's algorithm (1961) converts an infix expression (1 + 2 * 3) into
Reverse Polish Notation (RPN: 1 2 3 * +), respecting operator precedence.
### Principle
Two structures: an operator stack and an output queue.
Input: 1 + 2 * 3
Token │ Action               Stack     Output
──────┼─────────────────────────────────────────────
1     │ → output queue       []        [1]
+     │ → stack              [+]       [1]
2     │ → output queue       [+]       [1, 2]
*     │ prec(*) > prec(+)    [+, *]    [1, 2]
      │ → stack (no pop)
3     │ → output queue       [+, *]    [1, 2, 3]
end   │ flush stack          []        [1, 2, 3, *, +]
RPN: 1 2 3 * + ≡ 1 + (2 * 3)
### Pop rule
When pushing operator `op`, first pop any stack-top operator `top` where:
`precedence(top) >= precedence(op)`
This ensures higher-precedence operators are evaluated first.
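The classic algorithm with this pop rule fits in a few lines of Python. This is a sketch over flat string tokens, not Sheerka's token types:

```python
# Minimal classic Shunting Yard for binary operators, illustrating the
# pop rule above. The flat string-token input is an illustrative
# simplification.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    """Convert infix tokens to Reverse Polish Notation."""
    stack, output = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop rule: pop while the stack top has precedence >= tok's.
            while stack and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # operand goes straight to the output queue
    while stack:                # end of input: flush remaining operators
        output.append(stack.pop())
    return output

to_rpn(["1", "+", "2", "*", "3"])  # → ["1", "2", "3", "*", "+"]
```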
## Sheerka's Adaptation
The original SYA works on atomic tokens (digits, +, *).
Sheerka adapts it for concepts that:
- Are identified by multiple tokens — a concept like `if a then b end` has several keywords (`if`, `then`, `end`) interleaved with parameters. The original SYA identifies an operator with a single token.
- Can have N parameters — a binary operator has exactly 2 operands. A Sheerka concept can have 0, 1, 2 or more parameters.
- Parameters can themselves be concepts — in `1 plus 2 times 3`, the parameter `b` of `plus` is the result of the `times` concept. This recursion is handled by the nested workflow structure.
### SYA ↔ Sheerka mapping
| Original SYA | Sheerka |
|---|---|
| Operator (`+`, `*`) | `ConceptToRecognize` (concept with parameters) |
| Operand (number, variable) | `UnrecognizedToken` or `ConceptToken` |
| Operator stack | `state_context.stack` |
| Output queue | `state_context.parameters` |
| Precedence rule | `InitConceptParsing.must_pop()` |
| RPN result | `MetadataToken` in `state_context.result` |
### Structural differences
Multi-token recognition — where SYA reads a single token to identify *,
Sheerka must read long named concept (3 tokens) to identify concept
a long named concept b. The ReadConcept state handles this sequential reading.
The expected structure — concept `if a then b end` is decomposed into segments:
[("if ", 0), (" then ", 1), (" end", 1)]
 ─────────   ───────────    ──────────
 keyword     keyword        keyword
 0 params    1 param        1 param
 before      before         before
Each segment states how many parameters precede it and which tokens to consume to validate it.
Precedence not yet implemented — must_pop() always returns False.
Concept composition with precedence rules is the next implementation step.
## Architecture
### Two interdependent workflows
graph TD
A[#tokens_wkf] -->|concept keyword found - fork| B[#concept_wkf]
A -->|token not a concept keyword - buffered, loop| A
B -->|concept fully parsed| A
A -->|EOF| C[end]
The parser always starts in #tokens_wkf. Tokens that do not match any concept
keyword are accumulated in a buffer and the loop continues. Whenever a token
matching the first keyword of a known concept is detected, a fork is created
and sent into #concept_wkf — the main path keeps looping independently. Once the
concept is recognized, the fork returns to #tokens_wkf to continue reading.
### Workflow #tokens_wkf
stateDiagram-v2
[*] --> start
start --> prepare_read_tokens
prepare_read_tokens --> read_tokens
read_tokens --> read_tokens : no concept found (loop)
read_tokens --> eof : EOF
read_tokens --> concepts_found : concept keyword detected (fork)
eof --> end : ManageUnrecognized
concepts_found --> concept_wkf : ManageUnrecognized → #concept_wkf
end --> [*]
PrepareReadTokens: initializes the buffer and records buffer_start_pos.
ReadTokens: reads one token, calls get_metadata_from_first_token. If a concept
can start at this token → fork with a cloned context where concept_to_recognize
is set. The main path continues scanning.
ManageUnrecognized("concepts found"): processes the buffer accumulated before
the keyword (via SimpleConceptsParser). Unrecognized tokens become
UnrecognizedToken and are added to parameters.
### Workflow #concept_wkf
stateDiagram-v2
[*] --> start
start --> init_concept_parsing
init_concept_parsing --> manage_parameters
manage_parameters --> read_concept
read_concept --> read_parameters : more segments
read_concept --> finalize_concept : all segments done
read_concept --> token_mismatch : token mismatch
read_concept --> error_eof : unexpected EOF
read_parameters --> manage_parameters : loop
read_parameters --> finalize_concept : EOF
finalize_concept --> tokens_wkf : #tokens_wkf
token_mismatch --> end
error_eof --> end
end --> [*]
InitConceptParsing:
- Verifies the number of already-collected parameters is sufficient
- Removes the first token of segment 0 (already consumed by ReadTokens)
- Applies SYA: pushes the concept onto the stack
ReadConcept: reads the fixed tokens of the current segment one by one.
If all match → pop(0) the segment and continue.
ReadParameters: reads ONE token into the buffer. Returns to
ManageUnrecognized which tries to recognize it via SimpleConceptsParser.
FinalizeConceptParsing:
- Pops the concept from the stack
- Computes `start` (from the first parameter) and `end` (current position)
- Creates `MetadataToken(concept.metadata, start, end, resolution_method, "sya")`
- Clears stack and parameters
- Returns to `#tokens_wkf`
## Step-by-step example — "1 plus 2"
Concept: a plus b (variables a, b).
Tokens:
pos : 0 1 2 3 4 5
tok : "1" " " "plus" " " "2" EOF
expected for this concept:
[([" ", "plus", " "], 1), ([], 1)]
segment 0 → 1 param before, read " plus "
segment 1 → 1 param before, read nothing (concept ends with a param)
Execution trace:
PrepareReadTokens → buffer_start_pos = 0
ReadTokens "1" → no concept, buffer = ["1"]
ReadTokens " " → no concept, buffer = ["1", " "]
ReadTokens "plus" → concept "a plus b" found!
┌── FORK ─────────────────────────────────────────────────────┐
│ clone: buffer=["1"," "], pos=2, concept_to_recognize=CTR(+) │
└─────────────────────────────────────────────────────────────┘
ManageUnrecognized("concepts found")
buffer = ["1"," "] → SimpleConceptsParser → not found
parameters = [UnrecognizedToken("1 ", start=0, end=1)]
buffer_start_pos = 3
→ #concept_wkf
InitConceptParsing
expected[0] = ([" ","plus"," "], 1)
need 1 param → have 1 ✓
strip leading WS → ["plus"," "]
pop "plus" (already consumed) → [" "]
SYA: stack = [CTR(a_plus_b)]
ManageUnrecognized("manage parameters"): buffer empty → nothing
ReadConcept: reads [" "] → pos 3 = " " ✓
expected.pop(0) → remaining = [([], 1)]
→ "read parameters"
ReadParameters: reads "2" at pos 4
buffer = ["2"]
→ "manage parameters"
ManageUnrecognized("manage parameters")
buffer = ["2"] → not a concept
parameters = [UT("1 ", 0, 1), UT("2", 3, 3)]
buffer_start_pos = 5
ReadConcept: expected = [([], 1)], reads 0 tokens
expected.pop(0) → empty → "finalize concept"
FinalizeConceptParsing
concept = stack.pop() = CTR(a_plus_b)
start = parameters[0].start = 0
end = parser_input.pos = 4
result.append(MetadataToken(metadata, 0, 4, "key", "sya"))
→ #tokens_wkf
ReadTokens → EOF → ManageUnrecognized("eof") → end
Result:
MultipleChoices([
[MetadataToken(id="1001", start=0, end=4, resolution_method="key", parser="sya")]
])
## Example — sequence "1 plus 2 3 plus 7"
Same concept a plus b. The parser recognizes two concepts in one pass.
pos : 0 1 2 3 4 5 6 7 8 9 10 11
tok : "1" " " "plus" " " "2" " " "3" " " "plus" " " "7" EOF
After FinalizeConceptParsing for the first concept (pos=4), #tokens_wkf restarts:
PrepareReadTokens → buffer_start_pos = 5
ReadTokens " " → buffer = [" "]
ReadTokens "3" → buffer = [" ","3"]
ReadTokens " " → buffer = [" ","3"," "]
ReadTokens "plus" → fork
ManageUnrecognized → UT(" 3 ", start=5, end=7), buffer_start_pos=9
...
FinalizeConceptParsing
start = 5, end = 10
result.append(MetadataToken(1001, 5, 10, "key", "sya"))
Final result (one path, two concepts):
MultipleChoices([
[
MetadataToken(1001, start=0, end=4, parser="sya"),
MetadataToken(1001, start=5, end=10, parser="sya"),
]
])
## Future example — composition "1 plus 2 times 3"
Note: this example requires implementing `must_pop()`. Currently `must_pop()` always returns `False`.
Concepts: a plus b (low precedence), a times b (high precedence).
Expected behavior after implementation:
Expression: 1 plus 2 times 3
SYA with precedence times > plus:
Token "1" → parameters = [1] stack = []
Token "plus" → stack = [plus] parameters = [1]
Token "2" → parameters = [1, 2] stack = [plus]
Token "times" → prec(times) > prec(plus) → no pop
stack = [plus, times] parameters = [1, 2]
Token "3" → parameters = [1, 2, 3] stack = [plus, times]
Finalize:
pop "times" → MetadataToken(times, params=[2, 3])
pop "plus" → MetadataToken(plus, params=[1, times_result])
What must_pop() must implement:
def must_pop(self, current_concept, top_of_stack_concept):
return precedence(top_of_stack_concept) >= precedence(current_concept)
Without this rule, both concepts are processed left-to-right with equal precedence,
yielding (1 plus 2) times 3 instead of 1 plus (2 times 3).
## The expected structure in detail
For concept if a then b end (key "if __var__0 then __var__1 end"):
_get_expected_tokens("if __var__0 then __var__1 end")
→ [
(["if", " "], 0), # read "if " before 1st param
([" ", "then", " "], 1), # read " then " before 2nd param
([" ", "end"], 1), # read " end" — 1 param before, no trailing param
]
During parsing, `expected` is modified in place:
- `InitConceptParsing` removes the first token of segment 0 (already read by `ReadTokens`)
- `ReadConcept` consumes the tokens of the current segment then calls `pop(0)`
- When `expected` is empty → `FinalizeConceptParsing`
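As a sketch, a hypothetical re-implementation of this decomposition could look like the following. The name `get_expected_tokens` and the regex-based tokenization are assumptions; the real `_get_expected_tokens` may differ in detail:

```python
import re

# Hypothetical re-implementation of the segment decomposition; the real
# _get_expected_tokens lives in SyaConceptsParser.
_VAR = re.compile(r"__var__\d+")

def _tokenize(literal):
    """Split a literal chunk into keyword/space tokens: " then " -> [" ", "then", " "]."""
    return re.findall(r" |[^ ]+", literal)

def get_expected_tokens(key):
    segments = []
    params_since_last = 0  # variables seen since the previous keyword segment
    pos = 0
    for m in _VAR.finditer(key):
        literal = key[pos:m.start()]
        if literal:
            segments.append((_tokenize(literal), params_since_last))
            params_since_last = 0
        params_since_last += 1
        pos = m.end()
    tail = key[pos:]
    if tail:
        segments.append((_tokenize(tail), params_since_last))
    elif params_since_last:
        segments.append(([], params_since_last))  # key ends with a parameter
    return segments

get_expected_tokens("if __var__0 then __var__1 end")
# → [(["if", " "], 0), ([" ", "then", " "], 1), ([" ", "end"], 1)]
```

The trailing `([], 1)` branch reproduces the `a plus b` case, where the concept ends with a parameter and the last segment reads no tokens.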
## Key data structures
### StateMachineContext
StateMachineContext
├── parser_input ParserInput token stream + cursor
├── other_parsers [SimpleConceptsParser]
├── buffer list[Token] tokens pending classification
├── buffer_start_pos int start position of the current buffer
├── concept_to_recognize ConceptToRecognize | None
├── stack list[CTR] SYA — operator stack
├── parameters list[UT|CT] SYA — output queue
├── result list[MetadataToken]
└── errors list
### MetadataToken (output)
MetadataToken
├── metadata ConceptMetadata (id, name, key, variables, ...)
├── start int position of the first token of the expression
├── end int position of the last token
├── resolution_method "key" | "name" | "id"
└── parser "sya"
Token positions in "1 plus 2":
pos :  0    1    2      3    4
tok : "1"  " "  "plus" " "  "2"
MetadataToken: start=0, end=4
## Differences vs SimpleConceptsParser

| | SimpleConceptsParser | SyaConceptsParser |
|---|---|---|
| Target concepts | No parameters | With parameters |
| `concept_wkf` states | 2 | 8 |
| `result` contents | `MetadataToken` + `UnrecognizedToken` | `MetadataToken` only |
| Parameters | N/A | Collected in `parameters` list |
| Parser tag | `"simple"` | `"sya"` |
| SYA | No | Yes (precedence to implement) |
## Error handling

| Error | Cause | State reached |
|---|---|---|
| `UnexpectedToken` | Read token ≠ expected concept token | `TokenMismatch` → end |
| `UnexpectedEof` | Input ends before concept is complete | `ErrorEof` → end |
| `NotEnoughParameters` | Too few params before a segment | Exception raised |
Errors are collected from all paths and forwarded to error_sink in parse().
A path with errors is excluded from _select_best_paths.
## Known limitations and proposed improvements
The current implementation correctly handles simple cases (single-token parameters, non-nested concepts). The following issues must be addressed before enabling precedence and real concept composition.
### 1. Parameters are limited to a single token
ReadParameters reads ONE token, then immediately calls ManageUnrecognized, which
returns to ReadConcept to match the next keyword segment. Multi-token parameters
therefore fail. For if hello world then foo end with parameter a = "hello world":
ReadParameters reads "hello"
ManageUnrecognized → UT("hello") → ReadConcept tries to match " then "
ReadConcept reads " " ✓ then "world" ≠ "then" → MISMATCH
Proposed fix: ReadParameters should accumulate tokens until it detects the
start of the next keyword segment (lookahead on expected[0][0]), then hand the
full buffer to ManageUnrecognized for parsing in one pass.
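A minimal sketch of that lookahead. The name `read_parameter_tokens` and the flat string-token stream are hypothetical simplifications of the real state-machine context; `next_segment` stands for `expected[0][0]`:

```python
# Sketch of the proposed lookahead fix for multi-token parameters.
def read_parameter_tokens(tokens, pos, next_segment):
    """Accumulate parameter tokens until `next_segment` starts at `pos`."""
    buffer = []
    n = len(next_segment)
    while pos < len(tokens):
        # Lookahead: stop as soon as the upcoming tokens spell out the
        # next keyword segment (e.g. [" ", "then", " "]).
        if n and tokens[pos:pos + n] == next_segment:
            break
        buffer.append(tokens[pos])
        pos += 1
    return buffer, pos

# "if hello world then foo end", after "if " has been consumed (pos = 2):
tokens = ["if", " ", "hello", " ", "world", " ", "then", " ", "foo", " ", "end"]
buffer, pos = read_parameter_tokens(tokens, 2, [" ", "then", " "])
# buffer == ["hello", " ", "world"], pos == 5 (start of " then ")
```

One caveat: a parameter that itself contains the keyword sequence would be cut short, so a real implementation needs a disambiguation rule, or backtracking via the existing fork mechanism.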
### 2. Flat parameters list with no arity tracking
When FinalizeConceptParsing runs, parameters is a flat list. There is no
information about how many parameters belong to each concept on the stack. Once
must_pop is active and multiple concepts are stacked, FinalizeConceptParsing
cannot reconstruct the correct nesting.
Example: 1 plus 2 times 3 with stack = [plus, times] and
parameters = [UT("1"), UT("2"), UT("3")]. Without arity information there is no
way to determine that times consumes the last two parameters and plus consumes
the first one and the result of times.
The arity of each concept (nb_variables) is available in expected at push time
but is lost once expected is consumed during parsing.
Proposed fix: record the arity of each concept when it is pushed onto the stack
(in apply_shunting_yard_algorithm). FinalizeConceptParsing then pops the correct
number of parameters for each concept, from innermost to outermost, building
intermediate MetadataToken objects that are re-injected into parameters as
ConceptToken before processing the next concept on the stack.
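A simplified sketch of that finalization with arity tracking. Concepts and parameters are plain Python values here, and nested tuples stand in for the intermediate `ConceptToken` objects:

```python
# Sketch of FinalizeConceptParsing with arity recorded at push time.
def finalize(stack, parameters):
    """Pop each concept with its arity, innermost first, re-injecting
    the intermediate result into `parameters` for the next concept."""
    while stack:
        concept, arity = stack.pop()        # arity recorded when pushed
        args = parameters[-arity:]          # this concept's own parameters
        del parameters[-arity:]
        parameters.append((concept, args))  # stands in for a ConceptToken
    return parameters

# "1 plus 2 times 3": stack = [plus, times], parameters = [1, 2, 3]
finalize([("plus", 2), ("times", 2)], [1, 2, 3])
# → [("plus", [1, ("times", [2, 3])])]
```

With the arity available, `times` correctly consumes the last two parameters and `plus` consumes the first one plus the `times` result.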
### 3. Type mismatch in ManageUnrecognized for recognized parameters
When SimpleConceptsParser recognizes a token sequence, ManageUnrecognized
creates:
state_context.parameters.append(
ConceptToken(res.items[0], buffer_start_pos, parser_input.pos - 1)
)
res.items[0] is a list[MetadataToken] (one complete path from
SimpleConceptsParser), but ConceptToken.concept is typed as Concept. Any
downstream code that uses this ConceptToken will receive a list where it expects a
Concept instance.
Proposed fix: define a dedicated container for a recognized parameter (e.g.
ParsedParameterToken) that wraps a list[MetadataToken] with start/end positions,
or flatten the result to a single MetadataToken when res.items[0] contains
exactly one token.
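A minimal sketch of the proposed container. The name `ParsedParameterToken` comes from the proposal above; the exact fields are an assumption:

```python
from dataclasses import dataclass

# Hypothetical container for a parameter recognized by a nested parser:
# wraps one complete path (a list of MetadataToken) with its positions,
# instead of stuffing the list into ConceptToken.concept.
@dataclass
class ParsedParameterToken:
    tokens: list  # list[MetadataToken], one complete path
    start: int    # position of the first token of the parameter
    end: int      # position of the last token of the parameter
```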
### 4. Variable-to-parameter mapping not applied
FinalizeConceptParsing creates a MetadataToken without populating the concept's
variables. parameters = [UT("1 "), UT("2")] maps positionally to
variables = [("a", NotInit), ("b", NotInit)], but this mapping is never applied.
The produced MetadataToken is therefore incomplete: a downstream evaluator has no
way to retrieve parameter values from the token alone.
Proposed fix: in FinalizeConceptParsing, zip parameters with
concept.metadata.variables and store the result in the MetadataToken's metadata,
or pass it as a dedicated field.
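A sketch of that positional binding, with simplified structures (variables as `(name, initial_value)` pairs rather than the real variable objects):

```python
# Sketch of the proposed variable binding in FinalizeConceptParsing.
def bind_variables(variables, parameters):
    """Positionally map collected parameters onto the concept's variables."""
    if len(variables) != len(parameters):
        raise ValueError("parameter count does not match variable count")
    # Replace each variable's NotInit placeholder with its parameter.
    return [(name, param) for (name, _not_init), param in zip(variables, parameters)]

# "1 plus 2": variables [("a", NotInit), ("b", NotInit)],
# parameters [UT("1 "), UT("2")] (plain strings here)
bind_variables([("a", None), ("b", None)], ["1 ", "2"])
# → [("a", "1 "), ("b", "2")]
```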
### 5. SyaConceptsParser absent from other_parsers
other_parsers = [SimpleConceptsParser()]. A parameter can be a simple (parameter-
less) concept, but never a composite concept with parameters. True composition —
where a parameter is itself a SYA-parsed concept — is structurally impossible with
the current design.
Proposed fix: add SyaConceptsParser to other_parsers. A guard is required
to prevent infinite recursion: the nested instance should exclude the concept
currently being recognized from its search space.
## Priority order

| # | Issue | Blocking |
|---|---|---|
| 1 | Multi-token parameters | Practical usability |
| 2 | ConceptToken type mismatch | Correctness |
| 3 | Variable-to-parameter mapping | Evaluation pipeline |
| 4 | Arity not tracked on the stack | must_pop / precedence |
| 5 | SyaConceptsParser absent from other_parsers | Real composition |
Issues 3 and 4 are interdependent with must_pop: implementing them independently
(before activating precedence) is still valuable and lays the correct foundation.