ConceptLexerParser can now handle UnrecognizedTokens

This commit is contained in:
2019-12-26 15:20:45 +01:00
parent bcb2308ea5
commit 26daae4acf
8 changed files with 483 additions and 125 deletions
@@ -19,7 +19,7 @@ For those of you who don't know this old cartoon, it's the Odyssey story from Homer,
ported to the 31st century. Ulysses has a spacecraft with an AI named Shyrka.
I was a great fan of this cartoon when I was young. I thought that the idea of
-bringing the ancient story of Ulysses in the future was a bright.
+bringing the ancient story of Ulysses in the future was bright.
Ever since then, Sheerka was my reference for any sophisticated computer. Unfortunately
for me, at that time there was no wikipedia to tell me the correct spelling.
@@ -654,3 +654,99 @@ For the two questions, I will first try the simple implementations and see there
* the entry in sdp will not be all_number, but all_id_of_number. I will use the concept id instead of its name
2019-12-24
**********
Going back on BNF implementation. As it's Christmas eve today, I won't stay very long.
So, the implementation lies in the class ConceptLexerParser, as it's a lexer not for tokens, but for concepts.
The purpose of this class is to recognize a sequence of Concepts.
So if we define the following concepts
::
    def concept foo from bnf one two three
    def concept bar from bnf four five
when you input
::
    one two three four five
the list :code:`[foo, bar]` will be returned by the parser (as return values).
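A toy version of this recognition loop (a sketch with invented names, not the actual ConceptLexerParser code) could look like:

```python
# Toy concept lexer: turns a token sequence into a list of concept names.
# Sketch only -- `definitions` and `lex_concepts` are invented names.

definitions = {
    "foo": ["one", "two", "three"],  # def concept foo from bnf one two three
    "bar": ["four", "five"],         # def concept bar from bnf four five
}

def lex_concepts(text, definitions):
    tokens = text.split()
    result, pos = [], 0
    while pos < len(tokens):
        for concept, pattern in definitions.items():
            # Does some concept's pattern start at the current position?
            if tokens[pos:pos + len(pattern)] == pattern:
                result.append(concept)
                pos += len(pattern)
                break
        else:
            pos += 1  # no concept starts here: skip the token
    return result

print(lex_concepts("one two three four five", definitions))  # ['foo', 'bar']
```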
How does it work?
As explained in the code, my implementation is highly inspired by the Arpeggio project. To define your grammar, you
use **ParsingExpressions**. There are several types:
* some are used to recognize tokens: StrMatch, ConceptMatch
* others are used to tell how to combine them: Sequence, OrderedChoice, Optional, OneOrMore, ZeroOrMore...
Some examples:
::
    to recognize 'foo' -> StrMatch("foo")
    to recognize 'foo bar' -> Sequence(StrMatch("foo"), StrMatch("bar"))
    to recognize 'foo' or 'bar' -> OrderedChoice(StrMatch("foo"), StrMatch("bar"))
and so on...
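For illustration, here is a minimal sketch of what such parsing expressions can look like over a token list (a simplified, invented `match` API; not the actual implementation):

```python
# Minimal parsing expressions over a token list. Each `match` returns the
# position after the match on success, or None on failure. Sketch only.

class StrMatch:
    """Matches one token by its exact text."""
    def __init__(self, text):
        self.text = text

    def match(self, tokens, pos):
        if pos < len(tokens) and tokens[pos] == self.text:
            return pos + 1
        return None

class Sequence:
    """Matches all sub-expressions one after the other."""
    def __init__(self, *parts):
        self.parts = parts

    def match(self, tokens, pos):
        for part in self.parts:
            pos = part.match(tokens, pos)
            if pos is None:
                return None
        return pos

class OrderedChoice:
    """Tries each alternative in order, keeps the first that matches."""
    def __init__(self, *alternatives):
        self.alternatives = alternatives

    def match(self, tokens, pos):
        for alt in self.alternatives:
            end = alt.match(tokens, pos)
            if end is not None:
                return end
        return None

expr = Sequence(StrMatch("foo"), StrMatch("bar"))
print(expr.match(["foo", "bar"], 0))  # 2 -- the position after the match
```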
So when a concept is defined using its bnf definition, I use the **BnfParser** to create the grammar, and then
I use the **ConceptLexerParser** to recognize the concepts.
The current implementation to recognize a concept is not very efficient. All the definitions are in a dictionary
and I go through the whole dictionary to see if some concepts are recognized. Once a concept is found, I loop again
on the whole dictionary to find the next one.
| -> I need a btree to order the concepts
| -> I need a predictive algorithm to guess the next concept
But it is for later.
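One small step in that direction (just a sketch of the indexing idea, nothing that exists in the code) is to bucket the definitions by their first token, so that only a few candidates are tried at each position instead of the whole dictionary:

```python
from collections import defaultdict

def index_by_first_token(definitions):
    """Bucket concept patterns by the first token they expect.

    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for concept, pattern in definitions.items():
        index[pattern[0]].append((concept, pattern))
    return index

index = index_by_first_token({"foo": ["one", "two", "three"],
                              "bar": ["four", "five"]})
# At a position where the current token is "one", only the "foo"
# candidates need to be tried.
print(index["one"])  # [('foo', ['one', 'two', 'three'])]
```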
So once the parsing is done, I return a **ConceptNode** object
.. code-block:: python

    class ConceptNode(LexerNode):
        """
        Returned by the ConceptLexerParser
        It represents a recognized concept
        """
        def __init__(self, concept, start, end, tokens=None, source=None, underlying=None):
            super().__init__(start, end, tokens, source)
            self.concept = concept
            self.underlying = underlying
            if self.source is None:
                self.source = BaseParser.get_text_from_tokens(self.tokens)
concept
    | Remember that all grammars are listed in a dictionary of <Concept, ParsingExpression>.
    | So when a parsing expression is verified, it's easy to link it with the concept
start
    position of the first token
end
    position of the last token
tokens
    list of tokens that are recognized
underlying
    **NonTerminalNode** or **TerminalNode** that wraps the underlying **ParsingExpression** used to recognize the concept
source
    | The source is deduced from the tokens
    | But in the unit tests, it is given directly for speed and simplicity
What is the difference between the **[Non]TerminalNode** and the **ParsingExpression**?

The ParsingExpression
    defines how to recognize a concept
The [Non]TerminalNode
    represents what was found. So, similarly to the ConceptNode, you will find the start, end and tokens attributes
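To make that split concrete, a minimal sketch (invented `parse` API and simplified classes, not the actual code):

```python
class TerminalNode:
    """Result side: records *what* was found, and where."""
    def __init__(self, start, end, tokens):
        self.start, self.end, self.tokens = start, end, tokens

class StrMatch:
    """Grammar side: defines *how* to recognize one token."""
    def __init__(self, text):
        self.text = text

    def parse(self, tokens, pos):
        # On success, return a node describing the match; else None.
        if pos < len(tokens) and tokens[pos] == self.text:
            return TerminalNode(pos, pos + 1, [tokens[pos]])
        return None

node = StrMatch("foo").parse(["foo", "bar"], 0)
# node.start == 0, node.end == 1, node.tokens == ["foo"]
```

The grammar object is reusable across inputs; each successful parse produces a fresh node carrying the positions.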
That's all for today !