ConceptLexerParser can now handle UnrecognizedTokens

This commit is contained in:
2019-12-26 15:20:45 +01:00
parent bcb2308ea5
commit 26daae4acf
8 changed files with 483 additions and 125 deletions
@@ -19,7 +19,7 @@ For those of you who don't know this old cartoon, it's the Odyssey story from Homer,
ported to the 31st century. Ulysses has a spacecraft with an AI named Shyrka.
I was a great fan of this cartoon when I was young. I thought that the idea of
-bringing the ancient story of Ulysses in the future was a bright.
+bringing the ancient story of Ulysses in the future was bright.
Ever since then, Sheerka was my reference for any sophisticated computer. Unfortunately
for me, at that time there was no wikipedia to tell me the correct spelling.
@@ -654,3 +654,99 @@ For the two questions, I will first try the simple implementations and see there
* the entry in sdp will not be all_number, but all_id_of_number. I will use the concept id instead of its name
2019-12-24
**********
Going back on BNF implementation. As it's Christmas eve today, I won't stay very long.
So, the implementation lies in the class ConceptLexerParser, as it's a lexer not for tokens, but for concepts.
The purpose of this class is to recognize a sequence of Concepts.
So if we define the following concepts
::
    def concept foo from bnf one two three
    def concept bar from bnf four five
when you input
::
    one two three four five
the list :code:`[foo, bar]` will be returned by the parser (as return values).
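A toy version of this recognition loop (a sketch with invented names, not the actual ConceptLexerParser code) could look like:

```python
# Toy concept lexer: turns a token sequence into a list of concept names.
# Sketch only -- `definitions` and `lex_concepts` are invented names.

definitions = {
    "foo": ["one", "two", "three"],  # def concept foo from bnf one two three
    "bar": ["four", "five"],         # def concept bar from bnf four five
}

def lex_concepts(text, definitions):
    tokens = text.split()
    result, pos = [], 0
    while pos < len(tokens):
        for concept, pattern in definitions.items():
            # Does some concept's pattern start at the current position?
            if tokens[pos:pos + len(pattern)] == pattern:
                result.append(concept)
                pos += len(pattern)
                break
        else:
            pos += 1  # no concept starts here: skip the token
    return result

print(lex_concepts("one two three four five", definitions))  # ['foo', 'bar']
```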
How does it work?
As explained in the code, my implementation is highly inspired by the Arpeggio project. To define your grammar, you
use **ParsingExpressions**. There are several types:
* some are used to recognize tokens: StrMatch, ConceptMatch
* others are used to tell how to combine them: Sequence, OrderedChoice, Optional, OneOrMore, ZeroOrMore...
Some examples:
::
    to recognize 'foo' -> StrMatch("foo")
    to recognize 'foo bar' -> Sequence(StrMatch("foo"), StrMatch("bar"))
    to recognize 'foo' or 'bar' -> OrderedChoice(StrMatch("foo"), StrMatch("bar"))
and so on...
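For illustration, here is a minimal sketch of what such parsing expressions can look like over a token list (a simplified, invented `match` API; not the actual implementation):

```python
# Minimal parsing expressions over a token list. Each `match` returns the
# position after the match on success, or None on failure. Sketch only.

class StrMatch:
    """Matches one token by its exact text."""
    def __init__(self, text):
        self.text = text

    def match(self, tokens, pos):
        if pos < len(tokens) and tokens[pos] == self.text:
            return pos + 1
        return None

class Sequence:
    """Matches all sub-expressions one after the other."""
    def __init__(self, *parts):
        self.parts = parts

    def match(self, tokens, pos):
        for part in self.parts:
            pos = part.match(tokens, pos)
            if pos is None:
                return None
        return pos

class OrderedChoice:
    """Tries each alternative in order, keeps the first that matches."""
    def __init__(self, *alternatives):
        self.alternatives = alternatives

    def match(self, tokens, pos):
        for alt in self.alternatives:
            end = alt.match(tokens, pos)
            if end is not None:
                return end
        return None

expr = Sequence(StrMatch("foo"), StrMatch("bar"))
print(expr.match(["foo", "bar"], 0))  # 2 -- the position after the match
```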
So when a concept is defined using its bnf definition, I use the **BnfParser** to create the grammar, and then
I use the **ConceptLexerParser** to recognize the concepts.
The current implementation to recognize a concept is not very efficient. All the definitions are in a dictionary
and I go through the whole dictionary to see if some concepts are recognized. Once a concept is found, I loop again
on the whole dictionary to find the next one.
| -> I need a btree to order the concepts
| -> I need a predictive algorithm to guess the next concept
But it is for later.
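One small step in that direction (just a sketch of the indexing idea, nothing that exists in the code) is to bucket the definitions by their first token, so that only a few candidates are tried at each position instead of the whole dictionary:

```python
from collections import defaultdict

def index_by_first_token(definitions):
    """Bucket concept patterns by the first token they expect.

    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for concept, pattern in definitions.items():
        index[pattern[0]].append((concept, pattern))
    return index

index = index_by_first_token({"foo": ["one", "two", "three"],
                              "bar": ["four", "five"]})
# At a position where the current token is "one", only the "foo"
# candidates need to be tried.
print(index["one"])  # [('foo', ['one', 'two', 'three'])]
```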
So once the parsing is done, I return a **ConceptNode** object
.. code-block:: python

    class ConceptNode(LexerNode):
        """
        Returned by the ConceptLexerParser
        It represents a recognized concept
        """
        def __init__(self, concept, start, end, tokens=None, source=None, underlying=None):
            super().__init__(start, end, tokens, source)
            self.concept = concept
            self.underlying = underlying
            if self.source is None:
                self.source = BaseParser.get_text_from_tokens(self.tokens)
concept
    | Remember that all grammars are listed in a dictionary of <Concept, ParsingExpression>.
    | So when a parsing expression is verified, it's easy to link it with the concept
start
    position of the first token
end
    position of the last token
tokens
    list of tokens that are recognized
underlying
    **NonTerminalNode** or **TerminalNode** that wraps the underlying **ParsingExpression** used to recognize the concept
source
    | The source is deduced from the tokens
    | But in the unit tests, it is given directly for speed and simplicity
What is the difference between the **[Non]TerminalNode** and the **ParsingExpression**?

The ParsingExpression
    defines how to recognize a concept
The [Non]TerminalNode
    represents what was found. So, similarly to the ConceptNode, you will find the start, end and tokens attributes
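To make that split concrete, a minimal sketch (invented `parse` API and simplified classes, not the actual code):

```python
class TerminalNode:
    """Result side: records *what* was found, and where."""
    def __init__(self, start, end, tokens):
        self.start, self.end, self.tokens = start, end, tokens

class StrMatch:
    """Grammar side: defines *how* to recognize one token."""
    def __init__(self, text):
        self.text = text

    def parse(self, tokens, pos):
        # On success, return a node describing the match; else None.
        if pos < len(tokens) and tokens[pos] == self.text:
            return TerminalNode(pos, pos + 1, [tokens[pos]])
        return None

node = StrMatch("foo").parse(["foo", "bar"], 0)
# node.start == 0, node.end == 1, node.tokens == ["foo"]
```

The grammar object is reusable across inputs; each successful parse produces a fresh node carrying the positions.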
That's all for today !