
Subject (Author)
* Please provide a learning path for mastering lexical analysis languages (Roger L Costello)
+* Re: Please provide a learning path for mastering lexical analysis languages (George Neuner)
|+- Re: Please provide a learning path for mastering lexical analysis languages (Paul B Mann)
|`- Re: Please provide a learning path for mastering lexical analysis languages (Christopher F Clark)
+- Re: Please provide a learning path for mastering lexical analysis languages (luser droog)
`- Re: Please provide a learning path for mastering lexical analysis languages (gah4)

Please provide a learning path for mastering lexical analysis languages

<22-05-010@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=378&group=comp.compilers#378

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: costello@mitre.org (Roger L Costello)
Newsgroups: comp.compilers
Subject: Please provide a learning path for mastering lexical analysis languages
Date: Fri, 6 May 2022 11:59:17 +0000
Organization: Compilers Central
Lines: 28
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-010@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="92650"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, question
Posted-Date: 06 May 2022 12:13:49 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
Content-Language: en-US
 by: Roger L Costello - Fri, 6 May 2022 11:59 UTC

Hi Folks,

I want to master lexical analysis.

In Chris Clark's last post he said:

> [Flex] is not even as powerful as some other lexical analysis languages
> and even exploiting its power often requires things that cannot be
> expressed in Flex alone nor can they be done in ways that are simple
> and easy to reason about.

I am currently in the process of mastering Flex. I feel that mastering Flex
will give me a foundation for learning other more advanced lexical analysis
languages. "This XYZ lexical analysis language does it this way, Flex does it
this other way. Ah, yes, I can see how XYZ's way is simpler and more
powerful."

From Chris's post I see that there is much to learn beyond Flex. Thank you
Chris.

Can you provide a learning path, please? A learning path for mastering lexical
analysis languages.

After I master Flex, what lexical analysis language should I then master? And
after mastering that, what is the next lexical analysis language that I should
master?

/Roger

Re: Please provide a learning path for mastering lexical analysis languages

<22-05-023@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=385&group=comp.compilers#385

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gneuner2@comcast.net (George Neuner)
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 08 May 2022 12:52:14 -0400
Organization: A noiseless patient Spider
Lines: 121
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-023@comp.compilers>
References: <22-05-010@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="39045"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 08 May 2022 14:45:37 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: George Neuner - Sun, 8 May 2022 16:52 UTC

On Fri, 6 May 2022 11:59:17 +0000, Roger L Costello
<costello@mitre.org> wrote:

>I want to master lexical analysis.
>
>In Chris Clark's last post he said:
>
>> [Flex] is not even as powerful as some other lexical analysis languages
>> and even exploiting its power often requires things that cannot be
>> expressed in Flex alone nor can they be done in ways that are simple
>> and easy to reason about.
>
>I am currently in the process of mastering Flex. I feel that mastering Flex
>will give me a foundation for learning other more advanced lexical analysis
>languages. "This XYZ lexical analysis language does it this way, Flex does it
>this other way. Ah, yes, I can see how XYZ's way is simpler and more
>powerful."
>
>From Chris's post I see that there is much to learn beyond Flex. Thank you
>Chris.
>
>Can you provide a learning path, please? A learning path for mastering lexical
>analysis languages.

It's not the grammars (languages) used by the tools that you need to
learn - it's the methods used by the tools that are important.

>After I master Flex, what lexical analysis language should I then master? And
>after mastering that, what is the next lexical analysis language that I should
>master?
>
>/Roger

Lexical analysis is not terribly interesting ... it really is just
about tokenizing the input. There are two main techniques: regular
expressions, and recursive descent code. In practice, these may be
combined.

Flex essentially is regex plus a pushdown stack. The stack offers
some limited ability to recognize simple 'inclusion' languages that
regex can't handle alone, but in the Chomsky sense it does not add
very much.
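
For a concrete picture of that stack, here is a minimal Flex sketch
(mine, not from the post; string interpolation is an invented example
of the kind of nested "inclusion" that needs a stack of states):

%option stack noyywrap
%x STR INTERP
%%
\"                yy_push_state(STR);     /* enter the string sub-language */
<STR>\"           yy_pop_state();         /* return to wherever we came from */
<STR>"${"         yy_push_state(INTERP);  /* code embedded in a string... */
<INTERP>"}"       yy_pop_state();         /* ...and back to the string */
<INTERP>\"        yy_push_state(STR);     /* strings nest inside that code */
<STR>.|\n         ;                       /* string text (collect it for real) */
<INTERP>.|\n      ;                       /* interpolated code (tokenize for real) */

A single start condition could not remember where to return to once
strings and code nest inside each other; that is exactly what the
stack buys you.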

You can look at ANTLR for an example of RD lexing. There are others.

The main difference between regex and RD is that regex is an unguided
method which simply changes state in the match automaton in response
to incoming characters. In contrast, the incoming character in RD
guides the path taken through the matching code: e.g.,

ch = next();        /* next() here stands for "read one character" */
if ('h' == ch) {
    ch = next();
    if ('e' == ch) {
        ch = next();
        if ('l' == ch) {
            ch = next();
            if ('p' == ch) {
                /* recognized "help" */
            } else if ('l' == ch) {
                /* on our way to "hello" ... */
            }
        }
    }
}

In practice real RD code often matches whole words or word prefixes
rather than individual characters as shown here, but the principle is
the same. And sometimes regexes are used to perform the word matches.

Grammatical syntax analysis - i.e. parsing - is where things get
interesting.

There are 3 typical methods of parsing: LL, LR, and PEG.

LL and PEG typically are associated with RD code. LR typically is
associated with table driven pushdown automata.
[I do recall seeing a paper in which the LR automaton was /implemented/
using RD code in CPS, but naturally I can't find that paper just now.
It really just proves that there are more ways to implement FA than
many people realize 8-). I have not similarly seen an LL parser
implemented with table or graph FA/PDA, but I suspect that could be
done as well. Chris Clark probably knows better.]

You may also see references to 'combinator' parsing, which some
people consider to be a unique method, but which more typically is
associated with PEG. IMO it is simply an implementation technique that
is equally applicable to LL, but YMMV on this point because - although
their implementations look similar - there are some not-insignificant
technical differences between LL and PEG.

Bison is an LR-based tool strongly associated with Flex. Bison offers
a few different variants of LR (LALR, SLR, GLR) that differ in which
languages they can recognize and (where they overlap) in the sizes of
the generated parsers. Parser size was much more important in the
past - nowadays it's only relevant to particular programming niches
such as embedded.

ANTLR is an LL-based tool that generates RD code for both parsers and
lexers. Since v3, ANTLR is LL(*); prior to v3, ANTLR was LL(k). Both
LL(k) and LL(*) are interesting in their own right and you may wish to
look at both.
[Many LL tools only are LL(1). LL(k) and LL(*) both deal with
management of lookahead. LL(k) was created to deal with limitations
of LL(1), and LL(*) relieves the programmer of having to determine the
proper 'k' for the grammar.]
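
To make the lookahead issue concrete, here is a made-up fragment in
ANTLR-style notation where both alternatives begin with an arbitrarily
long shared prefix, so no fixed k suffices and LL(*) instead scans
ahead adaptively:

// hypothetical rules; ID, expr, and args assumed defined elsewhere
stmt  : qname '=' expr ';'       // assignment
      | qname '(' args ')' ';'   // call
      ;
qname : ID ('.' ID)* ;           // a.b.c.d... can be any length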

The biggest difference is in how you deal with recursive structure in
the language: in LL (or PEG) you must eliminate left recursion and
write right-recursive rules; in LR you typically write left-recursive
rules instead. The direction of the recursion determines the order in
which recursive structures are recognized.
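
For instance, in yacc/Bison-style notation (rule names invented):

/* LR (Bison): left recursion is natural and uses constant stack */
list : item
     | list ',' item ;

/* LL or PEG (recursive descent): left recursion would loop forever,
   so the same language is written with right recursion instead */
list : item
     | item ',' list ;

The first reduces the list left to right; the second recognizes the
tail last, which is why the two families "see" recursive structure in
opposite orders.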

I personally have little experience with existing PEG tools ...
hopefully someone can give you some suggestions.

Combinator parsers usually are hand-written using a library of
matching (or match-creating) functions. Most often combinator parsers
are seen in [higher-order] functional languages, where the functions
can easily be strung together, but there are at least a few libraries
available for C and C++.
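
A toy illustration of the idea in C (names invented; real libraries
also carry results and state around, while this only threads the
unconsumed input):

#include <stddef.h>

/* A parser consumes a prefix of the input: it returns a pointer to
   the unconsumed rest on success, or NULL on failure. */
typedef const char *(*parser_fn)(const char *in);

static const char *lit_a(const char *in) { return *in == 'a' ? in + 1 : NULL; }
static const char *lit_b(const char *in) { return *in == 'b' ? in + 1 : NULL; }

/* seq: run p, then q on whatever p left over. */
static const char *seq(parser_fn p, parser_fn q, const char *in)
{
    const char *rest = p(in);
    return rest ? q(rest) : NULL;
}

/* alt: try p; on failure, back up and try q. This ordered choice is
   exactly where PEG semantics differ from LL's FIRST-set dispatch. */
static const char *alt(parser_fn p, parser_fn q, const char *in)
{
    const char *rest = p(in);
    return rest ? rest : q(in);
}

Here seq(lit_a, lit_b, in) succeeds on input starting with "ab" and
returns the input after it; alt(lit_a, lit_b, in) accepts either
character.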

Hope this helps.
George

Re: Please provide a learning path for mastering lexical analysis languages

<22-05-026@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=386&group=comp.compilers#386

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: luser.droog@gmail.com (luser droog)
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 8 May 2022 18:08:24 -0700 (PDT)
Organization: Compilers Central
Lines: 53
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-026@comp.compilers>
References: <22-05-010@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="66170"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, parse
Posted-Date: 13 May 2022 13:02:41 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-05-010@comp.compilers>
 by: luser droog - Mon, 9 May 2022 01:08 UTC

On Friday, May 6, 2022 at 11:13:52 AM UTC-5, Roger L Costello wrote:
> Hi Folks,
>
> I want to master lexical analysis.
>
> In Chris Clark's last post he said:
>
> > [Flex] is not even as powerful as some other lexical analysis languages
> > and even exploiting its power often requires things that cannot be
> > expressed in Flex alone nor can they be done in ways that are simple
> > and easy to reason about.
>
> I am currently in the process of mastering Flex. I feel that mastering Flex
> will give me a foundation for learning other more advanced lexical analysis
> languages. "This XYZ lexical analysis language does it this way, Flex does it
> this other way. Ah, yes, I can see how XYZ's way is simpler and more
> powerful."
>
> From Chris's post I see that there is much to learn beyond Flex. Thank you
> Chris.
>
> Can you provide a learning path, please? A learning path for mastering lexical
> analysis languages.
>
> After I master Flex, what lexical analysis language should I then master? And
> after mastering that, what is the next lexical analysis language that I should
> master?
>
> /Roger

I'd like to echo George Neuner's remark:

> it's the methods used by the tools that are important.

In my opinion the best way to truly "master"* such tools is to step
away from the tools per se and delve into the algorithms and data
structures involved. The best book for this (again, in my opinion) is
Marvin Minsky's "Computation: Finite and Infinite Machines". He has a
whole chapter about Regular Languages and their association with
Finite State Automata and the algorithms for translating back and
forth between a Regular Expression and its corresponding Finite State
Automaton.
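
As a small taste of that correspondence, here is a hand sketch (mine,
not Minsky's) of the regular expression letter (letter | digit)*
compiled into a switch-driven finite state automaton in C:

#include <ctype.h>

/* States: START = nothing seen, IN_ID = accepting, DEAD = reject. */
enum { START, IN_ID, DEAD };

static int step(int state, int ch)
{
    switch (state) {
    case START: return isalpha(ch) || ch == '_' ? IN_ID : DEAD;
    case IN_ID: return isalnum(ch) || ch == '_' ? IN_ID : DEAD;
    default:    return DEAD;
    }
}

/* Accept iff the automaton ends in an accepting state. */
static int is_identifier(const char *s)
{
    int state = START;
    for (; *s; s++)
        state = step(state, (unsigned char)*s);
    return state == IN_ID;
}

Every regular expression can be mechanically compiled into such a
table (or switch), which is the translation Minsky's chapter works
through in both directions.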

And to really get a handle on when to use flex vs bison vs whatever
other tool, you'll also need to know something about where Regular
Languages sit in the Chomsky Hierarchy of Phrase Structured Languages.
I'm not sure what the best resource is for this part. Chomsky's
writings on linguistics are formidable for the non-specialist (even
for very motivated amateurs) so I don't really recommend going
straight to the source. Maybe the intro chapter from the Dragon Book.
Or whatever Aho or Ullman has written on it.

* of course using my own private definition of mastery.

Re: Please provide a learning path for mastering lexical analysis languages

<22-05-027@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=387&group=comp.compilers#387

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: parser.generator.guy@gmail.com (Paul B Mann)
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 8 May 2022 22:27:55 -0700 (PDT)
Organization: Compilers Central
Lines: 102
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-027@comp.compilers>
References: <22-05-010@comp.compilers> <22-05-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="66705"; mail-complaints-to="abuse@iecc.com"
Keywords: lex
Posted-Date: 13 May 2022 13:03:24 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-05-023@comp.compilers>
 by: Paul B Mann - Mon, 9 May 2022 05:27 UTC

/* Token Rules */

<eof>        -> \z

<constant>   -> literal
             -> integer
             -> decimal
             -> real

<identifier> -> letter (letter|digit)*

integer      -> digit+

real         -> integer exp
             -> decimal exp

decimal      -> digit+ '.'
             -> '.' digit+
             -> digit+ '.' digit+

exp          -> 'e' digit+
             -> 'E' digit+
             -> 'e' '-' digit+
             -> 'E' '-' digit+
             -> 'e' '+' digit+
             -> 'E' '+' digit+

literal      -> ''' lchar '''

lchar        -> lany
             -> '\' '\'
             -> '\' '''
             -> '\' '"'
             -> '\' 'n'
             -> '\' 't'
             -> '\' 'a'
             -> '\' 'b'
             -> '\' 'f'
             -> '\' 'r'
             -> '\' 'v'
             -> '\' '0'

<string>     -> '"' schar* '"'

schar        -> sany
             -> '\' '\'
             -> '\' '''
             -> '\' '"'
             -> '\' 'n'
             -> '\' 't'
             -> '\' 'a'
             -> '\' 'b'
             -> '\' 'f'
             -> '\' 'r'
             -> '\' 'v'
             -> '\' '0'

{whitespace}   -> whitechar+

{commentline}  -> '/' '/' neol*

{commentblock} -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

/* Character Sets */

any       = 0..255 - \z
lany      = any - ''' - '\' - \n
sany      = any - '"' - '\' - \n

letter    = 'a'..'z' | 'A'..'Z' | '_'
digit     = '0'..'9'

whitechar = \t | \n | \r | \f | \v | ' '

na        = any - '*'        // not asterisk
nans      = any - '*' - '/'  // not asterisk, not slash
neol      = any - \n         // not end of line

\t = 9   // tab
\n = 10  // newline
\v = 11  // vertical tab
\f = 12  // form feed
\r = 13  // carriage return
\z = 26  // end of file
\b = 32  // blank/space

/* End */

The above lexical rules define C-language symbols.
It's just a lexical grammar, not too hard to figure out.
This is input to the DFA lexer generator, which is provided
with the LRSTAR parser generator on SourceForge.net.

DFA creates lexers that run 80% faster than "flex" lexers
and are about the same size.

If you need more language power to define a lexer ...
that's what parsers are for.

BTW, LRSTAR creates parsers in C++ that were running 140 times
faster than those created by ANTLR, using the C++ target, the last
time I did a comparison, two years ago.

Re: Please provide a learning path for mastering lexical analysis languages

<22-05-030@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=388&group=comp.compilers#388

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: gah4@u.washington.edu (gah4)
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Fri, 13 May 2022 13:42:30 -0700 (PDT)
Organization: Compilers Central
Lines: 37
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-030@comp.compilers>
References: <22-05-010@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="61836"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, practice
Posted-Date: 13 May 2022 20:07:27 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
In-Reply-To: <22-05-010@comp.compilers>
 by: gah4 - Fri, 13 May 2022 20:42 UTC

On Friday, May 6, 2022 at 9:13:52 AM UTC-7, Roger L Costello wrote:

> I want to master lexical analysis.

There is a story that there is someone at Microsoft whose only job
is the "START" button in the corner.

Does a large compiler company, or compiler group of a large company, have one
person specializing in lexical analysis?

I would have thought that people would work on both lexical analysis
and syntax analysis (parsers), and not specialize in just one.

I might believe that optimization and code generation are different enough,
that someone could specialize in one or the other.

Compiler courses normally teach all parts of compiler writing,
and some compilers are written by one person, or a small group.

Much of the story in Brooks' "The Mythical Man-Month" is about the
problem of how to divide up the work for a large software project.
OS/360 was close to the transition between writing operating systems
(including compilers) in assembler and writing them in high-level
languages.

There is the story (I believe from Brooks) about the OS/360 linkage
editor, the finest linker ever designed, with the ability to edit
already-linked load modules - just at the time when everyone
recompiled programs every time and didn't need the ability to
partially recompile them.
[As we've seen in recent discussions, the line between the lexer and
the parser can vary a lot and it doesn't make sense to think about one
without the other.

Brooks was referring to the static overlays the linker created, which were
indeed very lovely and were completely obsolete when large physical memory
and virtual addressing came along. "Mythical Man-Month", pages 56-57.
In about 1970 they added a loader that could load and go without writing
a linked module on disk. -John]

Re: Please provide a learning path for mastering lexical analysis languages

<22-05-043@comp.compilers>
https://www.rocksolidbbs.com/devel/article-flat.php?id=399&group=comp.compilers#399

Newsgroups: comp.compilers
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.misty.com!news.iecc.com!.POSTED.news.iecc.com!nerds-end
From: christopher.f.clark@compiler-resources.com (Christopher F Clark)
Newsgroups: comp.compilers
Subject: Re: Please provide a learning path for mastering lexical analysis languages
Date: Sun, 22 May 2022 12:12:32 +0300
Organization: Compilers Central
Lines: 169
Sender: news@iecc.com
Approved: comp.compilers@iecc.com
Message-ID: <22-05-043@comp.compilers>
References: <22-05-010@comp.compilers> <22-05-023@comp.compilers>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="48522"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, design
Posted-Date: 22 May 2022 13:00:55 EDT
X-submission-address: compilers@iecc.com
X-moderator-address: compilers-request@iecc.com
X-FAQ-and-archives: http://compilers.iecc.com
 by: Christopher F Clark - Sun, 22 May 2022 09:12 UTC

My apologies for this delayed reply. The startup I am working for is
nearing the alpha release and I had things on my plate that I needed
to give priority.

Let me start my answer with two simple rules of thumb, in order of importance.

1) Keep your lexer simple. Follow Einstein's famous suggestion (on a
slightly different topic): make it as simple as possible, but no
simpler.

2) Respect the lexer/parser boundary. Try to avoid mixing the two
phases. That will help with item 1.

In that regard, don't look for a more powerful lexing mechanism when
the job should be done in the parser. The exception is when it makes
the combination simpler (i.e. you make your lexer just a little more
complicated, but it makes the parser much simpler, or vice versa).

So, I wouldn't go looking for lots of ways to make the lexer more
powerful. Don't go looking for a flame thrower to light your barbecue
grill.

If you are seriously interested in regular expressions, I would
recommend Friedl's book: Mastering Regular Expressions. That talks
about nearly all the variations at a detailed level. But realize that
the PCRE extensions (e.g. backreferences and zero-width assertions)
are not generally available in lexical analysis tools.

Those mechanisms are often overkill. In Snort, someone used them for
matching quotes around expressions. Simpler would have been to write
two rules, one for matched 's and one for matched "s. It probably
would have run faster too. But someone saw that Snort rules allowed
PCRE expressions and thought they would be clever.
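
In Flex-style notation those two rules might look like this (token
names invented):

'([^'\\\n]|\\.)*'      return SQ_STRING;  /* matched single quotes */
\"([^"\\\n]|\\.)*\"    return DQ_STRING;  /* matched double quotes */

Each rule is an ordinary regular expression the DFA handles in one
pass, with no backtracking and no PCRE machinery.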

That goes to point 1. Write simple rules for each token. And, it is
often better to have more rules that all generate the same token than
to try and make one complicated regular expression that somehow
combines them all. And, if you find yourself writing a complicated
regular expression or a list of more than 5 different rules for the
same token, think twice about the problem you are solving. You might
be doing the wrong thing at the wrong level. (Worth noting that this
point applies to language design, too. If your language design starts
forcing you to invent new technologies or use complicated existing
ones, ask whether you are actually making the language simple to use;
often that's an indication of a mistake in the design. Now, if you
are implementing an existing language, you may have no choice, but
realize that the original designer may have made a mistake and you
are just dealing with it.)

By the way, don't be afraid to borrow or steal good ideas and reuse
them. I have essentially written one lexer in Yacc++ and use it for
all languages with only minor tweaks. Now, I'm not trying to parse
Fortran 66 with it, but I have used it for Yacc++ itself, C/C++/C#
dialects, Verilog, Pascal, Javascript, BASIC, and lots of tiny tools.
I will likely use it for Rust with only the smallest of tweaks. I
reuse that lexer for nearly everything.

The structure of that lexer is quite simple.

Only a small set of rules are actual regular expressions:
identifiers, numbers (integers and floats, distinguished), strings,
whitespace, and comments.

Punctuation is done by simple rules listing the legal "operators".
Note, if I had operators like "&amp;" I would make a regular
expression for that in the first set.

Keywords I handle separately. This is based upon advice Frank DeRemer
once gave: the lexer should be a scanner (the regular expression
part) and a screener (looking for keywords and the like). Those are
two separate tasks, and splitting them often makes both simpler. (In
fact, it's one of my few gripes about ANTLR4 that it doesn't make
doing that easy. We built that split scanner/screener model into
Yacc++ and it was one of our better choices. You can make a more
powerful screener without making the scanner more complex, and vice
versa. That goes back to the idea of making things simple by doing
only one thing at a time.)
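
The screener's simplest form, sketched in C with invented token codes:

#include <stddef.h>
#include <string.h>

/* The scanner only knows "identifier"; this screener upgrades the
   identifiers that happen to be keywords. */
enum tok { TOK_IDENT, TOK_IF, TOK_WHILE, TOK_RETURN };

static const struct { const char *text; enum tok tok; } keywords[] = {
    { "if", TOK_IF }, { "while", TOK_WHILE }, { "return", TOK_RETURN },
};

enum tok screen_identifier(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].tok;
    return TOK_IDENT;   /* not a keyword: stays an identifier */
}

Adding a keyword is one table entry; the scanner's automaton never
changes.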

--------

Now, since I know you are working on HTML or something similar, I will
give one piece of advice where you want to use something a little more
powerful. In markdown type languages, you often have two distinct
portions, the markdown which is a little language that you want to
tokenize and the rest of the text which you don't want to tokenize (or
at least not the same way). That does suggest using lexer states (or
some similar mechanism). You are really writing two lexers. One for
inside markdown and another for the rest of the text. Don't try to
mix them. The markdown is usually clearly marked e.g. between < and >
or on lines that start with a special character (#) or something
similer. So, markdown is one lexer state (or complete lexer) with one
set of rules, while the rest of text is another lexer with a different
set of rules, and the only tricky part is knowing when to switch
between them (and in a good language, it is actually obvious).

But, notice how that is a different implementation of separation of
concerns. You have two lexers (or two lexer states). Each one doing
one job and doing it simply. The hard part is the switching between
them and that's not usually a complex regular expression, but the
requirement to do something that is more "infrastructure" related.
And, see how it answers your question about how to not lex "&amp;" in
text outside of markdown. You are simply using a different lexer
[state]. It doesn't know about "&amp;" and doesn't attempt to
tokenize it.
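
A skeletal version of those two lexers as Flex start conditions (the
token codes and the "<"/">" convention are placeholders for whatever
the real markup uses):

%option noyywrap
%{
enum { OPEN = 258, CLOSE, NAME, TEXT };   /* invented token codes */
%}
%x MARKUP
%%
"<"                { BEGIN(MARKUP); return OPEN; }    /* switch to the markup lexer */
<MARKUP>">"        { BEGIN(INITIAL); return CLOSE; }  /* back to the text lexer */
<MARKUP>[a-zA-Z]+  { return NAME; }                   /* real tokenizing happens here */
<MARKUP>[ \t\n]+   ;                                  /* markup skips whitespace */
[^<]+              { return TEXT; }                   /* plain text: one untokenized blob */

Each state has its own small rule set; neither needs to know the
other's rules, only where the boundary is.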

------

Since I mentioned screeners above, I will talk a bit about them. The
simplest model is a function you can call upon matching an identifier
(or another token you want to screen; e.g. maybe you have comments
that can be "Javadocs" or decorations and you want to process them).
This function can look the token up (or lex and parse it) and decide
to change the token type (or even insert multiple tokens) before
handing the token from the lexer to the parser. Note, if the function
is complicated, you might be essentially implementing a preprocessor.
(So, "#.*\n" could be recognized as a token, but the screener for it
is the "C preprocessor".) The point here is that it is sometimes
useful to insert something between the lexer and the parser (and
maybe only for specific tokens). Another use of a screener is the C
"typedef hack", where you look your identifier up in the symbol table
and see if it is a typedef and return a different token for it in
that case. That allows your screener to take feedback from the parser
without exposing it down to the scanner. Still, notice the separation
of concerns to keep each part simpler.
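
Sketched in C (symtab_is_typedef is a placeholder for the lookup the
parser maintains; the token codes are invented):

/* Relabel an identifier that currently names a type, so the grammar
   can tell "T * x;" (declaration) from "a * b;" (expression). */
extern int symtab_is_typedef(const char *name);

enum { TOK_IDENT = 258, TOK_TYPENAME = 259 };

int screen_typedef(const char *lexeme)
{
    return symtab_is_typedef(lexeme) ? TOK_TYPENAME : TOK_IDENT;
}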

-----

Now, once you have a simple base lexer you are often done (or at
least mostly done). Then you can worry about the "mistakes". Comment
processing often needs a more complicated expression than one would
like. It's worse if you allow nested comments that need balanced
delimiters, e.g. "/*" and "*/". Balanced delimiters form a Dyck
language and take you out of the regular-expression world. You can do
them with lexer states (if you can stack them), but it is easier if
you simply have a CFG language (LL, LR, or PEG) to do so. Captures
and backreferences can sometimes do it, provided they have a
stack-like quality.
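
When every nesting level is identical, the stacked lexer state
degenerates to a depth counter; a Flex sketch (mine, for
illustration) of nested block comments:

%option noyywrap
%x COMMENT
%{
static int depth;   /* comment nesting depth */
%}
%%
"/*"            { depth = 1; BEGIN(COMMENT); }
<COMMENT>"/*"   depth++;                              /* one level deeper */
<COMMENT>"*/"   { if (--depth == 0) BEGIN(INITIAL); } /* close one level */
<COMMENT>.|\n   ;                                     /* skip comment text */

A true stack (yy_push_state) is only needed when the nested thing is
a different sub-language rather than another copy of the same one.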

We had a similar issue in Yacc++: we handle nested braces as a single
token, "code". That makes the parser simpler. That token has matched
brackets, strings, and comments inside. It is much more complicated
than one really wants; it is essentially a mini-parser inside our
lexer. And it means we can't really use braces as tokens themselves.
It's a sign that as language designers we made a mistake. It probably
would have been better to use a lexical state model: see a brace "{"
and go into a separate lexer to handle it. Sometimes having a flame
thrower is not the right solution.

------

A slightly different mistake was made in the original yacc design. It
made semicolons optional at the end of a rule. This means you
[sometimes] need to see "identifier:" to recognize the start of the
next rule, which makes the lexer more complex. Other languages have
played with treating newline as a semicolon, which has its own
issues, but those are often at the parser level, not the lexer.

I'm sure I have more to say on this topic, but also I have said more
than enough....

Kind regards,
Chris

******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
