Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
592 views
in Technique[技术] by (71.8m points)

parsing - Is C#'s lambda expression grammar LALR(1)?

The question I wish to ask is succintly given in the title. Let me give an example of the grammar in question:

identifier_list
    : identifier
    | identifier_list identifier;

lambda_arguments
    : '(' identifier_list ')'
    | identifier;

lambda
    : lambda_arguments '=>' expression

Then we add in the normal C expression grammar- particularly,

primary_expression
    : '(' expression ')'
    | identifier
    | lambda;

The real question is, is this grammar LALR(1) parsable, i.e., capable of being parsed by automated parser generators? Or does it require a hand-rolled or GLR parser? Note that I wish to know specifically about this subsection, not the context-sensitive keywords stuff or any other section.

What I'm thinking right now is that if the parser sees '(' identifier ')', this has two valid parses, so if the parser sees identifier, looks ahead to ')', it won't be able to decide which parse tree to go down. This could just be a shift/reduce conflict, though, that I could eliminate by assigning some arbitrary precedence (probably favouriing '(' identifier ')' ).

Edit: Actually, I was considering stealing using this subsection of the grammar for a similar feature in a new language. I already have anonymous functions similar to JavaScript in grammatical form but my guinea pigs squeaks feedback complains they are too verbose for many uses, and pointed out the C# lambda expressions as a more ideal solution. I was concerned about potential ambiguity resulting from this solution. So, really, I was only interested in that subsection. The other stuff like generics and casts are non-issues for me.

Previous editions of my grammar are mechanically parsable and I wouldn't want to lose this property, and my previous experience with a mechanical generator tells me that it's best to check here rather than try myself. For my hand-rolled parser, I could of course simply special case '(' identifier to look ahead a bit further than normal.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First off, parser theory was always one of my weak points. I mostly work on semantic analyzers.

Second, all the C# parsers I've ever worked on have been hand-generated recursive descent parsers. One of my former colleagues who does have a strong background in parser theory did build his own parser generator and fed the C# grammar into it successfully, but I do not know what sort of egregious hacks doing so entailed.

So what I'm saying here is to take this answer with the appropriate amount of skepticism.


As you note, lambdas are slightly vexing because you've got to be careful about that parenthesized expression -- it might be a parenthesized expression, a cast operator or a lambda parameter list, and the lambda parameter list could be in several different forms. But all things considered, adding lambdas to C# 3.0 was relatively easy, grammatically; hacking up the parser was not too difficult -- it was the semantic analysis that was a bear for lambdas.

The real vexing problems in the C# grammar as far as look-ahead goes are generics and casts.

Generics were added in C# 2, after the language already had >>, > and < operators, all of which can cause weird problems when you throw generics into the mix.

The classic problem is of course A ( B < C, D > ( E ) ) Does the invocation of method A take two arguments: B < C and D > (E) or one, B<C,D>( E )?

The rule to disambiguate is:

If a sequence of tokens can be parsed as a simple-name, member-access, or pointer-member-access ending with a type-argument-list, the token immediately following the closing > token is examined. If it is one of ( ) ] : ; , . ? == != then the type-argument-list is retained as part of the simple-name, member-access or pointer-member-access and any other possible parse of the sequence of tokens is discarded. Otherwise, the type-argument-list is not considered part of the simple-name, member-access or pointer-member-access, even if there is no other possible parse of the sequence of tokens.

The second problem with the grammar goes back to C# 1.0, and that's the cast operator. The problem is that (x)-y could mean "cast -y to type x" or it could mean to subtract y from x. The rule here is:

A sequence of one or more tokens enclosed in parentheses is considered the start of a cast-expression only if at least one of the following are true:

The sequence of tokens is correct grammar for a type, but not for an expression.

The sequence of tokens is correct grammar for a type, and the token immediately following the closing parentheses is the token "~", the token "!", the token "(", an identifier, a literal, or any keyword except as and is.

The rules that disambiguate both cases involve potentially large look-aheads in theory, but in practice you very rarely have to back up the parser very far.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...