
If I run your grammar with ANTLR 4.6, it does not stop parsing: it parses the whole file and displays in pink what the parser can't match:

And there is an important error message:

line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}

As I explained here, as soon as you get a "mismatched" error, add -tokens to the grun command.

"column a"*"column b"
"column a" * "column b"
$ grun Formula expr -tokens -diagnostics t2.text
[@0,0:0='"',<'"'>,1:0]
[@1,1:8='column a',<COLUMN>,1:1]
[@2,9:9='"',<'"'>,1:9]
[@3,10:12=' * ',<COLUMN>,1:10]
[@4,13:13='"',<'"'>,1:13]
[@5,14:21='column b',<COLUMN>,1:14]
[@6,22:22='"',<'"'>,1:22]
[@7,24:23='<EOF>',<EOF>,2:0]
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}

You immediately see that " * " is interpreted as COLUMN.
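The asker's grammar isn't shown, but a COLUMN lexer rule written too broadly is the usual culprit; the rule name COLUMN comes from the token dump above, and everything else in this sketch is an assumption:

```antlr
// Hypothetical sketch, not the asker's actual grammar.
// A lexer rule like
//   COLUMN : ~'"'+ ;
// matches ANY run of non-quote characters, so the text ' * '
// between the closing and opening quotes also becomes a COLUMN
// token. Making the quotes part of the token (and skipping
// whitespace) removes the ambiguity:
COLUMN : '"' ~'"'+ '"' ;
WS     : [ \t\r\n]+ -> skip ;
```

With the quotes inside the token, the lexer can no longer swallow the operator between two column references.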

Many questions about matching input with lexer rules have been asked in the last few days:

So many, in fact, that Lucas posted a self-answered question just to summarize the whole problem: disambiguate.

grammar - Antlr4 unexpectedly stops parsing expression - Stack Overflo...

grammar antlr4

The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it, so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.

Mod up for mentioning ANTLR. I had looked at this a little while back, but forgot about using it as a lex/yacc replacement. If the C/C++ grammar is good, this may be my favorite path...

I don't know why this is the accepted answer now, or why it was accepted originally. The ANTLR grammar for C++ has never been used in practice, as far as I know, and I keep track of stuff like this. The author of the grammar left footprints in the docs saying, "It's incomplete, I'm done with it, you can patch it up if you want." C++98 is a tough language, C++11 is worse, and then there is a bunch of dialects (GCC, Microsoft, Sun, ...). If you don't have the parser right, what you have is just useless. Then you need full name and type resolution to do anything, and there's nothing here for that.

Good tools for creating a C/C++ parser/analyzer - Stack Overflow

c++ c parsing yacc lex

There are a lot of advantages to using a parser generator like bison or antlr, particularly while you're developing a language. You'll undoubtedly end up making changes to the grammar as you go, and you'll want to end up with documentation of the final grammar. Tools that produce the parser directly from the documented grammar are really useful. They also can help give you confidence that the grammar of the language is (a) what you think it is and (b) not ambiguous.

If your language (unlike C++) is actually LALR(1), or even better, LL(1), and you're using LLVM tools to build the AST and IR, then it's unlikely that you will need to do much more than write down the grammar and provide a few simple actions to build the AST. That will keep you going for a while.

The usual reason that people eventually choose to build their own parsers, other than the "real programmers don't use parser generators" prejudice, is that it's not easy to provide good diagnostics for syntax errors, particularly with LR(1) parsing. If that's one of your goals, you should try to make your grammar LL(k) parseable (it's still not easy to provide good diagnostics with LL(k), but it seems to be a little easier) and use an LL(k) framework like Antlr.

There is another strategy, which is to first parse the program text in the simplest possible way using an LALR(1) parser, which is more flexible than LL(1), without even trying to provide diagnostics. If the parse fails, you can then parse it again using a slower, possibly even backtracking parser, which doesn't know how to generate ASTs, but does keep track of source locations and try to recover from syntax errors. Recovering from syntax errors without invalidating the AST is even more difficult than just continuing to parse, so there's a lot to be said for not trying. Also, keeping track of source locations is really slow, and it's not very useful if you don't have to produce diagnostics (unless you need it for adding debugging annotations), so you can speed the parse up quite a bit by not bothering with location tracking.

Personally, I'm biased against packrat parsing, because it's not clear what the actual language parsed by a PEG is. Other people don't mind this so much, and YMMV.

Why is it "not clear" what the actual language is? PEG is well-defined, even with all the cool hacks that packrat parsing allows (higher-order parsing and such).

@SK-logic: well-defined is not the same as clear. A hand-crafted parser written in C++ is well-defined. A Turing machine is well-defined. Yes, PEG is well-defined. But for all of them, the only way to see if a given string is in the language is to execute the code. (Of those three alternatives, PEG is the least bad, imo. But I still prefer formal context free grammars. However, as I said, other people like PEG, and whatever works for you is cool with me.)
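A small sketch may make the point concrete (the combinator names here are invented for the example): in a PEG, ordered choice commits to the first matching alternative, so the language accepted can differ from what the same rule would accept as a CFG, and you only find out by running it.

```python
# Minimal PEG-style ordered-choice matcher, illustrating why the
# language a PEG accepts can be non-obvious. All names are invented
# for this sketch.

def choice(*alts):
    """Ordered choice: commit to the FIRST alternative that matches."""
    def parse(text, pos):
        for alt in alts:
            r = alt(text, pos)
            if r is not None:
                return r
        return None
    return parse

def lit(s):
    """Match a literal string, returning the new position or None."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)
        return None
    return parse

def matches(rule, text):
    """True if the rule consumes the entire input."""
    return rule(text, 0) == len(text)

# CFG  A -> "a" | "ab"  accepts both "a" and "ab".
# PEG  A <- "a" / "ab"  commits to "a" and never tries "ab":
rule = choice(lit("a"), lit("ab"))
print(matches(rule, "a"))   # True
print(matches(rule, "ab"))  # False: "a" matched, "b" is left over
```

Reordering the alternatives (`"ab" / "a"`) restores the expected behavior, which is exactly the kind of operational reasoning a formal CFG does not require.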

From my practical experience, PEGs are the clearest and easiest grammars to read. I can translate a language spec straight into a PEG with very few modifications. It is possible to obfuscate one, of course, but I have not seen a really bad PEG grammar yet, whereas there are many Yacc grammars that are unreadable beyond any hope.

c++ - LLVM JIT Parser writing with Bison / Antlr / Packrat / Elkhound ...

c++ parsing compiler-construction llvm bison

I suggest http://www.canonware.com/Parsing/, since it is pure Python and you don't need to learn a grammar, but it isn't widely used and has comparatively little documentation. The heavyweights are ANTLR and PyParsing. ANTLR can generate Java and C++ parsers too, and AST walkers, but you will have to learn what amounts to a new language.

Resources for lexing, tokenising and parsing in python - Stack Overflo...

python parsing resources lex

You can also implement the DFA using parser generators like JavaCC or ANTLR. These tools help with parsing the language grammar and building an AST.

If you can model your DFA states as a set of grammar rules, you can use these libraries.

Actually, I am creating a Java code generator for regular expressions. So far I have implemented two possible approaches (if-else and graph-based), but I want to provide more. I think it could also be implemented using data structures such as a Set or Map for the transitions.
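The Map-based approach mentioned above can be sketched as a table-driven DFA. This is a minimal illustration, not the asker's generator; the example machine (accepting binary strings with an even number of 1s) is invented for the sketch and translates directly to a Java `Map<StateSymbolPair, State>`:

```python
# Table-driven DFA sketch: transitions live in a map keyed by
# (state, symbol) instead of nested if-else chains.
# Example machine: accepts binary strings with an even number of 1s.

EVEN, ODD = "even", "odd"

TRANSITIONS = {
    (EVEN, "0"): EVEN,
    (EVEN, "1"): ODD,
    (ODD,  "0"): ODD,
    (ODD,  "1"): EVEN,
}
START = EVEN
ACCEPTING = {EVEN}

def accepts(text):
    state = START
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:   # missing entry acts as a dead state
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING

print(accepts("1010"))  # True  (two 1s)
print(accepts("111"))   # False (three 1s)
```

The nice property for a code generator is that only the table changes per regular expression; the driver loop stays identical.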

java - Which are the best ways to implement a DFA? - Stack Overflow

java finite-automata dfa automata

I'd go with ANTLR (and actually I go for parsing Java). It supports a lot of target languages and also has a lot of example grammars that you get for free: http://www.antlr.org/grammar/list. Unfortunately they aren't all perfect (the Java grammar has no AST rules), but they give you a good start, and I suppose the community is quite big for a parser generator.

The great thing about ANTLR, apart from the many language targets, is that LL(*) combined with the predicates it supports is very powerful and easy to understand, and the generated parsers are too.

By "extendable to multiple languages" I suppose you mean multiple source languages. This isn't easy, but I suppose you might have some success translating them to ASTs that share as many common symbols as possible, and writing a general tree walker that can handle the differences between those languages. But this could be quite difficult.

Be warned, though, that the online documentation is only good once you've read the official antlr book and understand LL(*) and semantic and syntactic predicates.

parsing - Best way to tokenize and parse programming languages in my a...

programming-languages parsing lexer

For parsing I always try to use something already proven to work: ANTLR, together with ANTLRWorks, which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages), but you need to build the ANTLR runtime for those languages.

Of course, if you find flex or bison easier to use, you can use them too (I believe they generate only C and C++, but I may be wrong, since I haven't used them for some time).

c++ - Finite State Machine parser - Stack Overflow

c++ design parsing stream fsm

The canonical parser generator is called yacc; there's a GNU version of it called bison. These are both C-based tools, so they should integrate nicely with your C++ code. There is a tool for Java called ANTLR which I've heard very good things about (i.e. it's easy to use and powerful). Keep in mind that with yacc or bison you will have to write a grammar in their language. This is most certainly doable, but not always easy. It's important to have a theoretical background in LR(k) parsing so you can understand what it means when it tells you to fix your ambiguous grammar.

c++ - How to turn type-labeled tokens into a parse-tree? - Stack Overf...

c++ compiler-construction parsing

If you're building a complex programming language, you should strongly consider using a parser generator like bison or ANTLR to do the parsing for you. The advantage of such tools is that you can just describe what the rules of your language are, along with what to do when such rules are found, and the tool will automatically generate the parsing code for you.

bison supports bottom-up parsers in the LR family: the LALR(1), LR(1), GLR, and new IELR(1) algorithms. These capture a large family of languages, but you need to know a bit about the parsing algorithm in order to fix some of the errors you might encounter (namely, shift/reduce and reduce/reduce conflicts).

ANTLR uses LL(*) parsers, which capture a slightly smaller set of languages but tend to work beautifully on many programming languages.

If you insist on rolling your own parser, then you can actually implement the above algorithms by hand, but it's extremely difficult. The easiest option is to use a top-down recursive descent parser with backtracking, or to jiggle the grammar until it is LL(1) and then use a simple top-down, non-backtracking parser. That said, I think you are making things much harder than they need to be.
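The "jiggle the grammar until it is LL(1)" route above can be made concrete with a toy example. This is a generic sketch, not tied to the asker's language; the grammar and all names are invented for illustration:

```python
# Minimal LL(1) recursive-descent parser/evaluator for the grammar
#   expr : term (('+' | '-') term)* ;
#   term : NUMBER | '(' expr ')' ;
# Because one token of lookahead always decides the next rule,
# no backtracking is needed.
import re

def tokenize(src):
    return re.findall(r"\d+|[+\-()]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        t = self.peek()
        if t is None or (tok is not None and t != tok):
            raise SyntaxError(f"expected {tok!r}, got {t!r}")
        self.pos += 1
        return t

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):   # LL(1): one token decides
            op = self.eat()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser(tokenize("1 + (2 - 3) + 4")).expr())  # 4
```

Each nonterminal becomes one method, and the loop in `expr` gives left-associative operators without left recursion, which is exactly the transformation LL parsing requires.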

+1 for the IELR reference. Always good when an old dog learns new tricks.

After doing a lot of googling about Bison, ANTLR, LL(*) parsers, etc., it seems that they all work on context-free grammars, and I am fairly certain that my language isn't context-free, although I'm not quite sure, as I'm still working to decipher all these Wikipedia pages...

@zacaj: Most programming languages aren't context-free, but they can be thought of as context-free languages with some extra constraints representing things like scoping, etc. The language you've described certainly looks context-free enough to be parsed with these algorithms.

parsing - How to parse complex function calls in custom language - Sta...

parsing programming-languages

Maybe you are missing out on ANTLR, which is good for languages that can be defined with a recursive-descent parsing strategy.

There are potentially some advantages to using Yacc/Lex, but it is not mandatory to use them. There are some downsides to using Yacc/Lex too, but the advantages usually outweigh the disadvantages. In particular, it is often easier to maintain a Yacc-driven grammar than a hand-coded one, and you benefit from the automation that Yacc provides.

However, writing your own parser from scratch is not an evil thing to do. It may make it harder to maintain in future, but it may make it easier, too.

programming languages - yacc/lex or hand-coding? - Stack Overflow

programming-languages yacc

Of course older techniques are still common (e.g. using Flex and Bison), but many newer language implementations combine the lexing and parsing phases by using a parser based on a parsing expression grammar (PEG). This works for recursive-descent parsers created using combinators, or for memoizing packrat parsers. Many compilers are also built using the ANTLR framework.

How to create a language these days? - Stack Overflow

programming-languages language-design compiler-construction