Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


streaming - Can an ANTLR4 Java parser handle very large files, or can it stream files?

Is the Java parser generated by ANTLR capable of streaming arbitrarily large files?

I tried constructing a Lexer with an UnbufferedCharStream and passing it to the parser. I got an UnsupportedOperationException because of a call to size() on the UnbufferedCharStream; the exception message explained that you can't call size() on an UnbufferedCharStream.

    // Lexer and Parser here stand for the classes generated from the grammar
    Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    Parser parser = new Parser(stream);

I have a file that I exported from Hadoop using Pig. It has a large number of rows separated by '\n', and each column is separated by a '\t'. This is easy to parse in Java: I use a BufferedReader to read each line, then split on '\t' to get each column. But I also want some sort of schema validation. The first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
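The line-by-line approach described above, with a schema check added, might look like the following minimal sketch. It assumes tab-separated columns, an ISO `yyyy-MM-dd` date, exactly two price columns, and one hex column. The question does not specify any of those layout details, so they and the class name `RowValidator` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

public class RowValidator {
    // Assumed layout (hypothetical): one date, two prices, one hex value.
    private static final int PRICE_COLUMNS = 2;
    private static final Pattern HEX = Pattern.compile("[0-9a-fA-F]+");

    /** Returns true if the tab-separated line matches the assumed schema. */
    static boolean isValid(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length != 1 + PRICE_COLUMNS + 1) {
            return false;
        }
        try {
            LocalDate.parse(cols[0]); // ISO-8601 date, e.g. 2014-03-01
            for (int i = 1; i <= PRICE_COLUMNS; i++) {
                new BigDecimal(cols[i]); // throws NumberFormatException if not numeric
            }
        } catch (DateTimeParseException | NumberFormatException e) {
            return false;
        }
        return HEX.matcher(cols[cols.length - 1]).matches();
    }

    /** Streams the input one line at a time and counts the invalid rows. */
    static long countInvalid(Reader source) throws IOException {
        long invalid = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!isValid(line)) {
                    invalid++;
                }
            }
        }
        return invalid;
    }

    public static void main(String[] args) throws IOException {
        String sample = "2014-03-01\t10.50\t11.25\t1fA0\n"
                      + "not-a-date\t10.50\t11.25\t1fA0\n";
        System.out.println(countInvalid(new StringReader(sample))); // prints 1
    }
}
```

Only one line is held in memory at a time, so the file size does not matter; for a real file, a `FileReader` or `Files.newBufferedReader` would replace the `StringReader`.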

Looking at the generated parser code, I could call it like so:

    parser.lines().line()

This would give me a List, which conceptually I could iterate over. But the list seems to have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.

Is there another part of the API that allows streaming really large files, such as a way to have a Visitor or Listener called while the file is being read? It can't keep the entire file in memory, because the file will not fit.


1 Reply


You could do it like this:

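    // needs java.io.* and org.antlr.v4.runtime.*
    InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
    // UnbufferedCharStream reads on demand; ANTLRInputStream would load the whole file into memory
    CharStream input = new UnbufferedCharStream(is);
    GeneratedLexer lex = new GeneratedLexer(input);
    lex.setTokenFactory(new CommonTokenFactory(true)); // copy token text out of the stream
    TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
    GeneratedParser parser = new GeneratedParser(tokens);
    parser.setBuildParseTree(false); // important: do not build a parse tree
    parser.top_level_rule();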

And if the file is quite big, forget about a listener or visitor; I would create the objects directly in the grammar. Just put them all in some structure (e.g. HashMap, Vector...) and retrieve them as needed. This avoids creating the parse tree, which is what really takes a lot of memory.
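One way to read "creating objects directly in the grammar" is that an embedded action at the end of the per-line rule hands each finished row to ordinary Java code. The sketch below shows only that Java side. `Row`, `RowSink`, `onRow`, and `TotalsByDate` are hypothetical names, and the grammar action that would call `sink.onRow(...)` is assumed rather than shown. Since the file in the question does not fit in memory, this sink keeps a running total per date in a HashMap rather than storing every row.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

public class RowSinkDemo {
    /** Hypothetical value object a grammar action would build for each line. */
    static final class Row {
        final String date;
        final BigDecimal price;

        Row(String date, BigDecimal price) {
            this.date = date;
            this.price = price;
        }
    }

    /** Hypothetical callback; an embedded action would call onRow once per line. */
    interface RowSink {
        void onRow(Row row);
    }

    /** Keeps one running total per date, so memory grows with dates, not rows. */
    static final class TotalsByDate implements RowSink {
        final Map<String, BigDecimal> totals = new HashMap<>();

        @Override
        public void onRow(Row row) {
            totals.merge(row.date, row.price, BigDecimal::add);
        }
    }

    public static void main(String[] args) {
        TotalsByDate sink = new TotalsByDate();
        // Stand-in for the parser: in a real setup these calls would come from the grammar.
        sink.onRow(new Row("2014-03-01", new BigDecimal("10.50")));
        sink.onRow(new Row("2014-03-01", new BigDecimal("1.25")));
        sink.onRow(new Row("2014-03-02", new BigDecimal("3.00")));
        System.out.println(sink.totals.get("2014-03-01")); // prints 11.75
    }
}
```

Each `Row` becomes garbage as soon as `onRow` returns, so only the aggregate survives the parse.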

