Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

php - recursively parse custom markup

I have to handle an existing custom markup language (which is ugly, but unfortunately cannot be altered because I'm handling legacy data and it needs to stay compatible with a legacy app).

I need to parse command "ranges", and depending on the action taken by the user either replace these "ranges" in the data with something else (HTML or LaTeX code) or entirely remove these "ranges" from the input.

My current solution uses preg_replace_callback() in a loop until there are no matches left, but it is extremely slow for huge documents (for example, ~7 seconds for 394 replacements in a 57 KB document).

Recursive regular expressions don't seem to be flexible enough for this task, as I need to access all matches, even in recursion.

Question: How could I improve the performance of my parsing?

Regular expressions may be removed completely. They are not a requirement, just the only thing I could come up with.

Note: The code example below is heavily reduced (SSCCE). In reality there are many different "types" of ranges, and the closure function does different things depending on the mode of operation (insert values from DB, remove entire ranges, convert to another format, etc.). Please keep this in mind!

Example of what I'm currently doing:

<?php
$data = <<<EOF
some text 1
begin-command
    some text 2
    begin-command
        some text 3
    command-end
    some text 4
    begin-command-if "%VAR%" == "value"
        some text 5
        begin-command
            some text 6
        command-end
    command-end
command-end

EOF;

$regex = '~
    # opening tag
    begin-(?P<type>command(?:-if)?)
    # must not contain a nested "command" or "command-if" command!
    (?!.*begin-command(?:-if)?.*command(?:-if)?-end)
    # the parameters for "command-if" are optional
    (?:
        [\s\n]*?
        (?:")[\s\n]*(?P<leftvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*
        # the operator is optional
        (?P<operator>[=<>!]*)
        [\s\n]*
        (?:")[\s\n]*(?P<rightvalue>[^\\\\]*?)[\s\n]*(?:")
        [\s\n]*?
    )?
    # the real content
    (?P<content>.*?)
    # closing tag
    command(?:-if)?-end
 ~smx';

$counter = 0;
$loop_replace = true;
while ($loop_replace) {
    // bind the counter by reference so the closure can increment it
    $data = preg_replace_callback($regex, function ($matches) use (&$counter) {
        $counter++;
        return "<command id='{$counter}'>{$matches['content']}</command>";
    }, $data, -1, $loop_replace);
}
echo $data;

1 Reply

Your lookahead on the 4th line of your regex:

(?!.*begin-command(?:-if)?.*command(?:-if)?-end)

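This has to read to the end of your file every time it is encountered: with the s modifier, . also matches newlines, so each greedy .* first runs to the end of the input and only then backtracks.

Making your .* quantifiers lazy may earn you a bit of a performance boost on those large files, because the lookahead can stop as soon as it finds a nested block nearby: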

(?!.*?begin-command(?:-if)?.*?command(?:-if)?-end)
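
As a quick sanity check (not a benchmark), the sketch below applies the greedy and the lazy lookahead to a small made-up sample and confirms that both produce the same output. The helper names buildPattern and replaceAll, the sample string, and the reduced pattern without the command-if parameters are assumptions for illustration only.

```php
<?php
// Hypothetical miniature input, not taken from the real legacy data.
$sample = "a begin-command b begin-command c command-end d command-end e";

// Build a reduced version of the pattern; $dot is either '.*' or '.*?'.
function buildPattern(string $dot): string {
    return '~begin-command(?:-if)?'
         . '(?!' . $dot . 'begin-command(?:-if)?' . $dot . 'command(?:-if)?-end)'
         . '(?P<content>.*?)command(?:-if)?-end~s';
}

// Replace innermost blocks repeatedly until a pass makes no replacement.
function replaceAll(string $pattern, string $data): string {
    $counter = 0;
    do {
        $data = preg_replace_callback($pattern, function ($m) use (&$counter) {
            $counter++;
            return "<command id='{$counter}'>{$m['content']}</command>";
        }, $data, -1, $count);
    } while ($count > 0);
    return $data;
}

$greedy = replaceAll(buildPattern('.*'), $sample);
$lazy   = replaceAll(buildPattern('.*?'), $sample);

var_dump($greedy === $lazy); // both variants should rewrite the sample identically
echo $lazy, "\n";
```

This only shows that the two variants agree on this sample. Whether the lazy form is actually faster depends on the document and would need to be measured on the real data.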

Also, if the (?:-if)? always comes directly after begin-command, you can drop it inside the lookahead, since the following .*? will consume it anyway. That would make it something like:

(?!.*?begin-command.*?command(?:-if)?-end)  
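
Since the question allows dropping the repeated regex passes entirely, one further direction (not part of the answer above) is a single linear pass with a stack. The sketch below is a minimal illustration under several assumptions: parseCommands and the $render callback are hypothetical names, blocks are numbered in the order they close, and the parameters of command-if are left inside the content for the callback to parse.

```php
<?php
// Sketch: tokenize once, then build the output with a stack of buffers.
function parseCommands(string $data, callable $render): string {
    // Split on opening and closing tags; the captured delimiters stay in the
    // token list, so even indexes are plain text and odd indexes are tags.
    $tokens = preg_split(
        '~(begin-command(?:-if)?|command(?:-if)?-end)~',
        $data, -1, PREG_SPLIT_DELIM_CAPTURE
    );

    $stack = [''];  // output buffers; index 0 is the document level
    $types = [];    // type of each block that is currently open

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 0) {
            // plain text: append to the innermost open buffer
            $stack[count($stack) - 1] .= $token;
        } elseif (strncmp($token, 'begin-', 6) === 0) {
            // opening tag: remember the type and start a new buffer
            $types[] = substr($token, 6);
            $stack[] = '';
        } else {
            // closing tag: render the finished block into its parent buffer
            if (!$types) {
                throw new RuntimeException('Closing tag without opening tag');
            }
            $content = array_pop($stack);
            $type = array_pop($types);
            $stack[count($stack) - 1] .= $render($type, $content);
        }
    }
    if ($types) {
        throw new RuntimeException('Opening tag without closing tag');
    }
    return $stack[0];
}

// Usage with a made-up sample and a callback that mimics the question's output.
$sample = "a begin-command b begin-command-if c command-end d command-end e";
$counter = 0;
$html = parseCommands($sample, function (string $type, string $content) use (&$counter): string {
    $counter++;
    return "<command id='{$counter}'>{$content}</command>";
});
echo $html, "\n";
```

Each tag is visited once, so the cost grows with the document size rather than with the number of passes. The callback receives the block type, which leaves room for the different modes of operation mentioned in the question (insert, remove, convert).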
