Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Perl replace nested blocks regular expression

I need to get the nested blocks into a hash of arrays or a hash tree so that I can substitute the blocks with dynamic content. I need to replace the code between

<!--block:XXX-->

and the first closing end block

<!--endblock--> 

with my dynamic content.

I have this code, which finds single-level comment blocks but not nested ones:

#<!--block:listing-->... html code block here ...<!--endblock-->
$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;

Here is the complete nested HTML template that I want to process. I need to find the innermost block, "block:third", and replace it with my content, then find "block:second" and replace it, and finally find the outer block, "block:first", and replace it. Note that there can be any number of nested levels, not just three as in the example below.

use Data::Dumper;

$content=<<HTML;
some html content here

<!--block:first-->
    some html content here

    <!--block:second-->
        some html content here

        <!--block:third-->
            some html content here
        <!--endblock-->

        some html content here
    <!--endblock-->

    some html content here
<!--endblock-->
HTML

$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;
print Dumper(\%blocks);

So I can access and modify the blocks with, for example, $blocks{first} = "my content here" and $blocks{second} = "another content here", and then replace the blocks.

I created this regex

1 Reply


Update:

This is in response to the question about combining everything into a single regex.

It appears you don't care about reconstructing the order of the HTML.
So, if you just want to isolate the content of each sub-section, the code below is all you need.
However, you will need lists (array references, []) to preserve the order of the embedded sub-sections.

After revisiting this question, note that the regex used below is the one you should be using: its lookahead stops only at <!--block:...--> markers, so ordinary HTML comments inside a block are kept as content.

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my $href = {};

ParseCore( $href, $content );

#print Dumper($href);

print "\nBase======================\n";
print $href->{content};
print "\nFirst======================\n";
print $href->{first}->{content};
print "\nSecond======================\n";
print $href->{first}->{second}->{content};
print "\nThird======================\n";
print $href->{first}->{second}->{third}->{content};
print "\nFourth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{content};
print "\nFifth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content};

exit;

sub ParseCore
{
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(?is)(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--block:.*?-->).)+))/g )
    {
       if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $1;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
        }
        else
        {
           $aref->{content} .= $4;
        }
    }
    return $k;
}

#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

Output >>

Base======================
some html content here top base

some html content here1-5 bottom base

some html content here 6-8 top base

some html content here 6-8 bottom base
First======================

    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top

    some html content here 1 bottom

Second======================

        some html content here 2 top

        some html content here 2 bottom

Third======================

            some html content here 3 top

            some html content here 3a
            some html content here 3b

Fourth======================

                some html content here 4 top


Fifth======================

                    some html content here 5a
                    some html content here 5b
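
The reason the lookahead tests for <!--block: rather than for any comment can be seen in isolation. The following is only a minimal sketch: it uses just the "plain content" alternative of each regex, and the names $text, @any and @block_only are made up for the illustration.

```perl
use strict;
use warnings;

# A line of content containing an ordinary HTML comment.
my $text = 'a <!--note--> b';

# Lookahead that stops at ANY comment (as in the original sample regex).
my @any = $text =~ /((?:(?!<!--.*?-->).)+)/gs;

# Lookahead that stops only at block markers (as in the update's regex).
my @block_only = $text =~ /((?:(?!<!--block:.*?-->).)+)/gs;

print "any:        [$_]\n" for @any;         # [a ] then [!--note--> b]
print "block only: [$_]\n" for @block_only;  # [a <!--note--> b]
```

With the first form the comment is split and its leading "<" is dropped, because no alternative can match at that position and the engine moves on by one character. With the second form the comment passes through untouched.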

You can use regex recursion to match the outer nesting levels, then parse the inner cores
using a simple recursive function call.
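
For readers who have not used regex recursion before, here is a minimal sketch of (?R) on a simpler problem, balanced parentheses. It is not the block regex itself; the sample string and variable names are made up for the illustration. (?R) and the possessive ++ both need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical input: two top-level groups, the first one nested.
my $text = 'f(a, g(b, c), d) + h(e)';

# An opening paren, then any mix of non-paren runs or a recursive
# match of the whole pattern, then the closing paren.
my @groups = $text =~ /\((?:[^()]++|(?R))*\)/g;

print "$_\n" for @groups;
# (a, g(b, c), d)
# (e)
```

Only the outermost groups are returned. To see the inner ones, the match has to be re-applied to the inside of each group, which is what the recursive function call does for blocks.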

It is then also possible to parse the content at the nesting level you are on.
It is also possible to build a nested structure along the way, so that you can later
do the template substitutions.

You can then reconstruct the HTML.
The only tricky part is traversing the structure. But if you know how to traverse
nested data (scalars, array and hash references, and such), it should be no problem.

Here is the regex, expanded with comments.

    # (?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)

    (?is)                         # Modifiers: Case insensitive, Dot-all
    <!--block:                    # Begin BLOCK
    ( .*? )                       # (1), block name
    -->

    (                             # (2 start), Begin Core
         (?:
              (?:
                   (?!
                        <!--
                        (?: .*? )
                        -->
                   )
                   . 
              )
           |  (?R) 
         )*?
    )                             # (2 end), End Core

    <!--endblock-->               # End BLOCK
 |  
    (                             # (3 start), Or grab content within this core
         (?:
              (?! <!-- .*? --> )
              . 
         )+
    )                             # (3 end)
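
The expanded form can be used almost as written by adding the /x modifier, which makes Perl ignore whitespace and # comments inside the pattern. The sketch below assumes a tiny made-up input and a made-up @found array, and moves the (?is) modifiers to the end of the match operator.

```perl
use strict;
use warnings;

my $text = 'top<!--block:a-->inner<!--endblock-->tail';
my @found;

while ( $text =~ m{
        <!--block: ( .*? ) -->            # (1) block name
        (                                 # (2) core
            (?: (?: (?! <!-- .*? --> ) . ) | (?R) )*?
        )
        <!--endblock-->
      |
        ( (?: (?! <!-- .*? --> ) . )+ )   # (3) plain content
    }xisg )
{
    # $1 is defined only when the block alternative matched.
    push @found, defined $1 ? "block $1: $2" : "text: $3";
}

print "$_\n" for @found;
# text: top
# block a: inner
# text: tail
```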

Perl test case

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my %blocks = ();
$blocks{'base'} = [];


ParseCore( $blocks{'base'}, $content );


sub ParseCore
{
    my ($aref, $core) = @_;
    while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g )
    {
        if ( defined $1 )
        {
           my $branch = {};
           push @{$aref}, $branch;
           $branch->{$1} = [];
           ParseCore( $branch->{$1}, $2 );
        }
        elsif ( defined $3 )
        {
           push @{$aref}, $3;
        }
    }

}

print Dumper(\%blocks);

__DATA__

some html content here top base
<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

Output >>

$VAR1 = {
          'base' => [
                      '
some html content here top base
',
                      {
                        'first' => [
                                     '
    some html content here 1 top
    ',
                                     {
                                       'second' => [
                                                     '
        some html content here 2 top
        ',
                                                     {
                                                       'third' => [
                                                                    '
            some html content here 3a
            some html content here 3b
        '
                                                                  ]
                                                     },
                                                     '
        some html content here 2 bottom
    '
                                                   ]
                                     },
                                     '
    some html content here 1 bottom
'
                                   ]
                      },
                      '
some html content here bottom base
'
                    ]
        };
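
Once the structure is built, reconstructing the HTML with substitutions is a matter of walking it. The following is only a sketch under assumptions: parse_core is a copy of the ParseCore sub above, while render, the $replace hash and the sample template are hypothetical names introduced for the illustration.

```perl
use strict;
use warnings;

# Same parsing logic as ParseCore above: plain strings and
# { name => [ ... ] } branches are pushed in document order.
sub parse_core {
    my ($aref, $core) = @_;
    while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g ) {
        if ( defined $1 ) {
            my ($name, $inner) = ($1, $2);
            my $branch = { $name => [] };
            push @$aref, $branch;
            parse_core( $branch->{$name}, $inner );
        }
        elsif ( defined $3 ) {
            push @$aref, $3;
        }
    }
}

# Hypothetical helper: rebuild the text, swapping in a replacement
# for every block whose name appears in %$replace.
sub render {
    my ($aref, $replace) = @_;
    my $out = '';
    for my $item (@$aref) {
        if ( ref $item eq 'HASH' ) {
            my ($name) = keys %$item;
            $out .= exists $replace->{$name}
                  ? $replace->{$name}
                  : render( $item->{$name}, $replace );
        }
        else {
            $out .= $item;    # plain content is copied as-is
        }
    }
    return $out;
}

my @tree;
parse_core( \@tree,
    'top <!--block:first-->1a <!--block:second-->2<!--endblock--> 1b<!--endblock--> bottom' );

my $html = render( \@tree, { second => 'NEW' } );
print "$html\n";    # top 1a NEW 1b bottom
```

Blocks without a replacement are rendered from their own children, with the block markers themselves dropped. Replacing an outer block discards everything nested inside it.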
