Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - awk error "makes too many open files"

I have an awk-based splitter that splits a huge file based on a regex. The problem is that I am getting a "makes too many open files" error, even though I have a conditional close. If you could help me figure out what I am doing wrong, I would be very grateful.

    awk 'BEGIN { system("mkdir -p splitted/sub"++j) }
    /<doc/{x="F"++i".xml";}{
     if (i%5==0 ){
       ++i;
       close("splitted/sub"j"/"x);
       system("mkdir -p splitted/sub"++j"/");
      }
     else{
       print > ("splitted/sub"j"/"x);
     }
    }' wiki_parsed.xml

1 Reply

The simple answer is that close() isn't being called often enough. Here's an illustrative example of why.

Take an input file like:

    <doc somestuff
    another line
    yet another line
    <doc the second
    still more data
    <doc the third
    <doc the fourth
    <doc the fifth

I can make an executable awk file based on your script like:

    #!/usr/bin/awk -f

    BEGIN { system_(++j) }

    /<doc/{x=++i}

    {
        if (i%5==0 ){ ++i; close_(j"/"x); system_(++j) }
        else{ open_(j"/"x) }
    }

    function call_f(funcname, arg) { print funcname"("arg")" }

    function system_(cnt) { call_f( "system", cnt ) }
    function open_(f) { if( !(f in a) ) { call_f( "open", f ); a[f]++ } }
    function close_(f) { call_f( "close", f ) }

If I put this into a file called awko and make it executable, running it as awko data (where data is the input file above) produces the following:

    system(1)
    open(1/1)
    open(1/2)
    open(1/3)
    open(1/4)
    close(1/5)
    system(2)

The script above only reports how many times each function is called, by shadowing each real call with a local function that has a trailing _. Notice how many times open() is printed compared to close() for the same arguments: four files are opened, and the only close() names a file (1/5) that was never opened. I also renamed print > to open_ to illustrate that it is what opens the files (once per file name).
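
To make the "once per file name" point concrete, here is a minimal sketch with real calls instead of the logging stubs. The demo_N.txt file names are made up for illustration. Each file is opened by its first print >, reused by the second print, and released by close() before the loop moves on, so only one output file is open at any time.

```shell
awk 'BEGIN {
    for (k = 1; k <= 3; k++) {
        f = "demo_" k ".txt"     # hypothetical output name
        print "first"  > f       # first print > opens (and truncates) f
        print "second" > f       # same name: reuses the already open file
        close(f)                 # release it before the next file is opened
    }
}'
```

Without the close(f) line the script would still work for three files, but every file would stay open until awk exits, which is what eventually runs into the open-file limit on a huge input.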

If I change the executable awk file to the following, you can see close() being called often enough:

    #!/usr/bin/awk -f

    BEGIN { system_(++j) }

    /<doc/{ close_(j"/"x); x=++i } # close_() call is moved to here.

    {
        if (i%5==0 ){ ++i; system_(++j) }
        else{ open_(j"/"x) }
    }

    function call_f(funcname, arg) { print funcname"("arg")" }

    function system_(cnt) { call_f( "system", cnt ) }
    function open_(f) { if( !(f in a) ) { call_f( "open", f ); a[f]++ } }
    function close_(f) { call_f( "close", f ) }

which gives the following output:

    system(1)
    close(1/)
    open(1/1)
    close(1/1)
    open(1/2)
    close(1/2)
    open(1/3)
    close(1/3)
    open(1/4)
    close(1/4)
    system(2)

Here close() is called one more time than necessary: the first call names a file that was never opened. A real close() call should simply ignore a file that has never been printed to, and no actual close will be attempted. In every other case, each open() is matched by a close() call.
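
If you want to check the "harmless extra close()" claim yourself, a small sketch like the following should do. The file name is hypothetical, and I am assuming the usual behaviour (as in gawk, mawk and the one-true-awk) where close() returns a nonzero value when nothing was open under that name.

```shell
# Ask awk to close a name that no print ever opened (hypothetical file name).
result=$(awk 'BEGIN {
    status = close("never-opened.txt")
    print (status != 0 ? "nothing was open" : "closed")
}')
echo "$result"
```

Either way, awk keeps running and no file is created or touched.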

Moving the close() call in your script, as in the second example script, should fix your error.
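
For reference, here is a rough sketch of a splitter built around that idea. It is not a line-for-line patch of your script. The variables out, dir, n and d are names I made up, sample.xml is a tiny stand-in for wiki_parsed.xml, and I am assuming you want five documents per sub directory. The important part is that the previous output file is closed whenever a new <doc line starts, using exactly the same path string that was used to open it.

```shell
# Hypothetical sample input standing in for wiki_parsed.xml.
printf '%s\n' '<doc 1' 'a' '<doc 2' 'b' '<doc 3' '<doc 4' '<doc 5' '<doc 6' 'c' > sample.xml

awk '
/<doc/ {
    if (out != "") close(out)        # close the previous document first
    if (n % 5 == 0) {                # every 5 documents, start a new directory
        dir = "splitted/sub" (++d)
        system("mkdir -p " dir)
    }
    n++
    out = dir "/F" n ".xml"          # remember the exact path for the later close()
}
out != "" { print > out }            # lines before the first <doc are skipped
' sample.xml
```

Keeping the full path in one variable avoids the situation in the first trace, where the name passed to close() never matched a name that had been opened.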

