Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
521 views
in Technique[技术] by (71.8m points)

linux - Randomly Pick Lines From a File Without Slurping It With Unix

I have a 10^7 lines file, in which I want to choose 1/100 of lines randomly from the file. This is the AWK code I have, but it slurps all the file content before hand. My PC memory cannot handle such slurps. Is there other approach to do it?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {  
  for ( i=1;i<=c ;i++ )  { 
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
 }' file
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

if you have that many lines, are you sure you want exactly 1% or a statistical estimate would be enough?

In that second case, just randomize at 1% at each line...

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

If you'd like the header line plus a random sample of lines after, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...