Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
776 views
in Technique[技术] by (71.8m points)

bash - Sampling without replacement using awk

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

Is there a way to do a sampling without replacement using awk?

For example, I have this 8 lines, and I only want to sample 4 of these randomly in a new file, without replacement. The output should look something like this:

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT    
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thanks in advance

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

How about this for a random sampling of 10% of your lines?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.

Basically, it looks at each line of each file precisely once and generates a random number on the interval 0 to 1. If the random number is greater than 0.9, the line is output. So basically it is rolling a 10 sided dice for each line and only printing it if the dice comes up as 10. No chance of a line being printed twice - unless it occurs twice in your files, of course.

For added randomness (!) you can add an srand() at the start as suggested by @klashxx

awk 'BEGIN{srand()} rand()>0.9' yourfile(s)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...