Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
651 views
in Technique[技术] by (71.8m points)

algorithm - Intersection of files

I two large files (27k lines and 450k lines). They look sort of like:

File1:
1 2 A 5
3 2 B 7
6 3 C 8
...

File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
... 

I want the lines from both files in which the 3rd column is in both files (note lines with A and F were excluded):

OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8

whats the best way?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

first we sort the files on the third field :

sort -k 3 file1 > file1.sorted
sort -k 3 file2 > file2.sorted

then we get common values on the 3rd field using comm :

comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field

now we can join each sorted file on the common values :

join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

output is formated so we get the same field order as the one used in the files. Standard unix tools used : sort, comm, cut, uniq, join. The <( ) works with bash, for other shells you might use temp files instead.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...