algorithm - Intersection of files

Question


posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)


I have two large files (27k lines and 450k lines). They look something like this:

File1:
1 2 A 5
3 2 B 7
6 3 C 8
...

File2:
4 2 C 5
7 2 B 7
6 8 B 8
7 7 F 9
...

I want the lines from both files whose third-column value appears in both files (note that the lines with A and F are excluded):

OUTPUT:
3 2 B 7
6 3 C 8
4 2 C 5
7 2 B 7
6 8 B 8

What's the best way to do this?


1 Reply

深蓝 · Answer 1 · 2021-10-17T03:06:54+0000

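First, sort each file on the third field. The key is written as 3,3 so that it stops at the end of field 3; a bare -k 3 would extend the key to the end of the line:

sort -k 3,3 file1 > file1.sorted
sort -k 3,3 file2 > file2.sorted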

Then use comm to get the third-field values that occur in both files:

comm -12 <(cut -d " " -f 3 file1.sorted | uniq) <(cut -d " " -f 3 file2.sorted | uniq) > common_values.field
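To make the comm step concrete, here is a minimal sketch run on the sample rows from the question. The temporary directory and the keys1/keys2 file names are assumptions made for illustration, and LC_ALL=C is set on the assumption that byte-order collation is acceptable for the data. Temporary files replace the <( ) syntax so the sketch does not depend on bash.

```shell
#!/bin/sh
# Sketch: which third-column values occur in both files?
# Sample rows come from the question; paths and file names are illustrative.
set -e
export LC_ALL=C   # assumption: byte-order collation, so sort and comm agree

workdir=$(mktemp -d)
printf '%s\n' '1 2 A 5' '3 2 B 7' '6 3 C 8' > "$workdir/file1"
printf '%s\n' '4 2 C 5' '7 2 B 7' '6 8 B 8' '7 7 F 9' > "$workdir/file2"

# Extract the third column; sort -u sorts the keys and drops duplicates,
# which gives comm the sorted input it requires.
cut -d ' ' -f 3 "$workdir/file1" | sort -u > "$workdir/keys1"
cut -d ' ' -f 3 "$workdir/file2" | sort -u > "$workdir/keys2"

# -1 and -2 suppress the lines unique to each input, leaving the intersection.
common=$(comm -12 "$workdir/keys1" "$workdir/keys2")
printf '%s\n' "$common"
```

On the sample data only B and C survive, since A appears in file1 alone and F in file2 alone.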

Now join each sorted file with the list of common values:

join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

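The -o option formats the output so that the fields keep the same order as in the input files. Only standard Unix tools are used: sort, comm, cut, uniq, and join. The <( ) process substitution works in bash; in other shells, use temporary files instead. Both comm and join expect their input to be sorted in the same collation order that sort used, so run all of the commands under the same locale (for example, with LC_ALL=C).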
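As a check on the whole procedure, the following sketch runs the three steps end to end on the sample rows from the question. The temporary working directory and the keys1/keys2 intermediate files are assumptions for illustration. It uses sort -k 3,3 and LC_ALL=C so that sort, comm, and join all agree on ordering, which is a cautious choice rather than something the answer specifies.

```shell
#!/bin/sh
# Sketch: sort / comm / join pipeline on the question's sample data.
set -e
export LC_ALL=C   # assumption: one fixed collation order for every tool

workdir=$(mktemp -d)   # illustrative scratch directory
cd "$workdir"
printf '%s\n' '1 2 A 5' '3 2 B 7' '6 3 C 8' > file1
printf '%s\n' '4 2 C 5' '7 2 B 7' '6 8 B 8' '7 7 F 9' > file2

# Step 1: sort each file on the third field only.
sort -k 3,3 file1 > file1.sorted
sort -k 3,3 file2 > file2.sorted

# Step 2: third-field values present in both files.
cut -d ' ' -f 3 file1.sorted | uniq > keys1
cut -d ' ' -f 3 file2.sorted | uniq > keys2
comm -12 keys1 keys2 > common_values.field

# Step 3: keep only rows whose third field is a common value.
# -1 3 joins on field 3 of the data file; -o restores the original column order.
join -1 3 -o '1.1,1.2,1.3,1.4' file1.sorted common_values.field > file.joined
join -1 3 -o '1.1,1.2,1.3,1.4' file2.sorted common_values.field >> file.joined

result=$(cat file.joined)
printf '%s\n' "$result"
```

The result contains the same five lines as the OUTPUT in the question, though rows from the same file are ordered by their third field rather than by their original position.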
