Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
604 views
in Technique[技术] by (71.8m points)

awk - Cross-reference multiple lists with master list to produce merged output with missing values as NA

master_list.txt

chr4:124900550:T:C
chr7:82771887:T:C
chr2:177046732:C:A
chr13:91709687:T:G
chr1:23216340:T:C
chr5:28993821:C:T
chr6:108644910:C:T
chr7:83971303:A:T
chr7:9570668:A:G
chr9:130711613:A:G
chr1:199677414:G:A
chr12:101848399:G:A
chr10:101438297:C:A

I have multiple data files that contain some, but not all, IDs in master_list.txt, with additional data for that ID. The additional data will have a consistent number of columns per data file, but a variable number of columns for different data files.

Based on shared IDs in column 1 (all files), I want to populate the master_list.txt with corresponding data from each data file if present. If there is no ID (+data) in a data file that matched with the master_list.txt file, "NAs" should be entered.

Based on the following (example) data files:

data_file_1

chr1:23216340:T:C, 1, 0, 0, 0, 1, 0
chr1:199677414:G:A, 0, 1, 0, 0.2, 0, 0.8
chr5:28993821:C:T, 1, 0, 0, 0.1, 0.1, 0.8
chr6:108644910:C:T, 0.9, 0.1, 0, 0.75, 0.25, 0
chr7:83971303:A:T, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr9:130711613:A:G, 0.8, 0.1, 0.1, 0, 1, 0
chr10:101438297:C:A, 1, 0, 0, 0.9, 0.1, 0
chr13:91709687:T:G, 1, 0, 0, 0.25, 0.25, 0.5

data_file_2

chr1:23216340:T:C, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr3:3141592653:A:T, 0, 1, 0, 0.95, 0.05, 0
chr4:124900550:T:C, 1, 0, 0, 0.9, 0.1, 0
chr6:108644910:C:T, 1, 0, 0, 0, 1, 0
chr7:82771887:T:C, 0.85, 0.1, 0.05, 0, 0, 1
chr10:101438297:C:A, 0.5, 0.5, 0, 1, 0, 0
chr12:101848399:G:A, 0.2, 0.2, 0.6, 0, 0, 1

data_file_3

chr2:177046732:C:A, 0.5, 0.5, 0, 1, 0, 0, 0.5. 0.5, 0
chr6:108644910:C:T, 0.95, 0.05, 0, 1, 0, 0, 1, 0, 0
chr7:83971303:A:T, 0.5, 0.5, 0, 0, 0, 1, 1, 0, 0
chr7:9570668:A:G, 0.9, 0.05, 0.05, 0, 0, 1, 0, 1, 0
chr9:130711613:A:G, 1, 0, 0, 1, 0, 0, 0, 0, 1
chr13:91709687:T:G, 1, 0, 0, 0, 0, 1, 0.3, 0.3, 0.4

Desired output is:

chr1:23216340:T:C, 1, 0, 0, 0, 1, 0, 0.9, 0.1, 0, 0.7, 0.2, 0.1, NA, NA, NA, NA, NA, NA, NA, NA
chr1:199677414:G:A, 0, 1, 0, 0.2, 0, 0.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr2:177046732:C:A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.5, 0.5, 0, 1, 0, 0, 0.5. 0.5, 0
chr4:124900550:T:C, NA, NA, NA, NA, NA, NA, 1, 0, 0, 0.9, 0.1, 0, NA, NA, NA, NA, NA, NA, NA, NA
chr5:28993821:C:T, 1, 0, 0, 0.1, 0.1, 0.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr6:108644910:C:T, 0.9, 0.1, 0, 0.75, 0.25, 0 , 1, 0, 0, 0, 1, 0, 0.95, 0.05, 0, 1, 0, 0, 1, 0, 0
chr7:9570668:A:G, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.9, 0.05, 0.05, 0, 0, 1, 0, 1, 0
chr7:82771887:T:C, NA, NA, NA, NA, NA, NA, 0.85, 0.1, 0.05, 0, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA
chr7:83971303:A:T, 0.9, 0.1, 0, 0.7, 0.2, 0.1, NA, NA, NA, NA, NA, NA, 0.9, 0.1, 0, 0.7, 0.2, 0.1
chr9:130711613:A:G, 0.8, 0.1, 0.1, 0, 1, 0, NA, NA, NA, NA, NA, NA, 1, 0, 0, 1, 0, 0, 0, 0, 1
chr10:101438297:C:A, 1, 0, 0, 0.9, 0.1, 0, 0.5, 0.5, 0, 1, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA 
chr12:101848399:G:A, NA, NA, NA, NA, NA, NA, 0.2, 0.2, 0.6, 0, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA
chr13:91709687:T:G, 1, 0, 0, 0.25, 0.25, 0.5, NA, NA, NA, NA, NA, NA, 1, 0, 0, 0, 0, 1, 0.3, 0.3, 0.4

Please note that data files can have a variable number of rows and columns and that the master_list.txt file will not always contain an ID found in any particular data file. If a data file has an ID (with data), that is not found in master_list.txt, that data does not end up in the output file.

I can use:

awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_1 > file_1_output
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_2 > file_2_output
awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file_3 > file_3_output

to match data in each data file to the master list, but I get no NAs this way. If I did, I could paste the per-data-file-outputs (which would all have row numbers equal to the master_list.txt file) to get the desired output file. So how do I tweak:

awk 'NR==FNR{c[$1]++;next};c[$1] > 0' master_list.txt data_file

to populate outputs with NAs (equal to the number of columns in the data file) when an ID in the master file is not found in the data file? A script that does not require subsequent pasting of individual outputs would be even better!

question from:https://stackoverflow.com/questions/65928283/cross-reference-multiple-lists-with-master-list-to-produce-merged-output-with-mi

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
$ cat tst.awk
NR==FNR {
    ids[++numIds] = $1","
    next
}
FNR==1 { numFiles++ }
{
    id = $1
    sub(/^[^[:space:]]+[[:space:]]+/,"")
    vals[id,numFiles] = $0
    gsub(/[^[:space:],]+/,"NA")
    naVal[numFiles] = $0
}
END {
    for ( idNr=1; idNr<=numIds; idNr++) {
        id = ids[idNr]
        printf "%s%s", id, OFS
        for (fileNr=1; fileNr<=numFiles; fileNr++) {
            val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
            printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk master file1 file2 file3
chr4:124900550:T:C, NA, NA, NA, NA, NA, NA 1, 0, 0, 0.9, 0.1, 0 NA, NA, NA, NA, NA, NA, NA, NA, NA
chr7:82771887:T:C, NA, NA, NA, NA, NA, NA 0.85, 0.1, 0.05, 0, 0, 1 NA, NA, NA, NA, NA, NA, NA, NA, NA
chr2:177046732:C:A, NA, NA, NA, NA, NA, NA NA, NA, NA, NA, NA, NA 0.5, 0.5, 0, 1, 0, 0, 0.5. 0.5, 0
chr13:91709687:T:G, 1, 0, 0, 0.25, 0.25, 0.5 NA, NA, NA, NA, NA, NA 1, 0, 0, 0, 0, 1, 0.3, 0.3, 0.4
chr1:23216340:T:C, 1, 0, 0, 0, 1, 0 0.9, 0.1, 0, 0.7, 0.2, 0.1 NA, NA, NA, NA, NA, NA, NA, NA, NA
chr5:28993821:C:T, 1, 0, 0, 0.1, 0.1, 0.8 NA, NA, NA, NA, NA, NA NA, NA, NA, NA, NA, NA, NA, NA, NA
chr6:108644910:C:T, 0.9, 0.1, 0, 0.75, 0.25, 0 1, 0, 0, 0, 1, 0 0.95, 0.05, 0, 1, 0, 0, 1, 0, 0
chr7:83971303:A:T, 0.9, 0.1, 0, 0.7, 0.2, 0.1 NA, NA, NA, NA, NA, NA 0.5, 0.5, 0, 0, 0, 1, 1, 0, 0
chr7:9570668:A:G, NA, NA, NA, NA, NA, NA NA, NA, NA, NA, NA, NA 0.9, 0.05, 0.05, 0, 0, 1, 0, 1, 0
chr9:130711613:A:G, 0.8, 0.1, 0.1, 0, 1, 0 NA, NA, NA, NA, NA, NA 1, 0, 0, 1, 0, 0, 0, 0, 1
chr1:199677414:G:A, 0, 1, 0, 0.2, 0, 0.8 NA, NA, NA, NA, NA, NA NA, NA, NA, NA, NA, NA, NA, NA, NA
chr12:101848399:G:A, NA, NA, NA, NA, NA, NA 0.2, 0.2, 0.6, 0, 0, 1 NA, NA, NA, NA, NA, NA, NA, NA, NA
chr10:101438297:C:A, 1, 0, 0, 0.9, 0.1, 0 0.5, 0.5, 0, 1, 0, 0 NA, NA, NA, NA, NA, NA, NA, NA, NA

sort master first or pipe the output to sort if you want it sorted.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...