Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.5k views
in Technique[技术] by (71.8m points)

lapply - R funtion to merge rows by id and create separate columns

I have a list of articles obtained from an API, with my dataframe looking like this:

PMID        Year     Title                  Journal         Author 
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles
33326729    2020     Avelumab Maintenance   PLoS biology    B., Huang
33326729    2020     Avelumab Maintenance   PLoS biology    A., Di Pietro

Which I need to merge to this:

PMID        Year     Title                  Journal         Author-1         Author-2     Author-3
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles       B., Huang    A., Di Pietro

So basically, I need the articles to have the authors merged in a single row. I thought of sorting by id by doing as follows:

test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

Outputs:
33326729    2020,2020,2020     Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance   PLoS biology,PLoS biology,PLoS biology    T., Powles,B., Huang,A., Di Pietro

However, that generates the data with commas instead of separate columns. Does anyone know of a different function or how to adapt the setDT function in order to get my intended result? Thanks in advance

Edit: As requested the output of dput(head(PubMed_df)) :

structure(list(pmid = c("33326729", "33326729", "33326729", "33320856", 
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021", 
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles", 
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas", 
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs", 
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom", 
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom", 
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine", 
"The New England journal of medicine", "PLoS biology", "PLoS biology", 
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018", 
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030", 
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))

Edit 2: Highly detailed and specific request:

I need to get the data of which the head is presented above into a form in which every row has: PMID | Date of publication | Author 1 | Affiliation | Address | City | State (if US) | Country | Author 2 | Affiliation of Author 2 | Address | City | State (if US) | Country | And so on for each co-Author | Journal | Title | Abstract* | MH term

I will have to break up the addresses but thats something ill be focusing on later. For now my goal is to just get all the info for every author added to the right article without having 3 rows of the same article.

Edit 2 - To be used to get the answer from @r2evans to work in my case: The answer provided works if you use dcast as data.table::dcast!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is mostly a dupe from Rui's comment, but it helps to add a help-column to get it (I'll use row here). Since you started using data.table, I'll stick with that.

Edited to work with the updated data. (I'm assuming that pmid uniquely defines the groups.)

library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]

And in über-wide format:

dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
       pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1                               address_2                               address_3
     <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>                                  <char>                                  <char>
1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns   Thompson     Acosta  Martin A M      Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ...                      Pfizer, Groton, CT                    Pfizer, Milan, Italy

Realize that when you have papers with fewer authors than the max number of authors in your dataset, they will have empty/NA columns. For instance, if I remove rows 5-6 and do the same,

PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1          address_2            address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>             <char>               <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns       <NA>       <NA>  Martin A M        <NA>        <NA> MRC Human Genetics Unit, MRC Institu...               <NA>                 <NA>
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...