Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

data.table - Applying same factor levels to multiple variables with differing amount of levels in R

I have a data.table with 168 variables and 8,278 observations. Variables 69:135 are currently stored as strings. They should become region dummies, and I want to end up with the levels 2 (= yes, the company operates here) and 1 (= no, the company does not operate here). The issue is that the original variables contain three different combinations of values: 1) "TRUE", "1", "0", "FALSE"; 2) "TRUE", "FALSE"; and 3) "1", "0". Moreover, about 5 variables have only one value, either "0" or "1". An example is given here:

# generating reproducible data
library(data.table)

dt <- data.table(
  region1 = c("TRUE", "FALSE", "0", "1", NA),
  region2 = c("1", "1", "0", NA, NA),
  region3 = c(NA, "FALSE", "TRUE", NA, "FALSE"),
  region4 = c(NA, "0", "0", NA, "0")
)

#this gives:
#   region1 region2 region3 region4
#1    TRUE       1    <NA>    <NA>
#2   FALSE       1   FALSE       0
#3       0       0    TRUE       0
#4       1    <NA>    <NA>    <NA>
#5    <NA>    <NA>   FALSE       0

I'm searching for a way to replace "TRUE" and "1" by 2 and "FALSE" and "0" by 1 for all the variables at once. So the desired result is:

#   region1 region2 region3 region4
#1:       2       2      NA      NA
#2:       1       2       1       1
#3:       1       1       2       1
#4:       2      NA      NA      NA
#5:      NA      NA       1       1

I have already looked at "Apply factor levels to multiple columns with missing factor levels" and "Change level of multiple factor variables", but neither of them helped.

I tried the following using a nested ifelse() command:

library(data.table)
library(forcats)

check <- cbind(dt[1:68], as.data.table(apply(dt[69:135], 2, function(x) {
  ifelse("1" %in% x & "TRUE" %in% x,
         fct_collapse(x,
                      "2" = c("TRUE",
                              "1"),
                      "1" = c("FALSE",
                              "0")
         ),
         ifelse("1" %in% x & !("TRUE" %in% x),
                fct_collapse(x,
                             "2" = "1",
                             "1" = "0"),
                fct_collapse(x,
                             "2" = "TRUE",
                             "1" = "FALSE"
                )))
}
)), dt[136:168]) 

However, this code does not give me the desired result. It runs, but it produces warnings, and when I check the respective variables they are still stored as strings with their original values.

# examples of warnings
1: Unknown levels in `f`: TRUE, FALSE
2: Unknown levels in `f`: TRUE, FALSE
3: Unknown levels in `f`: TRUE, FALSE
4: Unknown levels in `f`: 0
5: Unknown levels in `f`: TRUE, FALSE
6: Unknown levels in `f`: TRUE, FALSE
7: Unknown levels in `f`: 0

On their own, that is, when not combined with fct_collapse, the nested ifelse() commands do the job:

#the ifelse statement works
ifelse("TRUE" %in% dt$region1, 2, "FALSE")
ifelse(5 %in% dt$region1, 2, "FALSE")

#also the nested ifelse statement works
ifelse("1" %in% dt$region1 & "TRUE" %in% dt$region1,
              0,
              ifelse("1" %in% dt$region1 & !("TRUE" %in% dt$region1),
                     1,
                     2
              ))


ifelse("1" %in% dt$region2 & "TRUE" %in% dt$region2,
               0,
               ifelse("1" %in% dt$region2 & !("TRUE" %in% dt$region2),
                      1,
                      2
               ))
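
One thing worth noting about the calls above: each test is a single TRUE or FALSE, and ifelse() returns a result with the same length as its test. Every call therefore yields exactly one value, not one value per row, whatever is passed as the "yes" or "no" argument. A minimal base R sketch (the vector `x` and the replacement values are made up for illustration):

```r
x <- c("TRUE", "FALSE", "0", "1", NA)

# The test "1" %in% x has length one, so only the first element
# of the "yes" vector is returned
res <- ifelse("1" %in% x, c("a", "b", "c"), "z")
length(res)

# A vectorised test, by contrast, gives one result per element
res_vec <- ifelse(x %in% c("TRUE", "1"), 2L, 1L)
length(res_vec)
```

Note that the vectorised version maps NA to 1, because `%in%` never returns NA; missing values would need separate handling.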

Does anyone have an idea how to solve this?

Thanks so much for any advice in advance!

question from:https://stackoverflow.com/questions/65645399/applying-same-factor-levels-to-multiple-variables-with-differing-amount-of-level


1 Reply

Here is a way with set() called in a for loop.

library(data.table)

# Recode "TRUE"/"1" to 2 and "FALSE"/"0" to 1; NA stays NA
f <- function(x){
  x <- as.character(x)
  i1 <- x %in% c("TRUE", "1")   # "yes" values
  i0 <- x %in% c("FALSE", "0")  # "no" values
  x[i1] <- "2"
  x[i0] <- "1"
  as.integer(x)
}

for (j in seq_along(dt)) set(dt, j = j, value = f(dt[[j]]))

dt
#   region1 region2 region3 region4
#1:       2       2      NA      NA
#2:       1       2       1       1
#3:       1       1       2       1
#4:       2      NA      NA      NA
#5:      NA      NA       1       1
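
The loop above converts every column of dt. In the question only variables 69:135 need recoding, so the loop can be limited to those positions. A minimal sketch on toy data, assuming a hypothetical table `toy` with an extra `id` column that must stay untouched, and `cols <- 2:3` standing in for 69:135:

```r
library(data.table)

# Same recoding idea as the answer's f()
f <- function(x) {
  x <- as.character(x)
  i1 <- x %in% c("TRUE", "1")
  i0 <- x %in% c("FALSE", "0")
  x[i1] <- "2"
  x[i0] <- "1"
  as.integer(x)
}

toy <- data.table(
  id      = 1:5,                               # hypothetical column, left as is
  region1 = c("TRUE", "FALSE", "0", "1", NA),
  region2 = c("1", "1", "0", NA, NA)
)

cols <- 2:3  # stand-in for 69:135 in the real data
for (j in cols) set(toy, j = j, value = f(toy[[j]]))
toy
```

Because set() modifies the table by reference, no cbind() of the untouched columns is needed.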

Thanks to jangorecki's comment, a much simpler way is:

dt[, names(dt) := lapply(dt, f)]
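
If only some columns should be converted, the same one-liner can be combined with `.SDcols`. The sketch below is an illustration under assumptions: `toy` and `cols` are hypothetical names, and the final factor() step with the labels "no" and "yes" is just one possible way to get identical levels in every column, including columns where only one value occurs.

```r
library(data.table)

# Same recoding idea as the answer's f()
f <- function(x) {
  x <- as.character(x)
  i1 <- x %in% c("TRUE", "1")
  i0 <- x %in% c("FALSE", "0")
  x[i1] <- "2"
  x[i0] <- "1"
  as.integer(x)
}

toy <- data.table(
  id      = 1:5,                                  # hypothetical column, left as is
  region3 = c(NA, "FALSE", "TRUE", NA, "FALSE"),
  region4 = c(NA, "0", "0", NA, "0")              # only one distinct value
)

cols <- c("region3", "region4")  # in the real data: names(dt)[69:135]
toy[, (cols) := lapply(.SD, f), .SDcols = cols]

# Optional: same two factor levels for every column,
# even where the value 2 never occurs (region4)
toy[, (cols) := lapply(.SD, factor, levels = 1:2, labels = c("no", "yes")),
    .SDcols = cols]
levels(toy$region4)
```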
