Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
408 views
in Technique[技术] by (71.8m points)

data.table in R - multiple filters using multiple keys - binary search

I don't understand how I can filter based on multiple keys in data.table. Take the built-in mtcars dataset.

DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

Following the vignette, I know that if I want to have filtering that corresponds to am == 1 & gear == 4 & carb == 4, I can say

> DT[.(1, 4, 4)]
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1:  21   6  160 110  3.9 2.620 16.46  0  1    4    4
2:  21   6  160 110  3.9 2.875 17.02  0  1    4    4

and it gives the correct result. Furthermore, if I want to have am == 1 & gear == 4 & (carb == 4 | carb == 2), this also works

> DT[.(1, 4, c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

However, when I want to have am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2), the plausible

> DT[.(1, c(3, 4), c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1:   NA  NA    NA  NA   NA    NA    NA NA  1    3    4
2: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

fails. Could you please explain to me what is the right approach here?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The reason you didn't get an error from your query is that data.table will reuse values when they're multiples of other values. In other words, because the 1 for am can be used 2 times, it does this without telling you. If you were to do a query where the number of allowable values weren't multiples of each other then it would give you a warning. For example

DT[.(c(1,0),c(5,4,3),c(8,6,4))]

will give you a warning complaining about a remainder of 1 item, the same error you would see when typing data.table(c(1,0),c(5,4,3),c(8,6,4)). Whenever merging X[Y], both X and Y should be thought of as data.tables.

If you instead use CJ,

DT[CJ(c(1,0),c(5,4,3),c(8,6,4))]

then it will make every combination of all the values for you and data.table will give the results you expect.

From the vignette (bolding is mine):

What’s happening here? Read this again. The value provided for the second key column “MIA” has to find the matching vlaues in dest key column on the matching rows provided by the first key column origin. We can not skip the values of key columns before. Therfore we provide all unique values from key column origin. “MIA” is automatically recycled to fit the length of unique(origin) which is 3.

Just for completeness, the vector scan syntax will work without using CJ

DT[am == 1 & gear == 4 & carb == 4]

or

DT[am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2)]

How do you know if you need a binary search? If the speed of subsetting is unbearable then you need a binary search. For example, I've got a 48M row data.table I'm playing with and the difference between a binary search and a vector is staggering relative to one another. Specifically a vector scan takes 1.490 seconds in elapsed time but a binary search only takes 0.001 seconds. That, of course, assumes that I've already keyed the data.table. If I include the time it takes to set the key then the combination of setting the key and performing the subset is 1.628. So you have to pick your poison


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...