Edit number Infinity: Faq 1.12 exactly answers your question: (Also useful/relevant is FAQ 1.13, not pasted here).
1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y)1 does both ways at the same time. The number of rows of X[Y] and Y[X] usually dier; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2])
, but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)]
, data.table
automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)]
quicker to program and quicker to run than a merge followed by a subset?
Old answer which did nothing to answer the OP's question (from OP's comment), retained here because I believe it does).
When you give a value for j
like d[, 4]
or d[, value]
in data.table
, the j
is evaluated as an expression
. From the data.table FAQ 1.1 on accessing DT[, 5]
(the very first FAQ) :
Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.
The first thing, therefore, to understand is, in your case:
d[, value] # produces a "vector"
# [1] 2 3 4 5 6
This is not different when the query for i
is a basic indexing like:
d[3, value] # produces a vector of length 1
# [1] 4
However, this is different when i
is by itself a data.table
. From data.table
introduction (page 6):
d[J(3)] # is equivalent to d[data.table(a = 3)]
Here, you are performing a join
. If you just do d[J(3)]
then you'd get all columns corresponding to that join. If you do,
d[J(3), value] # which is equivalent to d[J(3), list(value)]
Since you say this answer does nothing to answer your question, I'll point where the answer to your "rephrased" question, I believe, lies: ---> then you'd get just that column, but since you're performing a join, the key column will also be output'd (as it's a join between two tables based on the key column).
Edit: Following your 2nd edit, If your question is why so?, then I'd reluctantly (or rather ignorantly) answer, Matthew Dowle designed so to differentiate between a data.table join-based-subset
and a index-based-subset
ting operation.
Your second syntax is equivalent to:
d[J(3)][, value] # is equivalent to:
dd <- d[J(3)]
dd[, value]
where, again, in dd[, value]
, j
is evaluated as an expression and therefore you get a vector.
To answer your 3rd modified question: for the 3rd time, it's because it is a JOIN between two data.tables based on the key column. If I join two data.table
s, I'd expect a data.table
From data.table
introduction, once again:
Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.