More effective merging of matched column with duplicates in data.table


Question Description

I’ve got two data.tables, both of which share one variable; I’m trying to add a variable that’s missing from the second, but which is tied one-for-one to the shared variable.

This is clearly a merge, but because the shared variable has multiple instances, I’m having to use what feels like a workaround to merge the new variable.

Let’s get specific.

library(data.table)

x <- data.table(let = rep(letters[1:3], 2:4),
                num = rep(1:3, 2:4), other = rnorm(9))
y <- data.table(let = rep(c("a", "c"), c(10, 6)))

   let num       other
1:   a   1 -0.41695882
2:   a   1 -0.59875888
3:   b   2 -0.19433915
4:   b   2  0.58406046
5:   b   2 -0.33922321
6:   c   3 -0.63076561
7:   c   3  1.06987710
8:   c   3  0.08869372
9:   c   3 -1.31196123

    let
 1:   a
 2:   a
 3:   a
 4:   a
 5:   a
 6:   a
 7:   a
 8:   a
 9:   a
10:   a
11:   c
12:   c
13:   c
14:   c
15:   c
16:   c

I just want to add the num column to y; since num is matched one-to-one with let, it doesn't really matter that there are duplicates.
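The one-to-one assumption can be checked directly before merging. This is only a minimal sketch: n_distinct and one_to_one are illustrative names that do not appear in the original question, and x is rebuilt here so the block runs on its own.

```r
library(data.table)

# Same shape as x in the question; the values of `other` do not matter here.
x <- data.table(let = rep(letters[1:3], 2:4),
                num = rep(1:3, 2:4), other = rnorm(9))

# Count the distinct num values inside each let group.
n_distinct <- x[, list(n = length(unique(num))), by = let]

# TRUE only if every let maps to exactly one num.
one_to_one <- all(n_distinct$n == 1L)
```

If one_to_one were FALSE, dropping duplicates of let before the join would silently pick whichever num happens to come first in each group.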

Here's an approach that works; I've just got a feeling there's something simpler.

setkey(x, let)
setkey(y, let)

y <- x[!duplicated(let), c("let", "num"), with = FALSE][y]

Practice As Follows

The only improvements I can think of are:

  1. You could skip the setkey(x, let) part

  2. You could also update y by reference (rather than creating a copy using <- and then assigning back to y)
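To illustrate what "by reference" means in the second point, here is a small sketch on a toy table. The names dt, alias, cols_dt and cols_alias are hypothetical and unrelated to x and y.

```r
library(data.table)

dt <- data.table(a = 1:3)   # hypothetical toy table
alias <- dt                 # plain assignment: a second name for the same table, not a copy

dt[, b := a * 2L]           # adds column b in place; no `dt <- ...` is needed

# Both names see the new column, because := changed the one underlying table.
cols_dt <- names(dt)
cols_alias <- names(alias)
```

When an independent table is wanted, copy(dt) makes one explicitly.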

If you are using the current stable version of data.table (v <= 1.9.4), you will have to use allow.cartesian = TRUE:

y[x[!duplicated(let)], num := i.num, allow.cartesian = TRUE][]

Alternatively, you could use unique instead of duplicated (both have data.table methods):

y[unique(x, by = "let"), num := i.num, allow.cartesian = TRUE]

Here's another possibility using the new by = .EACHI feature, although it isn't needed here; I show it only to introduce the feature. Have a look at this post for a detailed explanation of what it does and when it's useful.

y[x, num := unique(i.num), by = .EACHI, allow.cartesian = TRUE]

Edit: (Thanks to @Arun for pointing this out)

We shouldn't need the allow.cartesian argument here, as there are no duplicates in i. In fact, this is a bug (#742) that has been fixed in the current development version (1.9.5). So you just need to do:

y[x[!duplicated(let)], num := i.num]
# or
y[unique(x, by = "let"), num := i.num]
# or (though not recommended in this specific case)
y[x, num := unique(i.num), by = .EACHI]
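As a sanity check, the sketch below rebuilds the example tables and confirms that the first two forms give the same num column. The third call is not from the answer above: it assumes data.table 1.9.6 or later, where the on argument names the join column so that no key is needed at all. The names y1, y2, y3 and same_result are illustrative.

```r
library(data.table)

x <- data.table(let = rep(letters[1:3], 2:4),
                num = rep(1:3, 2:4), other = rnorm(9))
y <- data.table(let = rep(c("a", "c"), c(10, 6)))

# := modifies its table in place, so each variant gets its own copy of y.
y1 <- copy(y); setkey(y1, let)
y2 <- copy(y); setkey(y2, let)
y3 <- copy(y)  # deliberately left unkeyed

y1[x[!duplicated(let)], num := i.num]
y2[unique(x, by = "let"), num := i.num]

# Assumption: data.table >= 1.9.6, where on = is available.
y3[unique(x, by = "let"), on = "let", num := i.num]

same_result <- identical(y1$num, y2$num) && identical(y1$num, y3$num)
```

Rows of x with let equal to "b" have no match in y and are simply ignored by the update.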
