There are at least two ways to approach this, either by aliasing:
df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")
or using name-based equality joins:
// Note that the remaining (non-join) columns will end up with
// ambiguous names, so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))
df.join(df, Seq("foo"))
In general, column renaming, while the ugliest option, is the safest practice across all versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.
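For completeness, here is a minimal sketch of what the renaming approach could look like. It assumes a Spark 2+ SparkSession running locally; the sample data and the `_r` suffix are made up for illustration and are not part of the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession (Spark 2+); adjust for your environment.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-rename")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with the join column "foo".
val df = Seq((1, "a"), (2, "b")).toDF("foo", "bar")

// Rename every column on the right side (hypothetical "_r" suffix)
// so that no column name is shared between the two sides.
val right = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, s"${c}_r")
}

// All names are unique now, so the join condition can be resolved by name.
val joined = df.join(right, $"foo" === $"foo_r")
```

Since every column in `joined` has a distinct name, later `select` or `filter` calls should not hit ambiguity errors.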
Personally, I prefer using aliases because of their resemblance to idiomatic SQL and the ability to use them outside the scope of specific DataFrame objects.
Regarding performance, unless you're interested in close-to-real-time processing, there should be no performance difference whatsoever. All of these should generate the same execution plan.
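If you'd rather check this than take it on faith, you can print the plans and compare them yourself. This is only a sketch, assuming a local Spark 2+ SparkSession and made-up sample data; the exact plan output depends on the Spark version, so none is shown here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession (Spark 2+); adjust for your environment.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-plans")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with the join column "foo".
val df = Seq((1, "a"), (2, "b")).toDF("foo", "bar")

// Self-join using aliases.
val byAlias = df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")

// Self-join using a name-based equality join.
val byName = df.join(df, Seq("foo"))

// Print the physical plans to the console for a side-by-side comparison.
byAlias.explain()
byName.explain()
```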