Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
730 views
in Technique[技术] by (71.8m points)

scala - Select Specific Columns from Spark DataFrame

I have loaded CSV data into a Spark DataFrame.

I need to slice this dataframe into two different dataframes, where each one contains a set of columns from the original dataframe.

How do I select a subset into a Spark dataframe, based on columns ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you want to split you dataframe into two different ones, do two selects on it with the different columns you want.

 val sourceDf = spark.read.csv(...)
 val df1 = sourceDF.select("first column", "second column", "third column")
 val df2 = sourceDF.select("first column", "second column", "third column")

Note that this of course means that the sourceDf would be evaluated twice, so if it can fit into distributed memory and you use most of the columns across both dataframes it might be a good idea to cache it. It it has many extra columns that you don't need, then you can do a select on it first to select on the columns you will need so it would store all that extra data in memory.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...