python - Getting a specific field from a chosen Row in a PySpark DataFrame

I have a Spark DataFrame built with PySpark from a JSON file:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

users_df = sqlc.read.json('users.json')

Now I want to access the data for a chosen_user, where chosen_user is the value of its _id field. I can do

users_df[users_df._id == chosen_user].show()

and this gives me the full Row for that user. But suppose I just want one specific field from the Row, say the user's gender; how would I obtain it?
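For reference, a minimal sketch of what users.json could contain; the values and the name field are hypothetical, and sqlc.read.json expects JSON Lines by default (one object per line):

{"_id": "u1", "gender": "female", "name": "Alice"}
{"_id": "u2", "gender": "male", "name": "Bob"}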


1 Reply


Just filter and select:

result = users_df.where(users_df._id == chosen_user).select("gender")

or, using col:

from pyspark.sql.functions import col

result = users_df.where(col("_id") == chosen_user).select(col("gender"))

Finally, a PySpark Row is just a tuple with some extensions, so you can, for example, flatMap:

result.rdd.flatMap(list).first()

or map with something like this:

result.rdd.map(lambda x: x.gender).first()
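
Since a Row also supports attribute and dict-style access, you can skip the RDD round-trip entirely; a minimal sketch, assuming the filter matches at most one user:

# first() returns a single Row, or None when nothing matches the filter
row = users_df.where(col("_id") == chosen_user).select("gender").first()
gender = row["gender"] if row is not None else None   # equivalently: row.gender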
