Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
292 views
in Technique[技术] by (71.8m points)

pyspark - Reading a text file with multiple headers in Spark

I have a text file having multiple headers where "TEMP" column has the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame

STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060101    33.5 23
010010 99999  20060102    35.3 23
010010 99999  20060103    34.4 24
STN--- WBAN   YEARMODA    TEMP     
010010 99999  20060120    35.2 22
010010 99999  20060121    32.2 21
010010 99999  20060122    33.0 22
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
  1. You can read the text file as a normal text file in an RDD
  2. You have a separator in the text file, let's assume it's a space
  3. Then you can remove the header from it
  4. Remove all lines inequal to the header
  5. Then convert the RDD to a dataframe using .toDF(col_names)

Like this:

rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split(" ")) # step 1 & 2
headers = rdd.first() # Step 3
rdd2 = rdd.filter(lambda x: x != headers)
df = rdd2.toDF(headers) # Step 4

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...