amazon s3 - Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

Just wondering whether Parquet predicate pushdown also works on S3, not only on HDFS, specifically when using Spark outside of EMR.

Further explanation would be helpful, since the answer likely depends on how the underlying (distributed) file system behaves.
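To make the scenario concrete, here is a minimal sketch of the kind of job in question (Spark 1.6-era Scala; the bucket name, file name, and columns are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("pushdown-question"))
    val sqlContext = new SQLContext(sc)

    // Parquet filter pushdown is governed by this flag (on by default in 1.6):
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // If pushdown reaches the S3 reader, only the matching row groups and
    // the selected columns should be fetched, not the whole file.
    val df = sqlContext.read.parquet("s3a://some-bucket/events.parquet")
    df.filter(df("id") > 100L).select("id", "value").count()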


1 Reply


I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.

  • I generated some dummy data in Spark and saved it as a Parquet file both locally and on S3 (a sketch of the setup follows this list).
  • I created multiple Spark jobs with different kinds of filters and column selections, and ran each test once against the local file and once against the S3 file.
  • I then used the Spark History Server to see how much data each job read as input.
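A minimal sketch of such a test setup (Spark 1.6 API; the paths, row count, and filter value here are made up, and a sqlContext is assumed to be in scope):

    import org.apache.spark.sql.functions.rand

    // Generate dummy data and write one copy locally and one to S3.
    val df = sqlContext.range(0L, 10000000L).withColumn("value", rand())
    df.write.parquet("file:///tmp/dummy.parquet")
    df.write.parquet("s3a://some-bucket/dummy.parquet")

    // Run the same jobs against each copy; afterwards, compare each job's
    // "Input Size" in the Spark History Server.
    for (path <- Seq("file:///tmp/dummy.parquet", "s3a://some-bucket/dummy.parquet")) {
      val full = sqlContext.read.parquet(path)
      full.rdd.count()                              // baseline: reads all columns
      full.select("id").rdd.count()                 // column pruning: only "id" needed
      full.filter(full("id") < 1000L).rdd.count()   // predicate pushdown: "id" is ordered,
                                                    // so row-group stats can skip data
    }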

Results:

  • For the local Parquet file: the column selections and filters were pushed down to the read; the input size shrank whenever the job contained a filter or a column selection.
  • For the S3 Parquet file: the input size was always the same as for the job that processed all of the data. None of the filters or column selections were pushed down to the read, and the Parquet file was always loaded from S3 in its entirety, even though the query plan (.queryExecution.executedPlan) showed the filters as pushed down.
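For reference, this is roughly how that mismatch can be observed (the path is hypothetical): the physical plan reports the pushed filters, but only the History Server's input-size numbers show whether the read was actually reduced.

    val df = sqlContext.read.parquet("s3a://some-bucket/dummy.parquet")
    val filtered = df.filter(df("id") < 1000L)

    // The Parquet scan node in the physical plan lists "PushedFilters: [...]",
    // which is only a claim that the filters were handed to the reader:
    println(filtered.queryExecution.executedPlan)

    // The decisive evidence is the job's input size: compare the bytes read
    // for this job against an unfiltered full scan of the same file.
    filtered.count()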

I will add more details about the tests and results when I have time.



...