Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to filter pyspark dataframes

I have seen many questions related to filtering pyspark dataframes, but despite my best efforts I haven't been able to get any of the non-SQL solutions to work. My dataframe looks like this:

+----------+-------------+-------+--------------------+--------------+---+
|purch_date|  purch_class|tot_amt|       serv-provider|purch_location| id|
+----------+-------------+-------+--------------------+--------------+---+
|03/11/2017|Uncategorized| -17.53|             HOVER  |              |  0|
|02/11/2017|    Groceries| -70.05|1774 MAC'S CONVEN...|     BRAMPTON |  1|
|31/10/2017|Gasoline/Fuel|    -20|              ESSO  |              |  2|
|31/10/2017|       Travel|     -9|TORONTO PARKING A...|      TORONTO |  3|
|30/10/2017|    Groceries|  -1.84|         LONGO'S # 2|              |  4|
+----------+-------------+-------+--------------------+--------------+---+

This did not work:

df1 = spark.read.csv("/some/path/to/file", sep=',') \
            .filter((col('purch_location')=='BRAMPTON')

And this did not work:

df1 = spark.read.csv("/some/path/to/file", sep=',') \
            .filter(purch_location == 'BRAMPTON')

This (a SQL expression passed as a string) works but takes a VERY long time; I imagine there's a faster non-SQL approach:

df1 = spark.read.csv("/some/path/to/file", sep=',') \
            .filter("purch_location == 'BRAMPTON'")

UPDATE: I should mention that I am able to use methods like the following (which run faster than the SQL expression):

df1 = spark.read.csv("/some/path/to/file", sep=',')
df2 = df1.filter(df1.purch_location == "BRAMPTON")

But I want to understand why the "pipe" / line-continuation syntax is incorrect.

question from:https://stackoverflow.com/questions/65623336/how-to-filter-pyspark-dataframes


1 Reply


If you insist on using the backslash, you can do:

from pyspark.sql.functions import col

df = spark.read.csv('/some/path/to/file', sep=',') \
     .filter(col('purch_location') == 'BRAMPTON')

Your first attempt failed because the parentheses are not balanced: `.filter((col('purch_location')=='BRAMPTON')` opens two parentheses but closes only one.
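The continuation rule itself can be checked with a plain-Python sketch (no Spark needed): a chained call on a new line needs either enclosing parentheses or a trailing backslash; otherwise Python treats the first line as a complete statement.

```python
# A statement cannot continue onto a new line that starts with ".method(...)":
try:
    compile("x = [1, 2]\n    .count(1)\n", "<example>", "exec")
except SyntaxError:
    print("bare continuation: SyntaxError")

# Wrapping the whole expression in parentheses makes the line split legal:
compile("x = ([1, 2]\n     .count(1))\n", "<example>", "exec")

# A trailing backslash at the end of the first line also works:
compile("x = [1, 2] \\\n    .count(1)\n", "<example>", "exec")
print("parenthesized and backslash forms both compile")
```

This is why the chained `spark.read.csv(...)` / `.filter(...)` calls must either be wrapped in one pair of outer parentheses or joined with a backslash.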

Also, it seems there are trailing spaces after the string BRAMPTON in the data, so you might want to trim the column first:

from pyspark.sql.functions import col, trim

df = spark.read.csv('/some/path/to/file', sep=',') \
     .filter(trim(col('purch_location')) == 'BRAMPTON')
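In plain-Python terms (a sketch of just the comparison, not Spark code), the trailing spaces are why the untrimmed equality fails:

```python
# Hypothetical raw cell value as loaded from the CSV, with a trailing space:
raw_value = "BRAMPTON "

# Direct equality fails because of the padding:
print(raw_value == "BRAMPTON")          # False

# Stripping first is what trim(col('purch_location')) does row by row:
print(raw_value.strip() == "BRAMPTON")  # True
```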
