I have seen many questions related to filtering PySpark DataFrames, but despite my best efforts I haven't been able to get any of the non-SQL solutions to work. A sample of the data:
+----------+-------------+-------+--------------------+--------------+---+
|purch_date|  purch_class|tot_amt|       serv-provider|purch_location| id|
+----------+-------------+-------+--------------------+--------------+---+
|03/11/2017|Uncategorized| -17.53|               HOVER|              |  0|
|02/11/2017|    Groceries| -70.05|1774 MAC'S CONVEN...|      BRAMPTON|  1|
|31/10/2017|Gasoline/Fuel|    -20|                ESSO|              |  2|
|31/10/2017|       Travel|     -9|TORONTO PARKING A...|       TORONTO|  3|
|30/10/2017|    Groceries|  -1.84|         LONGO'S # 2|              |  4|
+----------+-------------+-------+--------------------+--------------+---+
This did not work:
df1 = spark.read.csv("/some/path/to/file", sep=',')
.filter((col('purch_location')=='BRAMPTON')
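Three things likely go wrong in that attempt: the parentheses are unbalanced, col has to be imported from pyspark.sql.functions, and a bare newline ends a Python statement, so a leading .filter(...) on the next line is a SyntaxError. A minimal sketch of a corrected version, assuming the same hypothetical path and that the file has a header row so the column names from the sample exist:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Wrapping the whole chain in parentheses lets Python continue the
# statement across lines; header=True is an assumption here so that
# the column names from the sample above are available.
df1 = (
    spark.read.csv("/some/path/to/file", sep=',', header=True)
    .filter(col('purch_location') == 'BRAMPTON')
)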
And this did not work:
df1 = spark.read.csv("/some/path/to/file", sep=',')
.filter(purch_location == 'BRAMPTON')
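This one fails for a different reason: purch_location by itself is just an undefined Python variable, so it raises a NameError before Spark is ever involved. A minimal sketch of the three equivalent ways to reference a column, assuming df1 has already been read with named columns:

from pyspark.sql.functions import col

df2 = df1.filter(col('purch_location') == 'BRAMPTON')   # functions.col
df2 = df1.filter(df1.purch_location == 'BRAMPTON')      # attribute access
df2 = df1.filter(df1['purch_location'] == 'BRAMPTON')   # bracket access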
This (a SQL expression string) works, but it takes a VERY long time; I imagine there's a faster non-SQL approach:
df1 = spark.read.csv("/some/path/to/file", sep=',')
.filter("purch_location == 'BRAMPTON'")
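For what it's worth, a SQL expression string and a Column expression are both compiled by Catalyst into the same plan, so a large timing gap is more likely the CSV scan itself than the filter syntax. One way to check is to compare the output of explain(), as in this sketch:

# Both filters should show the same physical plan; if they do, the
# timing difference comes from elsewhere (e.g. re-reading the CSV).
df1.filter("purch_location == 'BRAMPTON'").explain()
df1.filter(col('purch_location') == 'BRAMPTON').explain()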
UPDATE: I should mention that I am able to use methods like the following, which run faster than the SQL expression:
df1 = spark.read.csv("/some/path/to/file", sep=',')
df2 = df1.filter(df1.purch_location == "BRAMPTON")
But I want to understand why the "pipe" / chained-line syntax is incorrect.
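The chained style itself is fine; Python just only continues a statement across lines inside an open bracket or after a backslash. A minimal sketch of both forms, reusing the df1 from the update above:

# Implicit continuation: the open parenthesis keeps the statement alive.
df2 = (
    df1
    .filter(df1.purch_location == 'BRAMPTON')
)

# Explicit continuation with a backslash also works, though the
# parenthesized form is generally preferred.
df2 = df1 \
    .filter(df1.purch_location == 'BRAMPTON')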
question from:
https://stackoverflow.com/questions/65623336/how-to-filter-pyspark-dataframes