
pyspark - When is it appropriate to use a UDF vs built-in Spark functionality?



1 Reply


It is quite simple: rely as much as possible on Spark's built-in functions, and reach for a UDF only when your transformation cannot be expressed with them.
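For example, here is a minimal sketch showing the same transformation written both ways (the column name and sample data are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; any string column behaves the same way.
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# UDF version: a black box that Spark must call once per row.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("name_upper", upper_udf("name")).show()

# Built-in version: same result, but fully visible to the optimizer.
df.withColumn("name_upper", F.upper("name")).show()
```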

UDFs cannot be optimized by Spark's Catalyst optimizer, so there is always a potential performance hit. They are expensive because they force Spark to represent data as objects in the JVM, and in PySpark a plain Python UDF additionally moves every row between the JVM and a Python worker.
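You can see this opacity directly in the query plan. A rough sketch, reusing the DataFrame and UDF from above:

```python
# The Python UDF shows up as an opaque BatchEvalPython step in the
# physical plan, which Catalyst cannot look inside or optimize.
df.withColumn("name_upper", upper_udf("name")).explain()

# The built-in function compiles to an ordinary projection that
# Catalyst is free to rearrange and optimize.
df.withColumn("name_upper", F.upper("name")).explain()
```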

Since you have also used the tag [pyspark], it might be of interest that Pandas UDFs (aka vectorized UDFs) minimize the serialization overhead of moving data between the JVM and Python: they use Apache Arrow to transfer data and Pandas to process it. You can create one with pandas_udf, and you can read more about it in the Databricks blog post Introducing Pandas UDF for PySpark, which has a dedicated section on Performance Comparison.
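A minimal sketch of such a Pandas UDF, assuming Spark 3.x (for the type-hint syntax) with PyArrow installed, and reusing the DataFrame from above:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

# A vectorized alternative to the row-at-a-time UDF above: the function
# receives and returns a whole pandas Series, transferred via Arrow.
@pandas_udf(StringType())
def upper_pandas(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("name_upper", upper_pandas("name")).show()
```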

Your peers might have used many UDFs because the built-in functions were not available in earlier versions of Spark; more functions are added with every release.

