By referencing the object containing your broadcast variable in your map
lambda, Spark will attempt to serialize the whole object and ship it to workers. Since the object contains a reference to the SparkContext, you get the error. Instead of this:
pairs = distinct_users_projected.map(lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID]))
Try this:
bcast = pt.broadcast_products_lookup_map
pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))
The latter avoids the reference to the object (pt
) so that Spark only needs to ship the broadcast variable.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…