cluster analysis - Why is Spark Mllib KMeans algorithm extremely slow?

Question

Welcome To Ask or Share your Answers For Others

cluster analysis - Why is Spark Mllib KMeans algorithm extremely slow?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

cluster analysis - Why is Spark Mllib KMeans algorithm extremely slow?

I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 Million rows, 100 cols. I'm using Mllib KMeans also and it is extremely slow. The job never finishes in fact and I have to kill it. I am running this on Google cloud (dataproc). It runs if I ask for a smaller number of clusters (k=1000), but still take more than 35 minutes. I need it to run for k~5000. I have no idea why is it so slow. The data is properly partitioned given the number of workers/nodes and SVD on a 1 million x ~300,000 col matrix takes ~3 minutes, but when it comes to KMeans it just goes into a black hole. I am now trying a lower number of iterations (2 instead of 100), but I feel something is wrong somewhere.

KMeansModel Cs = KMeans.train(datamatrix, k, 100);//100 iteration, changed to 2 now. # of clusters k=1000 or 5000

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:27:36+0000

It looks like the reason is relatively simple. You use quite large k and combine it with an expensive initialization algorithm.

By default Spark is using as distributed variant of K-means++ called K-means|| (see What exactly is the initializationSteps parameter in Kmeans++ in Spark MLLib?). Distributed version is roughly O(k) so with larger k you can expect slower start. This should explain why you see no improvement when you reduce number of iterations.

Using large K is also expensive when model is trained. Spark is using a variant of Lloyds which is roughly O(nkdi).

If you expect complex structure of the data there most likely a better algorithms out there to handle this than K-Means but if you really want to stick with it you start with using random initialization.

Categories

cluster analysis - Why is Spark Mllib KMeans algorithm extremely slow?

cluster analysis - Why is Spark Mllib KMeans algorithm extremely slow?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags