Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
418 views
in Technique[技术] by (71.8m points)

cluster analysis - Clustering 500,000 geospatial points in python

I'm currently faced with the problem of finding a way to cluster around 500,000 latitude/longitude pairs in python. So far I've tried computing a distance matrix with numpy (to pass into the scikit-learn DBSCAN) but with such a large input it quickly spits out a Memory Error.

The points are stored in tuples containing the latitude, longitude, and the data value at that point.

In short, what is the most efficient way to spatially cluster a large number of latitude/longitude pairs in python? For this application, I'm willing to sacrifice some accuracy in the name of speed.

Edit: The number of clusters for the algorithm to find is unknown ahead of time.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Older versions of DBSCAN in scikit learn would compute a complete distance matrix.

Unfortunately, computing a distance matrix needs O(n^2) memory, and that is probably where you run out of memory.

Newer versions (which version do you use?) of scikit learn should be able to work without a distance matrix; at least when using an index. At 500.000 objects, you do want to use index acceleration, as this reduces runtime from O(n^2) to O(n log n).

I don't know how well scikit learn supports geodetic distance in its indexes though. ELKI is the only tool I know that can use R*-tree indexes for accelerating geodetic distance; making it extremely fast for this task (in particular when bulk-loading the index). You should give it a try.

Have a look at the Scikit learn indexing documentation, and try setting algorithm='ball_tree'.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...