Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

cluster analysis - Running DBSCAN in ELKI

I am trying to cluster some geospatial data. I previously tried the Weka library, but after finding this benchmark I decided to try ELKI.

Despite the advice not to use ELKI as a Java library (the Java API is supposed to be less well maintained than the UI), I incorporated it into my application, and I am quite happy with the results. The data structures it uses are far more efficient than Weka's, and the option of using a spatial index is definitely a plus.

However, when I compare the results of Weka's DBSCAN with those of ELKI's DBSCAN, I am puzzled. I would accept that different implementations can produce slightly different results, but a difference of this magnitude makes me think something is wrong with the algorithm (or, more probably, with my code). The number of clusters and their geometry differ greatly between the two implementations.

For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:

minpts=50 epsilon=0.008

I coded two DBSCAN functions (one for Weka, one for ELKI). Both take a CSV file of points as input, and both pass their output to the same function, which calculates the concave hull of a set of points (one hull per cluster). Since the function that reads the CSV file into an ELKI "database" is relatively simple, I think my problem could be:

a) in the parameterization of the algorithm; or b) in reading the results (most likely).

Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:

ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);

Reading the result is a bit more challenging, as I don't completely understand how the structure that stores the clusters is organized. My idea is to iterate over each cluster, get its list of points, and pass that list to the function that calculates the concave hull, in order to generate a polygon.

    // Collect every clustering contained in the result.
    ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
    for (Clustering<?> c : cs) {
        System.out.println("clusters: " + c.getAllClusters().size());
        for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
            // Noise is reported as a cluster of its own; skip it.
            if (cluster.isNoise()) {
                continue;
            }
            Coordinate[] ptList = new Coordinate[cluster.size()];
            int ct = 0;
            for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
                // dataMap is keyed by the string form of the DBID.
                ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
                ++ct;
            }
            // There are no "empty" clusters.
            assertTrue(ptList.length > 0);

            GeoPolygon poly = getBoundaryFromCoordinates(ptList);
            // Compare strings with equals(); == only tests reference identity.
            if ("Polygon".equals(poly.getCoordinates().getGeometryType())) {
                try {
                    out.write(poly.getCoordinates().toText() + "\n");
                } catch (IOException e) {
                    e.printStackTrace();
                }
            } else {
                System.out.println(poly.getCoordinates().getGeometryType());
            }
        }
    }

I noticed that the "noise" was coming up as a cluster, so I ignored this cluster (I don't want to draw it). I am not sure whether this is the right way to read the clusters, as I could not find many examples. I also have some questions for which I have not found answers yet:

  • What is the difference between getAllClusters() and getTopLevelClusters()?
  • Are the DBSCAN clusters "nested", i.e.: can we have points that belong to many clusters at the same time? Why?
  • I read somewhere that we should not use the database IDs to identify the points, as they are for ELKI's internal use, but what other way is there to get the list of points in each cluster? I read that you can use a relation for the labels, but I am not sure how to actually implement this...

Any comments that could point me in the right direction, or any code suggestions to iterate over the result set of ELKI's DBSCAN would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save that for another post.

1 Reply

This is mostly a follow-up to @Anony-Mousse, who gave a pretty complete answer.

  • getTopLevelClusters() and getAllClusters() do the same for DBSCAN, as DBSCAN does not produce hierarchical clusters.
  • DBSCAN clusters are disjoint. Treating clusters with isNoise()==true as singleton objects is likely the best way to handle noise. Clusters returned by our OPTICSXi implementation are also disjoint, but you should consider the members of all child clusters to be part of the outer cluster. For convex hulls, an efficient approach is to first compute the convex hulls of the child clusters; then, for the parent, compute the convex hull of its additional objects plus the hull points of all children.
  • The RangeDBIDs approach mentioned by @Anony-Mousse is pretty clean for static databases. A clean approach that also works with dynamic databases is to have an additional relation that identifies the objects. When using a CSV file as input, instead of relying on the line numbering to be consistent, you would just add a non-numeric column containing labels, e.g. object123. This is the best approach from a logical point of view: if you want to be able to identify objects, give them a unique identifier. ;-)
  • We use ELKI for teaching, and we have verified its DBSCAN algorithm very carefully (you can find a step-by-step DBSCAN demonstration here, and the ELKI results match it exactly). The DBSCAN and OPTICS code in Weka was contributed by a student a long time ago, and has never been verified to the same extent. From a quick check, Weka does not produce the correct results on our class exercise data set.
  • Because the exercise data set has the same extent of 10 in each dimension, we can scale the epsilon parameter by 1/10, and then the Weka result seems to match the solution; so @Anony-Mousse's finding appears to be correct: Weka's implementation enforces a [0;1] scaling on the data.
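
The scaling effect described in the last point can be sketched in plain Java. This is only an illustration, not code taken from ELKI or Weka: the class EpsilonScaling and its methods are hypothetical names, and it assumes (as the observation above suggests) that every attribute is min-max normalized to [0;1] before Euclidean distances are computed.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ELKI or Weka.
final class EpsilonScaling {

    /** Rescales every dimension independently to [0, 1] (min-max normalization). */
    static double[][] minMaxNormalize(double[][] points) {
        int dim = points[0].length;
        double[] min = new double[dim];
        double[] max = new double[dim];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] p : points) {
            for (int d = 0; d < dim; d++) {
                min[d] = Math.min(min[d], p[d]);
                max[d] = Math.max(max[d], p[d]);
            }
        }
        double[][] scaled = new double[points.length][dim];
        for (int i = 0; i < points.length; i++) {
            for (int d = 0; d < dim; d++) {
                double extent = max[d] - min[d];
                // A constant dimension has zero extent; map it to 0.
                scaled[i][d] = extent == 0 ? 0 : (points[i][d] - min[d]) / extent;
            }
        }
        return scaled;
    }

    /** Epsilon on normalized data; valid only if all dimensions share one extent. */
    static double scaleEpsilon(double epsilon, double extent) {
        return epsilon / extent;
    }

    /** Plain Euclidean distance between two points of equal dimensionality. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy data with an extent of 10 in each dimension.
        double[][] pts = {{0, 0}, {10, 10}, {3, 4}};
        double[][] scaled = minMaxNormalize(pts);
        System.out.println("original distance:   " + euclidean(pts[0], pts[2]));
        System.out.println("normalized distance: " + euclidean(scaled[0], scaled[2]));
        System.out.println("scaled epsilon:      " + scaleEpsilon(5.0, 10.0));
    }
}
```

Dividing epsilon by a single factor only works when all dimensions have the same extent, as in the exercise data. If the extents differ, which may well be the case for geospatial data, no single factor reproduces the normalized distances, and the fair comparison is to normalize the data itself before running both implementations with the same epsilon.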
