I've been reading about clustering recently and came across a passage I don't understand. Can someone explain it?

Clustering via decision trees:
The basic idea is that we regard each data record (or point) in the
dataset as having a class, Y. We then assume that the data space is
uniformly populated with another type of points, called nonexisting
points, to which we give the class N. With the N points
added to the original data space, our problem of partitioning the
data space into data (dense) regions and empty (sparse) regions
becomes a classification problem. A decision tree algorithm can
be applied to solve the problem. However, for the technique to
work many important issues have to be addressed (see Section 2).
We now use an example to show the intuition behind the
proposed technique. Figure 1(A) gives a 2-dimensional space
containing 24 data (Y) points, which form two clusters. We
then add some uniformly distributed N points (represented by "o")
to the data space (Figure 1(B)). On the augmented dataset, we
can run a decision tree algorithm to obtain a partitioning of the
space (Figure 1(B)). The two clusters are identified.
The reason that this technique works is that if there are clusters in
the data, the data points cannot be uniformly distributed in the
entire space. By adding some uniformly distributed N points, we
can isolate the clusters because within each cluster region there
are more Y points than N points. The decision tree technique is
well known for this task.
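The construction in the passage (add uniform N points, then treat dense-vs-sparse partitioning as Y-vs-N classification) can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the cluster locations, point counts, and tree depth are made up for the example and are not from the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two dense clusters of real (Y) points in a 2-D space,
# analogous to the 24 data points in Figure 1(A).
cluster1 = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(12, 2))
cluster2 = rng.normal(loc=[7.0, 7.0], scale=0.3, size=(12, 2))
Y_points = np.vstack([cluster1, cluster2])

# Uniformly distributed nonexisting (N) points over the whole space,
# analogous to the "o" points added in Figure 1(B).
N_points = rng.uniform(low=0.0, high=9.0, size=(24, 2))

# Augmented dataset: partitioning into dense and sparse regions
# is now an ordinary two-class classification problem.
X = np.vstack([Y_points, N_points])
labels = np.array(["Y"] * len(Y_points) + ["N"] * len(N_points))

# A decision tree cuts the space into axis-aligned regions; inside a
# cluster region Y points outnumber N points, so that region is
# labeled Y, and the sparse remainder is labeled N.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

# Querying the tree at the two cluster centers and at an empty spot
# between them shows which regions it considers dense.
print(tree.predict([[2.0, 2.0], [7.0, 7.0], [4.5, 4.5]]))
```

The leaves predicting Y correspond to the dense (cluster) regions, so reading the cluster boundaries off the tree is just reading the split thresholds along each root-to-leaf path.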