The Partition-Based Solution

The partition-based algorithm addresses the time complexity of the simple algorithms described previously. The idea is to first partition the data space, and then prune each partition as soon as it can be determined that it contains no outlier. After that, only a subset of the original partitions, holding far fewer points, remains, and these go through an algorithm similar to the one described in the previous page. Because the pre-processing step of finding candidate partitions works at the granularity of whole partitions, its overhead is quite small. The steps of the algorithm are:

1. Generate partitions.
2. Compute bounds on Dk for the points in each partition.
3. Identify candidate partitions that may contain outliers.
4. Compute outliers from the points in the candidate partitions.

1. Generating Partitions: Partitions are generated so that points that are close together end up in the same partition; an appropriate clustering algorithm is used to obtain them. As our main concern is time complexity, the algorithm uses BIRCH, described in [ZRL96], because its time complexity is linear in the input size. For a detailed description of BIRCH, please refer to [ZRL96] and [Sam89]. The algorithm can be tuned to produce the desired number of partitions. Each partition is represented by the MBR (minimum bounding rectangle) of the points it contains, and these MBRs are stored in an R*-tree or any other spatial index.

2. Computing Bounds for Partitions: For each partition P, we have to compute two bounds, P.lower and P.upper, with the following property:

For each point p in P, P.lower ≤ Dk(p) ≤ P.upper

How are these bounds found? The idea is quite simple: find the partitions closest to P such that the total number of points in those partitions is at least k. Then the MINDIST and MAXDIST (defined in the concepts page) between those partitions and P are used to calculate the bounds.
Figure 1 illustrates how MINDIST and MAXDIST are used in the calculation of P.lower.
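As a concrete illustration, here is a small Python sketch (not part of the original text) of MINDIST and MAXDIST between two MBRs, using the usual per-dimension formulas; each MBR is represented as a (low-corner, high-corner) pair of coordinate tuples, a representation chosen here only for the example:

```python
import math

def mindist(p, q):
    """Smallest possible distance between a point of MBR p and a point of MBR q.
    Per dimension, the gap is zero if the intervals overlap; otherwise it is
    the distance between the facing edges."""
    (plo, phi), (qlo, qhi) = p, q
    s = 0.0
    for i in range(len(plo)):
        gap = max(0.0, qlo[i] - phi[i], plo[i] - qhi[i])
        s += gap * gap
    return math.sqrt(s)

def maxdist(p, q):
    """Largest possible distance between a point of MBR p and a point of MBR q.
    Per dimension, take the farther of the two opposite-corner separations."""
    (plo, phi), (qlo, qhi) = p, q
    s = 0.0
    for i in range(len(plo)):
        d = max(abs(qhi[i] - plo[i]), abs(phi[i] - qlo[i]))
        s += d * d
    return math.sqrt(s)
```

For two unit squares with a gap of 2 between them on the x-axis, mindist is 2.0 and maxdist is sqrt(17); for overlapping MBRs, mindist is 0.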
An optimization like the one in the simple index-based algorithm is also possible while calculating the bounds. If at any moment P.upper drops below min(Dk value of a potential outlier), we can safely stop refining the bounds of partition P, because P is guaranteed not to contain any candidate for being one of the n outliers. Also, if MAXDIST(P, Q) exceeds the running estimate of P.upper, partition Q cannot improve the upper bound, and if MINDIST(P, Q) exceeds the running estimate of P.lower, Q cannot improve the lower bound; in either case the algorithm can safely leave Q out of the corresponding estimate. You can easily derive this by the same reasoning we used in the previous page.

Two heaps are used:
i) lowerHeap: stores partitions in decreasing order of MINDIST from partition P; the partition with the largest MINDIST is on top.
ii) upperHeap: stores partitions in decreasing order of MAXDIST from partition P; the partition with the largest MAXDIST is on top.

Algorithm for Computing the Upper and Lower Bounds for a Partition P:

Traverse the spatial index tree containing the partitions.
When a leaf node, holding a number of partitions, is reached:
    for each partition Q in the leaf
        if MINDIST(P, Q) < current P.lower
            insert Q into lowerHeap, and pop partitions from the top so that lowerHeap stays minimal while its partitions still contain at least k points
            new estimate of P.lower = max{ MINDIST(P, R) : R in lowerHeap }
        if MAXDIST(P, Q) < current P.upper
            insert Q into upperHeap, and pop partitions from the top so that upperHeap stays minimal while its partitions still contain at least k points
            new estimate of P.upper = max{ MAXDIST(P, R) : R in upperHeap }
        if current P.upper ≤ current lower bound of Dk for an outlier
            return immediately; further bounds calculation for P is unnecessary, since P cannot be a candidate partition anyway
For each internal node of the spatial tree
    if P.lower ≤ MINDIST(P, Node) and P.upper ≤ MINDIST(P, Node)
        prune the whole node from the bounds calculation of P, since every partition Q under the node has MINDIST(P, Q) and MAXDIST(P, Q) at least MINDIST(P, Node), so Q can improve neither bound
Return P.lower and P.upper
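The heap maintenance above can be sketched as follows. This is a simplified Python version over a flat list of partitions (the tree traversal, node pruning, and early termination are omitted); the helper names `bound` and `compute_lower_upper`, and the dict-based partition representation, are illustrative, not from the original:

```python
import heapq

def bound(dist_count_pairs, k):
    """Keep the partitions with the smallest distances whose point counts
    sum to at least k; return the largest distance among them.
    Assumes the partitions together hold at least k points."""
    heap = []   # max-heap via negated distances: entries are (-dist, count)
    total = 0
    for d, c in dist_count_pairs:
        if total < k or d < -heap[0][0]:
            heapq.heappush(heap, (-d, c))
            total += c
            # shrink: drop the farthest partition while the rest still cover k points
            while total - heap[0][1] >= k:
                total -= heapq.heappop(heap)[1]
    return -heap[0][0]

def compute_lower_upper(P, others, k, mindist, maxdist):
    """P.lower from MINDISTs, P.upper from MAXDISTs, as in the algorithm above."""
    lower = bound(((mindist(P, Q), Q["count"]) for Q in others), k)
    upper = bound(((maxdist(P, Q), Q["count"]) for Q in others), k)
    return lower, upper
```

With k = 4 and partitions at MINDISTs 1, 2 and 5 holding 3, 2 and 4 points, the two nearest partitions already cover 4 points, so P.lower = 2; the third partition is never needed.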
3. Computing Candidate Partitions: Before describing the fourth step of the partition-based algorithm, look again at the algorithm above. Do you find something inconsistent? The algorithm uses min(current lower bound of Dk value for a candidate outlier), a quantity that has not been defined yet. Let us call this value minDkDist:

minDkDist = a lower bound on the Dk value a point must have to be an outlier

minDkDist is an important parameter that will be used in computing the candidate outliers. But how is it estimated?

Computing minDkDist:
i) Take the m partitions P1, P2, ..., Pm with the maximum values of P.lower, such that these m partitions contain at least n points.
ii) minDkDist is the minimum P.lower value among these partitions:

minDkDist = min { Pi.lower : 1 ≤ i ≤ m }

This bound is valid because the m partitions contain at least n points, each with Dk at least its partition's lower bound, hence at least minDkDist; so the n-th largest Dk value in the data set (the value an outlier's Dk must reach) is at least minDkDist.
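The two-step recipe above can be sketched in a few lines of Python (the function name and the pair-based representation of partitions are illustrative, not from the original):

```python
def min_dk_dist(partitions, n):
    """partitions: (P.lower, point count) pairs; n = number of outliers sought.
    Walk the partitions in decreasing order of P.lower until at least n points
    are covered; the smallest P.lower seen at that moment is minDkDist."""
    total = 0
    for lower, count in sorted(partitions, key=lambda t: -t[0]):
        total += count
        if total >= n:
            return lower
    return 0.0  # fewer than n points in total: no pruning is possible
```

For example, with partitions (P.lower = 5.0, 2 points), (3.0, 3 points), (1.0, 10 points) and n = 4, the two top partitions cover 5 points, so minDkDist = 3.0.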
Also, as we have already observed, a partition is guaranteed not to contain any outlier, i.e., it is not a candidate partition, if P.upper, the upper bound on the distance from any of its points to that point's k-th nearest neighbor, falls below minDkDist. This rule is applied in the candidate partition search. Why so much emphasis on minDkDist? Because it is the parameter that lets us prune many partitions from the outlier search, leaving only a very small subset of the original points to be examined, which speeds up the search to a great extent. The computation of the candidate partitions uses the following observations:

1. A partition P does not contain any outlier if the current upper bound on the Dk value of the points it contains, P.upper, falls below minDkDist.
2. We can safely discard any partition that is neither a candidate for containing an outlier nor contains any point that may be one of the k nearest neighbors of an outlier.

We have already used Observation 1, which is just the extension of Observation 1 in the previous page, when computing the bounds for each partition. Observation 2 needs a little explanation. It is not sufficient to keep only the candidate partitions that may contain an outlier, because other partitions may contain points that are not outliers themselves but are close neighbors of an outlier. If a partition containing a point that could be the k-th neighbor of an outlier were removed, information about that outlier's Dk value would be lost, and consequently the outlier itself could be missed. So we must keep both the candidate partitions and their neighbor partitions, where a neighbor partition of P is a partition that may contain one of the k nearest neighbors of some point in P. Figure 3 demonstrates how to determine the neighbor partitions of a given partition P.
Now we can integrate the procedures described above into the following algorithm, which computes the candidate partitions:

for each partition P
    compute P.lower and P.upper as in Step 2
compute minDkDist:
    a. take the m partitions P1, P2, ..., Pm with the maximum values of P.lower, such that these m partitions contain at least n points
    b. minDkDist is the minimum P.lower value among these partitions, i.e., minDkDist = min { Pi.lower : 1 ≤ i ≤ m }
for each partition P
    if P.upper ≥ minDkDist
        add P to the set of candidate partitions PSet
        for each partition Q
            if P.upper ≥ MINDIST(P, Q)
                add Q to the neighbors of P

Output: a set of candidate partitions PSet and, for each partition P in PSet, a list of its neighbor partitions.

4. Computing Outliers from Candidate Partitions: If the previous steps were properly performed, only a small subset of points remains in the candidate partition set PSet, and only a subset of the original points has to be searched for the neighbors of each candidate. The index-based algorithm described in the previous page can now be used to compute the outliers.

Performance: The partition-based algorithm scales well with both the data size and the data dimensionality. In most cases it is much faster than the block nested-loop join or index-based join algorithms.
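The candidate-and-neighbor selection at the heart of Step 3 can be sketched as follows. This is an illustrative Python fragment, not the original procedure: the function name, the dict-based partitions with precomputed "upper" bounds, and the 1-D distance used in the example are all assumptions made for the sketch:

```python
def candidate_partitions(partitions, minDkDist, mindist):
    """Return the candidate set PSet and, for each candidate P, its
    neighbor partitions: those Q with MINDIST(P, Q) <= P.upper, which
    may hold a k-th nearest neighbor of some point in P (P's own
    points are always searched as well, so P is omitted from its list)."""
    pset = [P for P in partitions if P["upper"] >= minDkDist]
    neighbors = {P["name"]: [Q for Q in partitions
                             if Q is not P and mindist(P, Q) <= P["upper"]]
                 for P in pset}
    return pset, neighbors
```

With minDkDist = 2, a partition whose P.upper is 5 survives as a candidate, while partitions with upper bounds 1 and 0.5 are pruned; the pruned partitions are still kept as neighbors wherever they fall within a candidate's P.upper.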
Prev: Two Straightforward Solutions | Next: Applet