Prev: Two Straightforward Solutions                                                                                                               Next: Applet

The Partition Based Solution

The partition-based algorithm addresses the time-complexity problem of the simple algorithms described on the previous page.

The idea is to first partition the data space and then prune a partition as soon as it can be determined that it cannot contain any outlier. After this pruning, only a subset of the original partitions, holding far fewer points, remains, and these remaining points go through an algorithm similar to the one described on the previous page. Because the preprocessing step that finds candidate partitions works at the granularity of whole partitions rather than individual points, its overhead is small.

The steps of the algorithm are:

         1. Generate partitions.

         2. Compute bounds on Dk for the points in each partition.

         3. Identify candidate partitions, i.e. partitions that may contain outliers.

         4. Compute the outliers from the points in the candidate partitions.

1. Generating Partitions:

Partitions are generated so that points that are close together end up in the same partition. Any suitable clustering algorithm can be used to obtain the partitions; since the main concern here is time complexity, the algorithm uses BIRCH, described in [ZRL96], because its running time is linear in the input size. For a detailed description of BIRCH, please refer to [ZRL96] and [Sam89]. BIRCH can be tuned to produce the desired number of partitions. Each partition is represented by the MBR (minimum bounding rectangle) of the points it contains, and these MBRs are stored in an R*-tree or any other spatial index.
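As a rough illustration of this step, the sketch below clusters a NumPy array of points of shape (N, d) with scikit-learn's Birch implementation, used here only as a convenient stand-in for the BIRCH algorithm of [ZRL96], and represents each resulting partition by its MBR. The function name and the dictionary representation of a partition are invented for this page.

import numpy as np
from sklearn.cluster import Birch

def generate_partitions(points, n_partitions):
    """Sketch of step 1: cluster the points with BIRCH and represent each
    resulting partition by its MBR (minimum bounding rectangle)."""
    labels = Birch(n_clusters=n_partitions).fit_predict(points)
    partitions = []
    for label in np.unique(labels):
        members = points[labels == label]
        mbr = (members.min(axis=0), members.max(axis=0))   # (low corner, high corner)
        partitions.append({"points": members, "mbr": mbr})
    return partitions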

2. Computing Bounds for Partitions:

For each partition P, we have to compute two bounds P.lower and P.upper which have the following property:

       For each point p in P,   P.lower ≤ Dk(p) ≤ P.upper

How do we find these bounds? The idea is quite simple: find some partitions closest to P such that the total number of points in those partitions is at least k. The MINDIST and MAXDIST between those partitions and P (defined on the concepts page) are then used to calculate the bounds. Figure 1 shows how MINDIST and MAXDIST are used to compute P.lower and P.upper.

 

Let us assume that k = 11. Partition Q is the closest partition to partition P. Q contains 5 points, shown in red, and P contains 6 points, shown in blue. Since the total number of points in P and Q is at least k, Dk(p) for any point p in P is at least MINDIST(P, Q). It is also clear that the distance between any point of P and any point of Q is at most MAXDIST(P, Q), so Dk(p) can never exceed MAXDIST(P, Q).

Figure 1: Using MINDIST and MAXDIST to approximate P.lower and P.upper respectively
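The sketch below shows one common way to compute MINDIST and MAXDIST between two MBRs, assuming the usual Euclidean rectangle-to-rectangle definitions (the smallest and largest possible distance between a point of one rectangle and a point of the other); refer to the concepts page for the definitions actually used. The MBR representation (a pair of low/high corner arrays) matches the partitioning sketch above.

import numpy as np

def mindist(mbr_p, mbr_q):
    """Smallest possible distance between a point of MBR P and a point of MBR Q."""
    (p_lo, p_hi), (q_lo, q_hi) = mbr_p, mbr_q
    gap = np.maximum(0.0, np.maximum(q_lo - p_hi, p_lo - q_hi))    # per-dimension gap
    return float(np.sqrt(np.sum(gap ** 2)))

def maxdist(mbr_p, mbr_q):
    """Largest possible distance between a point of MBR P and a point of MBR Q."""
    (p_lo, p_hi), (q_lo, q_hi) = mbr_p, mbr_q
    spread = np.maximum(np.abs(p_hi - q_lo), np.abs(q_hi - p_lo))  # per-dimension max separation
    return float(np.sqrt(np.sum(spread ** 2)))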

 

An optimization similar to the one used in the simple index-based algorithm is also possible while calculating the bounds. If at any moment P.upper falls below the minimum lower bound on the Dk value of a potential outlier, we can safely prune the whole partition P, because it is then guaranteed not to contain any point that could be one of the n outliers. Also, if a partition Q has MAXDIST(P,Q) or MINDIST(P,Q) no smaller than the running value of P.upper or P.lower respectively, the algorithm can safely ignore Q when estimating the corresponding bound of P. You can derive this with the same reasoning used on the previous page.

The algorithm can be described as follows. It uses two heaps:

        i) lowerHeap: stores partitions in decreasing order of their MINDIST from partition P; the partition with the largest MINDIST is at the top of lowerHeap.

        ii) upperHeap: stores partitions in decreasing order of their MAXDIST from partition P; the partition with the largest MAXDIST is at the top of upperHeap.

Algorithm for Computing the Upper and Lower Bounds for a Partition P:

Traverse the spatial index tree containing the partitions.

        If a leaf node containing a set of partitions is reached:

                for each partition Q in the leaf

                        if ( MINDIST(P,Q) < current P.lower )

                                insert Q into lowerHeap, keeping the number of partitions in lowerHeap
                                minimal while the partitions in lowerHeap still contain at least k points

                                new estimate of P.lower = max( MINDIST(P, R) over all partitions R in lowerHeap )

                        if ( MAXDIST(P,Q) < current P.upper )

                                insert Q into upperHeap, keeping the number of partitions in upperHeap
                                minimal while the partitions in upperHeap still contain at least k points

                                new estimate of P.upper = max( MAXDIST(P, R) over all partitions R in upperHeap )

                                if ( current P.upper ≤ minDkDist, the current lower bound on the Dk value of an outlier )

                                        return immediately; computing further bounds for P is unnecessary,
                                        since P cannot be a candidate partition anyway

        For each remaining node in the spatial tree

                if ( P.lower ≤ MINDIST(P, Node) and P.upper ≤ MAXDIST(P, Node) )

                        prune the whole node from the bound computation for P

return P.lower and P.upper
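Below is a condensed Python sketch of this bound computation, reusing the mindist/maxdist helpers and the partition dictionaries from the earlier sketches; the function and field names are invented for this page. For brevity it scans all partitions in order of MINDIST instead of traversing the spatial index, so the node-pruning step is omitted, and minDkDist (explained in the next step) is passed in for the early-exit test.

import heapq

def compute_bounds(P, partitions, k, min_dk_dist=0.0):
    """Sketch of step 2 for a single partition P: estimate P.lower and P.upper
    from the partitions closest to P by MINDIST / MAXDIST."""
    P["lower"] = P["upper"] = float("inf")
    lower_heap, upper_heap = [], []   # max-heaps (keys negated); entries are (-distance, point count)
    lower_pts = upper_pts = 0         # number of points currently held in each heap
    # P itself also participates: a point's k nearest neighbors may lie inside P (cf. Figure 1)
    for Q in sorted(partitions, key=lambda q: mindist(P["mbr"], q["mbr"])):
        d_min = mindist(P["mbr"], Q["mbr"])
        d_max = maxdist(P["mbr"], Q["mbr"])
        if d_min < P["lower"]:
            heapq.heappush(lower_heap, (-d_min, len(Q["points"])))
            lower_pts += len(Q["points"])
            # drop the farthest partitions while at least k points remain
            while lower_pts - lower_heap[0][1] >= k:
                lower_pts -= heapq.heappop(lower_heap)[1]
            if lower_pts >= k:
                P["lower"] = -lower_heap[0][0]
        if d_max < P["upper"]:
            heapq.heappush(upper_heap, (-d_max, len(Q["points"])))
            upper_pts += len(Q["points"])
            while upper_pts - upper_heap[0][1] >= k:
                upper_pts -= heapq.heappop(upper_heap)[1]
            if upper_pts >= k:
                P["upper"] = -upper_heap[0][0]
            if P["upper"] <= min_dk_dist:
                return                # P cannot contain one of the n outliers; stop early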

 

3. Computing Candidate Partitions:

Before moving on, look back at the algorithm above. Do you find something inconsistent? It uses a quantity that has not been defined anywhere yet.

That quantity is minDkDist, the minimum lower bound on the Dk value of a candidate outlier. It is computed in the present step; in practice a running estimate of minDkDist is maintained while the bounds of step 2 are being computed, so that partitions can be pruned as early as possible.

                    minDkDist = the minimum Dk value a point must reach to be one of the n outliers; any point with Dk(p) < minDkDist cannot be an outlier

This minDkDist is an important parameter that will be used in computing the candidate partitions. But how is minDkDist estimated?

Computing minDkDist:

   i) Take the m partitions P1, P2, . . . , Pm with the maximum values of P.lower, such that these m partitions together contain at least n points.

   ii) minDkDist is the minimum P.lower value among these partitions:

           minDkDist = min { Pi.lower : 1 ≤ i ≤ m }

Since these m partitions hold at least n points and every one of those points has a Dk value no smaller than its partition's lower bound, there are at least n points with Dk ≥ minDkDist; hence no point with a smaller Dk value can be one of the n outliers.
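A minimal sketch of this computation, again assuming the partition dictionaries from the earlier sketches (with a "lower" field filled in by the bound computation of step 2):

def compute_min_dk_dist(partitions, n):
    """Sketch: take the partitions with the largest P.lower values until they
    together hold at least n points; minDkDist is the smallest P.lower among them."""
    total, min_dk_dist = 0, 0.0
    for P in sorted(partitions, key=lambda p: p["lower"], reverse=True):
        min_dk_dist = P["lower"]
        total += len(P["points"])
        if total >= n:
            break
    return min_dk_dist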

Let us assume that the points are partitioned into four partitions A, B, C and D, as shown in Figure 2, and first assume that n = 3, i.e. we want the top 3 outliers.

Outliers must have high Dk values. Since n = 3, we only need to consider the two partitions with the largest P.lower values, B and D, which together contain at least 3 points.

Now B.lower = 7 and D.lower = 10, so any point whose Dk value is less than 7 cannot possibly be an outlier. Hence minDkDist = 7 in this case.

However, if n = 7, then B, C and D all have to be considered, because only then does the number of points in the chosen partitions reach n. In that case minDkDist = 3.

Figure 2: Computing the minDkDist

 

We have also already observed that a partition cannot be a candidate partition, i.e. it is guaranteed not to contain any outlier, if P.upper, the upper bound on the distance from any of its points to that point's k-th nearest neighbor, falls below minDkDist. This rule is applied when searching for candidate partitions.

Why so much fuss about minDkDist? Because minDkDist is the parameter that lets us prune many partitions from the outlier search, leaving only a small subset of the original points to be examined. This speeds up the search enormously.

The computation of the candidate partitions uses the following observations:

              1. A partition P does not contain any outlier if the current upper bound on the Dk values of the points contained in it, P.upper, falls below minDkDist.

              2. We can safely discard any partition that is neither a candidate for containing an outlier nor contains any point that may be one of the k nearest neighbors of an outlier.

              We have already used Observation 1, which is just an extension of Observation 1 on the previous page, while computing the bounds for each partition. Observation 2 deserves a little more explanation.

                It is not sufficient to keep only the candidate partitions that may contain an outlier. Other partitions may contain points that are not outliers themselves but are close neighbors of an outlier. If such a partition contains a point that could be the k-th nearest neighbor of an outlier, discarding the partition would lose information about that outlier's Dk value, and consequently about the outlier itself. So we must keep both the candidate partitions and their neighbor partitions, where a neighbor partition of a candidate partition is one that may contain one of the k nearest neighbors of some point in the candidate partition. Figure 3 demonstrates how to determine the neighbor partitions of a given partition P.

 

In this figure, partition P has P.upper = 5, meaning that for any point inside P, the distance to its k-th nearest neighbor will not exceed 5 units. The rounded rectangle is the 'region of influence' of P, i.e. the region that may contain the k-th nearest neighbor of any point inside P; its boundary lies 5 units beyond the MBR of P because P.upper = 5.

Partition R does not overlap the region of influence, so R cannot contain any point that is among the k nearest neighbors of a point in P. Indeed, R is not a neighbor of P because MINDIST(P,R) > P.upper.

Partition Q, on the other hand, overlaps the region of influence of P, so it may contain points that are among the k nearest neighbors of some point in P.

We conclude that a partition Q is a neighbor of partition P iff MINDIST(P,Q) ≤ P.upper.

Figure 3: Finding a neighbor partition of Partition P

 

Now we can integrate the procedures described above into the following algorithm for computing the candidate partitions.

for each partition P

              compute P.lower and P.upper using the bound computation of step 2

compute minDkDist as follows:

              a. take the m partitions P1, P2, . . . , Pm with the maximum values of P.lower, such that these m partitions together contain at least n points

              b. minDkDist is the minimum P.lower value among these partitions, i.e. minDkDist = min { Pi.lower : 1 ≤ i ≤ m }

for each partition P

              if ( P.upper ≥ minDkDist )

                     add P to the set of candidate partitions PSet

                     for each partition Q

                             if ( MINDIST(P,Q) ≤ P.upper )

                                           add Q to the list of neighbors of P

Output: the set of candidate partitions PSet and, for each partition P in PSet, its list of neighbor partitions.
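Putting the pieces together, here is a sketch of the whole candidate-partition step, reusing the helpers and partition dictionaries from the earlier sketches (all names are invented for this page). For simplicity it computes all bounds first and derives minDkDist afterwards; as noted above, the actual algorithm maintains a running estimate of minDkDist while the bounds are being computed so that partitions can be pruned early.

def compute_candidate_partitions(partitions, k, n):
    """Sketch of step 3: bound every partition, derive minDkDist, and keep only
    the candidate partitions together with their neighbor partitions."""
    for P in partitions:
        compute_bounds(P, partitions, k)
    min_dk_dist = compute_min_dk_dist(partitions, n)
    candidates = []
    for P in partitions:
        if P["upper"] >= min_dk_dist:          # P may still contain one of the n outliers
            neighbors = [Q for Q in partitions
                         if mindist(P["mbr"], Q["mbr"]) <= P["upper"]]
            candidates.append((P, neighbors))
    return candidates, min_dk_dist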

4. Computing Outliers from Candidate Partitions:

If the previous steps were performed properly, only a small subset of the points remains in the candidate partition set PSet, and for each candidate partition only its neighbor partitions need to be searched for nearest neighbors. The index-based algorithm described on the previous page can now be applied to this reduced set of points to compute the outliers.

Performance:

The partition-based algorithm scales well with both the size and the dimensionality of the data set. In most cases it is much faster than the block nested-loop and index-based algorithms.

 

 

Prev: Two Straightforward Solutions                                                                                                               Next: Applet

