Some Formal Definitions of Outliers

Different approaches to defining outliers exist:

Statistical Approach: The statistics community's approach is to fit the data points to a statistical distribution. Once a model for the data is obtained, the model itself defines precisely which points are outliers. While this approach is sound in principle, the underlying distribution is usually unknown, so it often fails in practice.

Distance Based Definition: As proposed by Knorr and Ng in [KN98], the distance based definition is simple and intuitive: a point p in a data set is an outlier with respect to parameters k and d if no more than k points in the data set are at a distance of d or less from p. Consider the following figure as an example, where d is the radius of the circles. If k=1, then the green point is an outlier, and so is the yellow point, but the blue point is not.
Figure 1: Distance Based Definition

However, if k=0, then only the yellow point is an outlier. If k=4, then the blue, green and yellow points are all outliers. This definition of outliers is quite a popular one. Nevertheless, it has problems:

a) The user must specify the distance parameter d, which can be difficult to choose.

b) There is no way to rank the outliers. In the example above, if k=4 then the yellow, green and blue points are all outliers, but yellow is the 'strongest' outlier of the three: yellow has no other data point within its neighborhood of radius d, while green has one and blue has four. Using the distance based definition, there is no way to differentiate among these three points.
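The distance based test above can be sketched in a few lines of Python. This is a minimal illustration, not code from the tutorial; the function and variable names are my own, and it uses Euclidean distance for concreteness:

```python
import math

def is_outlier(p, points, k, d):
    """A point p is an outlier w.r.t. parameters k and d if no more than
    k points in the data set lie within distance d of p."""
    neighbors = 0
    for q in points:
        if q is p:  # skip p itself
            continue
        if math.dist(p, q) <= d:
            neighbors += 1
            if neighbors > k:  # too many close points: not an outlier
                return False
    return True

# Example: one isolated point among a small cluster
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(is_outlier(pts[3], pts, k=1, d=2.0))  # the isolated point
print(is_outlier(pts[0], pts, k=1, d=2.0))  # a cluster point
```

Note that both k and d must be supplied by the user, which is exactly the weakness discussed above: the result changes entirely depending on how d is chosen.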
k-th Nearest Neighbor Based Definition: As proposed by Ramaswamy, Rastogi and Shim in [RRS00], this definition of outliers addresses the drawbacks of the distance based approach. The definition is: let Dk(p) denote the distance between a point p and the k-th nearest neighbor of p. Given k and n, a point p is an outlier if no more than n-1 other points in the data set have a higher value of Dk than p.

Figure 2: k-th Nearest Neighbor Based Definition (k=1, n=2). A and B are outliers.

In Figure 2, k=1 and n=2, so only the distance from the nearest neighbor is considered, and only two outliers are to be returned. Points A and B have the highest distances from their nearest neighbors, denoted Dk(A) and Dk(B) respectively. So A and B are the outliers for this set of points. Defining an outlier this way has several advantages:

1. There is no need for the user to specify the distance parameter d. Instead, the user specifies the number of outliers n that s/he wants to find. Specifying the number of outliers is a much easier task than specifying the distance, especially for a large database.

2. Using this definition, it is possible to rank the outliers by their Dk(p) values. The larger the value of Dk(p) for a point p, the stronger an outlier p is, because p then has fewer points in its vicinity.

We may use any distance metric, such as Manhattan or Euclidean distance, as appropriate for the particular domain of the data. In this tutorial, we'll first explore two relatively straightforward solutions using this k-th NN based definition. Then we'll explore an efficient partition based algorithm for mining outliers from large data sets, proposed by Ramaswamy, Rastogi and Shim in [RRS00].
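The k-th NN based definition translates directly into a brute-force sketch: compute Dk(p) for every point and return the n points with the largest values. This naive O(N^2) version is my own illustration of the definition, not one of the algorithms covered later in the tutorial; Euclidean distance is assumed:

```python
import math

def top_n_outliers(points, k, n):
    """Return the n points with the largest Dk values, where Dk(p) is
    the distance from p to its k-th nearest neighbor."""
    def d_k(p):
        # Distances from p to every other point, sorted ascending;
        # the (k-1)-th entry is the distance to the k-th nearest neighbor.
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        return dists[k - 1]

    # Ranking falls out for free: sort by Dk descending and take the top n.
    return sorted(points, key=d_k, reverse=True)[:n]

# Example: a tight cluster plus two isolated points, k=1, n=2
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (-10.0, -10.0)]
print(top_n_outliers(pts, k=1, n=2))
```

Note how the two advantages above show up in the sketch: no distance parameter d appears anywhere, and the sort by Dk gives the outliers already ranked from strongest to weakest.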