数据挖掘之异常检测.ppt_咨信网zixin.com.cn

资源描述

Anomaly Detection:A introduction Source of slides:Tutorial At American Statistical Association(ASA2008)Jiawei Han-data mining:concepts and techniquesTutorial at the European Conference on Principles and Practice of Knowledge Discovery in DatabasesSpeaker:Wentao LiOutlineDefinitionApplicationMethodsLimited time,So I just draw the picture of anomaly detection,for more detail,please turn to the paper for help.What are Anomalies?Anomaly is a pattern in the data that does not conform to the expected behaviorAnomaly is A data object that deviates significantly from the normal objects as if it were generated by a different mechanismAlso referred to as outliers,exceptions,peculiarities,surprises,etc.Anomalies translate to significant(often critical)real life entitiesCyber intrusionsCredit card fraudFaults in mechanical systemsRelated problemsOutliers are different from the noise data Noise is random error or variance in a measured variableNoise should be removed before outlier detectionOutliers are interesting:It violates the mechanism that generates the normal dataOutlier detection vs.novelty detection:early stage,outlier;but later merged into the modelKey ChallengesDefining a representative normal region is challengingThe boundary between normal and outlying behavior is often not preciseAvailability of labeled data for training/validationThe exact notion of an outlier is different for different application domainsData might contain noiseNormal behavior keeps evolvingAppropriate selection of relevant featuresMapRelated areas(theory)Application(practice)Problem formulationDetection effect+Aspects of Anomaly Detection ProblemNature of input data What is the characteristic of input dataAvailability of supervision Number of labelType of anomaly:point,contextual,structuralType of anomaly Output of anomaly detection Score vs labelEvaluation of anomaly detection techniques What kind of detection is goodInput DataMost common form of data handled by anomaly detection techniques is Record DataUnivariateMultivariateInput DataMost common form of data handled by anomaly detection techniques is Record DataUnivariateMultivariateInput Data Nature of AttributesNature of attributesBinaryCategoricalContinuousHybridcategoricalcontinuouscontinuouscategoricalbinaryInput Data Complex Data TypesRelationship among data instancesSequential TemporalSpatialSpatio-temporalGraphData LabelsSupervised Anomaly DetectionLabels available for both normal data and anomaliesSemi-supervised Anomaly DetectionLabels available only for normal dataUnsupervised Anomaly DetectionNo labels assumedBased on the assumption that anomalies are very rare compared to normal dataPay attention:here some materials give different descriptions,and we treat adopt the definition here though it is a bit ambiguous with the traditional definitionalType of Anomalies*Point AnomaliesContextual AnomaliesCollective AnomaliesPoint AnomaliesAn individual data instance is anomalous w.r.t.the dataXYN1N2o1o2O3Contextual AnomaliesAn individual data instance is anomalous within a contextRequires a notion of contextAlso referred to as conditional anomalies*Dangerous+theft condition=theftMoney consumer:the poor and the rich*Xiuyao Song,Mingxi Wu,Christopher Jermaine,Sanjay Ranka,Conditional Anomaly Detection,IEEE Transactions on Data and Knowledge Engineering,2006.NormalAnomalyCollective AnomaliesA collection of related data instances is anomalousRequires a relationship among data instancesSequential DataSpatial DataGraph DataThe individual instances within a collective anomaly are not anomalous by themselvesAnomalous SubsequenceOutput of Anomaly DetectionLabelEach test instance is given a normal or anomaly labelThis is especially true of classification-based approachesScoreEach test instance is assigned an anomaly scoreAllows the output to be rankedRequires an additional threshold parameterEvaluation of Anomaly Detection F-valueAccuracy is not sufficient metric for evaluationExample:network traffic data set with 99.9%of normal data and 0.1%of intrusionsTrivial classifier that labels everything with the normal class can achieve 99.9%accuracy!anomaly class Cnormal class NCFocus on both recall and precisionRecall (R)=TP/(TP+FN)true predicted anomaly/all anomalyPrecision(P)=TP/(TP+FP)true predicted anomaly/all predictedF measure=2*R*P/(R+P)=Evaluation of Outlier Detection ROC&AUCStandard measures for evaluating anomaly detection problems:Recall(Detection rate)-ratio between the number of correctly detected anomalies and the total number of anomaliesFalse alarm(false positive)rate ratio between the number of data records from normal class that are misclassified as anomalies and the total number of data records from normal class ROC Curve is a trade-off between detection rate and false alarm rateArea under the ROC curve(AUC)is computed using a trapezoid ruleThe best:|_ the worest:_|anomaly class Cnormal class NCAUCIdeal ROC curveApplications of Anomaly DetectionNetwork intrusion detectionInsurance/Credit card fraud detectionHealthcare Informatics/Medical diagnosticsIndustrial Damage DetectionImage Processing/Video surveillance Novel Topic Detection in Text MiningFraud DetectionFraud detection refers to detection of criminal activities occurring in commercial organizationsMalicious users might be the actual customers of the organization or might be posing as a customer(also known as identity theft).Types of fraudCredit card fraudInsurance claim fraudMobile/cell phone fraudInsider tradingChallengesFast and accurate real-time detectionMisclassification cost is very highHealthcare InformaticsDetect anomalous patient recordsIndicate disease outbreaks,instrumentation errors,etc.Key ChallengesOnly normal labels availableMisclassification cost is very highData can be complex:spatio-temporalImage ProcessingDetecting outliers in a image or video monitored over timeDetecting anomalous regions within an imageUsed in mammography image analysisvideo surveillance satellite image analysisKey ChallengesDetecting collective anomaliesData sets are very largeAnomalyTaxonomy*Anomaly DetectionContextual Anomaly DetectionCollective Anomaly DetectionOnline Anomaly DetectionDistributed Anomaly DetectionPoint Anomaly DetectionClassification BasedRule BasedNeural Networks BasedSVM BasedNearest Neighbor BasedDensity BasedDistance BasedStatisticalParametricNon-parametricClustering BasedOthersInformation Theory BasedSpectral Decomposition BasedVisualization BasedStatistical ApproachesStatistical approaches assume that the objects in a data set are generated by a stochastic process(a generative model)Idea:learn a generative model fitting the given data set,and then identify the objects in low probability regions of the model as outliersMethods are divided into two categories:parametric vs.non-parametric Parametric methodAssumes that the normal data is generated by a parametric distribution with parameter The probability density function of the parametric distribution f(x,)gives the probability that object x is generated by the distributionThe smaller this value,the more likely x is an outlierNon-parametric methodNot assume an a-priori statistical model and determine the model from the input dataNot completely parameter free but consider the number and nature of the parameters are flexible and not fixed in advanceExamples:histogram and kernel density estimationParametric Methods I:Detection Univariate Outliers Based on Normal DistributionUnivariate data:A data set involving only one attribute or variableOften assume that data are generated from a normal distribution,learn the parameters from the input data,and identify the points with low probability as outliersEx:Avg.temp.:24.0,28.9,28.9,29.0,29.1,29.1,29.2,29.2,29.3,29.4Use the maximum likelihood method to estimate and nTaking derivatives with respect to and 2,we derive the following maximum likelihood estimatesnFor the above data with n=10,we havenThen(24 28.61)/1.51=3.04 rEfficient computation:Nested loop algorithmFor any object oi,calculate its distance from other objects,and count the#of other objects in the r-neighborhood.If n other objects are within r distance,terminate the inner loopOtherwise,oi is a DB(r,)outlierEfficiency:Actually CPU time is not O(n2)but linear to the data set size since for most non-outlier objects,the inner loop terminates early35Density-Based Outlier DetectionLocal outliers:Outliers comparing to their local neighborhoods,instead of the global data distributionIn Fig.,o1 and o2 are local outliers to C1,o3 is a global outlier,but o4 is not an outlier.However,proximity-based clustering cannot find o1 and o2 are outlier(e.g.,comparing with O4).36nIntuition(density-based outlier detection):The density around an outlier object is significantly different from the density around its neighborsnMethod:Use the relative density of an object against its neighbors as the indicator of the degree of the object being outliersnk-distance of an object o,distk(o):distance between o and its k-th NNnk-distance neighborhood of o,Nk(o)=o|o in D,dist(o,o)distk(o)nNk(o)could be bigger than k since multiple objects may have identical distance to oLocal Outlier Factor:LOFReachability distance from o to o:where k is a user-specified parameterLocal reachability density of o:37nLOF(Local outlier factor)of an object o is the average of the ratio of local reachability of o and those of os k-nearest neighborsnThe lower the local reachability density of o,and the higher the local reachability density of the kNN of o,the higher LOFnThis captures a local outlier whose local density is relatively low comparing to the local densities of its kNNClustering-Based Outlier Detection(1&2):Not belong to any cluster,or far from the closest oneAn object is an outlier if(1)it does not belong to any cluster,(2)there is a large distance between the object and its closest cluster,or(3)it belongs to a small or sparse cluster nCase I:Not belong to any clusternIdentify animals not part of a flock:Using a density-based clustering method such as DBSCANnCase 2:Far from its closest cluster nUsing k-means,partition data points of into clusters nFor each object o,assign an outlier score based on its distance from its closest center nIf dist(o,co)/avg_dist(co)is large,likely an outliernEx.Intrusion detection:Consider the similarity between data points and the clusters in a training data setnUse a training set to find patterns of“normal”data,e.g.,frequent itemsets in each segment,and cluster similar connections into groupsnCompare new data points with the clusters minedOutliers are possible attacks39FindCBLOF:Detect outliers in small clustersFind clusters,and sort them in decreasing sizeTo each data point,assign a cluster-based local outlier factor(CBLOF):If obj p belongs to a large cluster,CBLOF=cluster_size X similarity between p and clusterIf p belongs to a small one,CBLOF=cluster size X similarity betw.p and the closest large cluster40Clustering-Based Outlier Detection(3):Detecting Outliers in Small ClustersnEx.In the figure,o is outlier since its closest large cluster is C1,but the similarity between o and C1 is small.For any point in C3,its closest large cluster is C2 but its similarity from C2 is low,plus|C3|=3 is smallClustering-Based Method:Strength and WeaknessStrengthDetect outliers without requiring any labeled data Work for many types of dataClusters can be regarded as summaries of the dataOnce the cluster are obtained,need only compare any object against the clusters to determine whether it is an outlier(fast)WeaknessEffectiveness depends highly on the clustering method usedthey may not be optimized for outlier detectionHigh computational cost:Need to first find clustersA method to reduce the cost:Fixed-width clusteringA point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold from the pointIf a point cannot be assigned to any existing cluster,a new cluster is created and the distance threshold may be learned from the training data under certain conditionsClassification-Based Method I:One-Class ModelIdea:Train a classification model that can distinguish“normal”data from outliersA brute-force approach:Consider a training set that contains samples labeled as“normal”and others labeled as“outlier”But,the training set is typically heavily biased:#of“normal”samples likely far exceeds#of outlier samplesCannot detect unseen anomaly43nOne-class model:A classifier is built to describe only the normal class.nLearn the decision boundary of the normal class using classification methods such as SVMnAny samples that do not belong to the normal class(not within the decision boundary)are declared as outliersnAdv:can detect new outliers that may not appear close to any outlier objects in the training setnExtension:Normal objects may belong to multiple classesClassification-Based Method II:Semi-Supervised LearningSemi-supervised learning:Combining classification-based and clustering-based methodsMethodUsing a clustering-based approach,find a large cluster,C,and a small cluster,C1Since some objects in C carry the label“normal”,treat all objects in C as normalUse the one-class model of this cluster to identify normal objects in outlier detectionSince some objects in cluster C1 carry the label“outlier”,declare all objects in C1 as outliersAny object that does not fall into the model for C(such as a)is considered an outlier as well44nComments on classification-based outlier detection methodsnStrength:Outlier detection is fastnBottleneck:Quality heavily depends on the availability and quality of the training set,but often difficult to obtain representative and high-quality training data

展开阅读全文