Format: PPT, Pages: 44, Size: 5.22 MB
Data Mining: Anomaly Detection (数据挖掘之异常检测.ppt)

Anomaly Detection: An Introduction

Source of slides:
- Tutorial at the American Statistical Association (ASA 2008)
- Jiawei Han, Data Mining: Concepts and Techniques
- Tutorial at the European Conference on Principles and Practice of Knowledge Discovery in Databases

Speaker: Wentao Li

Outline
- Definition
- Application
- Methods

Time is limited, so this talk only sketches the overall picture of anomaly detection; for more detail, please refer to the papers.

What are Anomalies?
- An anomaly is a pattern in the data that does not conform to the expected behavior.
- An anomaly is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
- Also referred to as outliers, exceptions, peculiarities, surprises, etc.
- Anomalies translate to significant (often critical) real-life entities:
  - Cyber intrusions
  - Credit card fraud
  - Faults in mechanical systems

Related problems
- Outliers are different from noise.
  - Noise is random error or variance in a measured variable.
  - Noise should be removed before outlier detection.
- Outliers are interesting: they violate the mechanism that generates the normal data.
- Outlier detection vs. novelty detection: at an early stage a novel pattern is treated as an outlier, but later it is merged into the model.

Key Challenges
- Defining a representative normal region is challenging.
- The boundary between normal and outlying behavior is often not precise.
- Availability of labeled data for training/validation.
- The exact notion of an outlier is different for different application domains.
- Data might contain noise.
- Normal behavior keeps evolving.
- Appropriate selection of relevant features.

Map
[Diagram: Related areas (theory), Application (practice), Problem formulation, Detection effect]

Aspects of Anomaly Detection Problem
- Nature of input data: what are the characteristics of the input data?
- Availability of supervision: how many labels are available?
- Type of anomaly: point, contextual, structural
- Output of anomaly detection: score vs. label
- Evaluation of anomaly detection techniques: what kind of detection is good?

Input Data
- The most common form of data handled by anomaly detection techniques is record data:
  - Univariate
  - Multivariate

Input Data: Nature of Attributes
- Nature of attributes:
  - Binary
  - Categorical
  - Continuous
  - Hybrid
[Figure: example record table whose columns are labeled categorical, continuous, continuous, categorical, binary]

Input Data: Complex Data Types
- Relationship among data instances:
  - Sequential
  - Temporal
  - Spatial
  - Spatio-temporal
  - Graph

Data Labels
- Supervised anomaly detection: labels available for both normal data and anomalies.
- Semi-supervised anomaly detection: labels available only for normal data.
- Unsupervised anomaly detection: no labels assumed; based on the assumption that anomalies are very rare compared to normal data.
- Note: some materials give different descriptions. We adopt the definitions here, even though they differ somewhat from the traditional ones.

Type of Anomalies*
- Point anomalies
- Contextual anomalies
- Collective anomalies

Point Anomalies
- An individual data instance is anomalous w.r.t. the data.
[Figure: scatter plot with axes X and Y, labeled N1, N2, o1, o2, O3]

Contextual Anomalies
- An individual data instance is anomalous within a context.
- Requires a notion of context.
- Also referred to as conditional anomalies*.
- Dangerous + theft condition = theft
- Money consumers: the poor and the rich
[Figure labels: Normal, Anomaly]

* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.

Collective Anomalies
- A collection of related data instances is anomalous.
- Requires a relationship among data instances:
  - Sequential data
  - Spatial data
  - Graph data
- The individual instances within a collective anomaly are not anomalous by themselves.
[Figure label: Anomalous Subsequence]

Output of Anomaly Detection
- Label
  - Each test instance is given a normal or anomaly label.
  - This is especially true of classification-based approaches.
- Score
  - Each test instance is assigned an anomaly score.
  - Allows the output to be ranked.
  - Requires an additional threshold parameter.

Evaluation of Anomaly Detection: F-value
- Accuracy is not a sufficient metric for evaluation.
- Example: a network traffic data set with 99.9% normal data and 0.1% intrusions. A trivial classifier that labels everything with the normal class achieves 99.9% accuracy!
[Confusion matrix: anomaly class C, normal class NC]
- Focus on both recall and precision.
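
The 99.9% example above can be checked with a few lines of Python. This is a minimal sketch rather than anything taken from the slides: the helper names (`confusion_counts`, `evaluate`), the 0/1 label encoding, and the convention of returning 0 when a ratio is undefined are all assumptions made for illustration. It uses the standard definitions of recall, precision and F-measure.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` (the anomaly class) as positive."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, recall, precision and F-measure as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_measure": f_measure}


# 1000 records: 999 normal (0) and 1 intrusion (1), i.e. 99.9% vs 0.1%.
y_true = [0] * 999 + [1]
# Trivial classifier: label everything as normal.
y_pred = [0] * 1000

metrics = evaluate(y_true, y_pred)
print(metrics)
```

The trivial classifier scores high on accuracy while its recall and F-measure are zero, which is why accuracy alone is a poor guide on heavily imbalanced data.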

- Recall (R) = TP / (TP + FN): correctly predicted anomalies / all anomalies
- Precision (P) = TP / (TP + FP): correctly predicted anomalies / all predicted anomalies
- F-measure = 2 * R * P / (R + P)

Evaluation of Outlier Detection: ROC & AUC
- Standard measures for evaluating anomaly detection problems:
  - Recall (detection rate): ratio between the number of correctly detected anomalies and the total number of anomalies.
  - False alarm (false positive) rate: ratio between the number of data records from the normal class that are misclassified as anomalies and the total number of data records from the normal class.
- The ROC curve shows the trade-off between detection rate and false alarm rate.
- The area under the ROC curve (AUC) is computed using the trapezoid rule.
- The best: |_ ; the worst: _|
[Figure: ROC curve with the AUC and the ideal ROC curve marked; confusion matrix for anomaly class C and normal class NC]

Applications of Anomaly Detection
- Network intrusion detection
- Insurance/Credit card fraud detection
- Healthcare Informatics/Medical diagnostics
- Industrial Damage Detection

- Image Processing/Video surveillance
- Novel Topic Detection in Text Mining

Fraud Detection
- Fraud detection refers to the detection of criminal activities occurring in commercial organizations.
- Malicious users might be actual customers of the organization or might be posing as a customer (also known as identity theft).
- Types of fraud:
  - Credit card fraud
  - Insurance claim fraud
  - Mobile/cell phone fraud
  - Insider trading
- Challenges:
  - Fast and accurate real-time detection
  - Misclassification cost is very high

Healthcare Informatics
- Detect anomalous patient records.
  - Indicate disease outbreaks, instrumentation errors, etc.
- Key Challenges:
  - Only normal labels available
  - Misclassification cost is very high
  - Data can be complex: spatio-temporal

Image Processing
- Detecting outliers in an image or video monitored over time.
- Detecting anomalous regions within an image.
- Used in:
  - Mammography image analysis
  - Video surveillance
  - Satellite image analysis
- Key Challenges:
  - Detecting collective anomalies
  - Data sets are very large
[Figure label: Anomaly]

Taxonomy*
- Anomaly Detection
  - Point Anomaly Detection
    - Classification Based: Rule Based, Neural Networks Based, SVM Based
    - Nearest Neighbor Based: Density Based, Distance Based
    - Statistical: Parametric, Non-parametric
    - Clustering Based
    - Others: Information Theory Based, Spectral Decomposition Based, Visualization Based
  - Contextual Anomaly Detection
  - Collective Anomaly Detection
  - Online Anomaly Detection
  - Distributed Anomaly Detection

Statistical Approaches
- Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model).
- Idea: learn a generative model fitting the given data set, and then identify the objects in low-probability regions of the model as outliers.
- Methods are divided into two categories: parametric vs. non-parametric.
- Parametric method
  - Assumes that the normal data is generated by a parametric distribution with parameter $\Theta$.
  - The probability density function of the parametric distribution, $f(x, \Theta)$, gives the probability that object x is generated by the distribution.
  - The smaller this value, the more likely x is an outlier.
- Non-parametric method
  - Does not assume an a priori statistical model; the model is determined from the input data.
  - Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance.
  - Examples: histogram and kernel density estimation.

Parametric Methods I: Detecting Univariate Outliers Based on Normal Distribution
- Univariate data: a data set involving only one attribute or variable.
- Often assume that the data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers.
- Ex.: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
- Use the maximum likelihood method to estimate $\mu$ and $\sigma$.
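
The estimation step for the temperature example can be sketched in Python. This is an illustrative sketch under stated assumptions, not code from the slides: the function name `fit_normal_mle` and the choice to rank points by standardized deviation are hypothetical. The maximum likelihood estimate of the variance divides by n rather than n - 1. Because the code works with unrounded values, its printed figures may differ in the last digits from hand-rounded numbers, so any fixed cutoff on the standardized deviation should be treated as a tunable choice.

```python
import math

# Average temperatures from the example above.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]


def fit_normal_mle(xs):
    """Maximum likelihood estimates (mean, standard deviation) for a normal model."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE: divide by n, not n - 1
    return mu, math.sqrt(var)


mu, sigma = fit_normal_mle(temps)

# Standardized deviation of every point from the fitted mean.
scores = [(x - mu) / sigma for x in temps]

# The point with the largest absolute score is the strongest outlier candidate.
most_extreme = max(range(len(temps)), key=lambda i: abs(scores[i]))

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"most extreme value: {temps[most_extreme]} (score {scores[most_extreme]:.2f})")
```

Ranking by absolute score avoids hard-coding a threshold; the value 24.0 stands out because it lies several standard deviations below the mean while every other point lies within one.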

- Taking derivatives with respect to $\mu$ and $\sigma^2$, we derive the following maximum likelihood estimates:
  $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- For the above data with n = 10, we have $\hat{\mu} = 28.61$ and $\hat{\sigma} = 1.51$.
- Then (24 − 28.61)/1.51 = −3.04

[... text missing in the source ...]

Efficient computation: Nested loop algorithm
- For any object $o_i$, calculate its distance from other objects, and count the number of other objects in the r-neighborhood.
- If $\pi \cdot n$ other objects are within distance r, terminate the inner loop.
- Otherwise, $o_i$ is a $DB(r, \pi)$ outlier.
- Efficiency: although the worst case is $O(n^2)$, in practice CPU time is often close to linear in the data set size, since for most non-outlier objects the inner loop terminates early.

Density-Based Outlier Detection
- Local outliers: outliers relative to their local neighborhoods, instead of the global data distribution.
- In the figure, o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an outlier. However, proximity-based clustering cannot find that o1 and o2 are outliers (e.g., compared with o4).
- Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors.
- Method: use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier.
- k-distance of an object o, $dist_k(o)$: distance between o and its k-th nearest neighbor.
- k-distance neighborhood of o: $N_k(o) = \{o' \mid o' \in D,\ dist(o, o') \le dist_k(o)\}$
  - $N_k(o)$ could contain more than k objects, since multiple objects may have identical distance to o.

Local Outlier Factor: LOF
- Reachability distance from o' to o, where k is a user-specified parameter [formula image lost in the source; the standard LOF definition is shown]:
  $reachdist_k(o \leftarrow o') = \max\{dist_k(o),\ dist(o, o')\}$
- Local reachability density of o [formula image lost in the source; the standard LOF definition is shown]:
  $lrd_k(o) = \frac{\|N_k(o)\|}{\sum_{o' \in N_k(o)} reachdist_k(o' \leftarrow o)}$
- The LOF (local outlier factor) of an object o is the average of the ratios between the local reachability densities of o's k-nearest neighbors and that of o.
- The lower the local reachability density of o, and the higher the local reachability density of the kNN of o, the higher the LOF.
- This captures a local outlier whose local density is relatively low compared to the local densities of its kNN.

Clustering-Based Outlier Detection (1 & 2): Not belonging to any cluster, or far from the closest one
- An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster.
- Case 1: Not belonging to any cluster
  - Identify animals that are not part of a flock, using a density-based clustering method such as DBSCAN.
- Case 2: Far from its closest cluster
  - Using k-means, partition the data points into clusters.
  - For each object o, assign an outlier score based on its distance from its closest center $c_o$.
  - If $dist(o, c_o) / avg\_dist(c_o)$ is large, o is likely an outlier.
- Ex.: Intrusion detection: consider the similarity between data points and the clusters in a training data set.
  - Use a training set to find patterns of "normal" data, e.g., frequent itemsets in each segment, and cluster similar connections into groups.
  - Compare new data points with the clusters mined. Outliers are possible attacks.

Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters
- FindCBLOF: detect outliers in small clusters.
  - Find clusters, and sort them in decreasing size.
  - To each data point, assign a cluster-based local outlier factor (CBLOF):
    - If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and the cluster.
    - If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster.
- Ex.: In the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity to C2 is low, plus |C3| = 3 is small.

Clustering-Based Method: Strength and Weakness
- Strength
  - Detects outliers without requiring any labeled data.
  - Works for many types of data.
  - Clusters can be regarded as summaries of the data.
  - Once the clusters are obtained, one need only compare any object against the clusters to determine whether it is an outlier (fast).
- Weakness
  - Effectiveness depends highly on the clustering method used, which may not be optimized for outlier detection.
  - High computational cost: the clusters need to be found first.
- A method to reduce the cost: fixed-width clustering
  - A point is assigned to a cluster if the center of the cluster is within a predefined distance threshold from the point.
  - If a point cannot be assigned to any existing cluster, a new cluster is created. The distance threshold may be learned from the training data under certain conditions.

Classification-Based Method I: One-Class Model
- Idea: train a classification model that can distinguish "normal" data from outliers.
- A brute-force approach: consider a training set that contains samples labeled as "normal" and others labeled as "outlier".
  - But the training set is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples.
  - Cannot detect unseen anomalies.
- One-class model: a classifier is built to describe only the normal class.
  - Learn the decision boundary of the normal class using classification methods such as SVM.
  - Any samples that do not belong to the normal class (not within the decision boundary) are declared outliers.
  - Advantage: can detect new outliers that may not appear close to any outlier objects in the training set.
  - Extension: normal objects may belong to multiple classes.

Classification-Based Method II: Semi-Supervised Learning
- Semi-supervised learning: combining classification-based and clustering-based methods.
- Method
  - Using a clustering-based approach, find a large cluster, C, and a small cluster, C1.
  - Since some objects in C carry the label "normal", treat all objects in C as normal.
  - Use the one-class model of this cluster to identify normal objects in outlier detection.
  - Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 as outliers.
  - Any object that does not fall into the model for C (such as a) is considered an outlier as well.
- Comments on classification-based outlier detection methods
  - Strength: outlier detection is fast.
  - Bottleneck: quality heavily depends on the availability and quality of the training set, but it is often difficult to obtain representative and high-quality training data.
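
The one-class idea discussed in the classification-based slides, describing only the normal class and declaring anything outside its boundary an outlier, can be illustrated with a deliberately simple stand-in. The sketch below is not an SVM and is not taken from the slides: it uses a sphere around the mean of the normal training points as the decision boundary, and the class name `CentroidOneClassModel`, the `slack` factor and the toy data are all hypothetical choices made for illustration.

```python
import math


class CentroidOneClassModel:
    """Toy one-class model: a sphere around the mean of the normal training points.

    This is a much simpler boundary than the SVM mentioned in the text; it only
    illustrates training on the normal class alone.
    """

    def fit(self, normal_points, slack=1.1):
        n = len(normal_points)
        dims = len(normal_points[0])
        # Center of the normal class: per-dimension mean.
        self.center = [sum(p[d] for p in normal_points) / n for d in range(dims)]
        # Radius: largest training distance, widened by an assumed slack factor.
        self.radius = slack * max(math.dist(p, self.center) for p in normal_points)
        return self

    def is_outlier(self, point):
        """A point outside the learned boundary is declared an outlier."""
        return math.dist(point, self.center) > self.radius


# Training data contains only samples labeled "normal" (hypothetical values).
normal_train = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8)]
model = CentroidOneClassModel().fit(normal_train)

print(model.is_outlier((1.0, 1.05)))  # close to the normal class
print(model.is_outlier((5.0, 5.0)))   # far from anything seen in training
```

Because no outlier samples are needed for training, the model can flag a point such as (5.0, 5.0) even though nothing similar appeared in the training set. A real application would normally replace the sphere with a learned boundary such as a one-class SVM.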
