Deduplication数据重复删除.ppt

Format: PPT, 16 pages
Deduplication
CSCI 572: Information Retrieval and Search Engines, Summer 2010

Outline
- What is deduplication?
- Importance
- Challenges
- Approaches

What are web duplicates?
- The same page, referenced by different URLs
- What are the differences? The URL host (virtual hosts), sometimes the protocol, sometimes the page name, etc.

What are web duplicates? (continued)
- Near-identical pages, referenced by the same URLs, e.g., two Google searches for "search engines"
- What are the differences? One page is within some delta% of the other (where delta is a large number), but may differ in, e.g., ads, counters, timestamps, etc.

Why is it important to consider duplicates?
- In search engines, URLs tell the crawlers where to go and how to navigate the information space
- Ideally, given the web's scale and complexity, we give priority to crawling content that we haven't already stored or seen before
- Saves resources (on the crawler end, as well as on the remote host)
- Increases crawler politeness
- Reduces the analysis that we'll have to do later

Why is it important to consider duplicates? (continued)
- Identification of website mirrors (copies of content used to spread load and bandwidth consumption), e.g., CPAN, Apache, etc.
- If you identify a mirror, you can omit crawling many web pages and save crawler resources

"More Like This"
- Finding similar content to what you were looking for
- As discussed in the lecture on search engine architecture, much of the time in search engines is spent filtering through the results; presenting similar documents can cut down on that filtering time

XML
- XML documents appear structurally very similar: what's the difference between RSS, RDF, OWL, XSL, XSLT, and any number of other XML documents out there?
- With the ability to identify similarity and reduce duplication of XML, we could identify XML documents with similar structure
- RSS feeds that contain the same links
- Differentiate RSS (crawl more often) from other, less frequently updated XML

Detect plagiarism
- Determine web sites and reports that plagiarize one another; important for copyright law and its enforcement
- Determine similarity between source code: licensing issues, open source, other

Detection of SPAM
- Identifying malicious SPAM content: adult sites, pharmaceutical and prescription-drug SPAM, malware and phishing scams
- Need to ignore this content from a crawling perspective, or to "flag" it and not include it in (general) search results

Challenges
- Scalability: most approaches to detecting duplicates rely on training and analytical techniques that may be computationally expensive; the challenge is to perform the evaluation at low cost
- What to do with the duplicates? The answer isn't always to throw them out: they may be useful for study
- The content may require indexing for later comparison in legal matters, or for "snapshotting" the web at a point in time, i.e., the Internet Archive

Challenges (continued)
- Structure versus semantics: documents that are structurally dissimilar may contain the exact same content (think of different HTML tags used to achieve the same emphasis); need to take this into account
- Online versus offline: depends on the crawling strategy, but offline processing typically provides more precision at the cost of the ability to react dynamically

Approaches for deduplication
- SIMHASH and Hamming distance: treat a web document as a set of features constituting an n-dimensional vector; transform this vector into an f-bit fingerprint of small size (e.g., 64); compare fingerprints and look for a difference of at most k bits (Manku et al., WWW 2007)
- Syntactic similarity (shingling): treat a web document as a continuous sequence of words and compute its w-shingling (Broder et al., WWW 1997)

Approaches for deduplication (continued)
- Link structure similarity: identify similarity in the linkages between web collections (Cho et al.)
- Exploiting the structure and links between physical network hosts; look at language, geographical connection, continuations and proxies, and a Zipfian function (Bharat et al., ICDM 2001)

Wrap-up
- Deduplication is needed to conserve resources and to ensure the quality and accuracy of the resulting search indices
- It can assist in other areas such as plagiarism detection, SPAM detection, fraud detection, etc.
- Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline
- Techniques look at page structure/content, page link structure, or physical web-node structure
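The "same page, different URLs" case above is usually handled by canonicalizing URLs before comparing content. A minimal Python sketch, assuming only a few normalization rules (lowercasing scheme and host, dropping default ports and fragments, collapsing an empty path); the function name is illustrative, and real crawlers apply many more, often site-specific, rules:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize a URL so that trivially different references to the same
    page compare equal: lowercase scheme and host, drop default ports,
    strip fragments, and collapse an empty path to '/'."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep only non-default ports in the netloc.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = "%s:%d" % (host, port)
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With these rules, `HTTP://Example.COM:80/index.html#top` and `http://example.com/index.html` canonicalize to the same string and can be treated as one crawl target.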
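To make the SIMHASH step concrete, here is a small illustrative sketch, not Manku et al.'s production implementation: each feature is hashed to f bits, a per-bit tally is accumulated (+1 if the bit is set, -1 otherwise), and the sign of each tally gives the fingerprint bit; two documents are near-duplicates when their fingerprints differ in at most k bits. The function names and the choice of MD5 as the feature hash are assumptions made for the example:

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash fingerprint from a list of string features."""
    v = [0] * f
    for feat in features:
        # Hash each feature to f bits (MD5 chosen only for illustration).
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(f):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicates(doc_a, doc_b, k=3):
    """Flag two documents as near-duplicates if their simhash
    fingerprints differ in at most k bits."""
    return hamming_distance(simhash(doc_a.split()), simhash(doc_b.split())) <= k
```

The point of the scheme is that similar feature sets drive the per-bit tallies in the same direction, so near-identical pages land within a small Hamming distance of each other, which can be checked far more cheaply than comparing full documents.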
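The shingling approach of Broder et al. can likewise be sketched: take every contiguous run of w words as a shingle, and measure resemblance as the Jaccard similarity of the two shingle sets. (Broder's full method then estimates this similarity cheaply with min-wise hashing, which is omitted here; function names are illustrative.)

```python
def shingles(text, w=4):
    """Return the set of w-shingles (contiguous w-word sequences) of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

A resemblance near 1.0 indicates near-duplicate pages even when they differ in small insertions such as ads, counters, or timestamps, since most shingles are shared.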