Deduplication
CSCI 572: Information Retrieval and Search Engines
Summer 2010

Outline
- What is deduplication?
- Importance
- Challenges
- Approaches

What are web duplicates?
- The same page, referenced by different URLs
- What are the differences?
  - URL host (virtual hosts), sometimes protocol, sometimes page name, etc.

What are web duplicates?
- Near-identical pages, referenced by the same URLs
  - e.g., two Google searches for "search engines"
- What are the differences?
  - The page is within some delta% similar to the other (where delta is a large number), but may differ in, e.g., ads, counters, timestamps, etc.

Why is it important to consider duplicates?
- In search engines, URLs tell the crawlers where to go and how to navigate the information space
- Ideally, given the web's scale and complexity, we'll give priority to crawling content that we haven't already stored or seen before
  - Saves resources (on the crawler end, as well as the remote host)
  - Increases crawler politeness
  - Reduces the analysis that we'll have to do later

Why is it important to consider duplicates?
- Identification of website mirrors: copies of content used to spread load and bandwidth consumption (e.g., CPAN, Apache)
- If you identify a mirror, you can omit crawling many web pages and save crawler resources

"More Like This"
- Finding similar content to what you were looking for
- As we discussed during the lecture on search engine architecture, much of the time in search engines is spent filtering through the results; presenting similar documents can cut down on that filtering time

XML
- XML documents appear structurally very similar
  - What's the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?
- With the ability to identify similarity and reduce duplication of XML, we could:
  - Identify XML documents with similar structure
  - Identify RSS feeds that contain the same links
  - Differentiate RSS (crawl more often) from other, less frequently updated XML

Detect plagiarism
- Determine web sites and reports that plagiarize one another
  - Important for copyright laws and enforcement
- Determine similarity between source code
  - Licensing issues: open source and others

Detection of SPAM
- Identifying malicious SPAM content
  - Adult sites
  - Pharmaceutical and prescription drug SPAM
  - Malware and phishing scams
- Need to ignore this content from a crawling perspective
  - Or to "flag" it and not include it in (general) search results

Challenges
- Scalability
  - Most approaches to detecting duplicates rely on training and analytical techniques that may be computationally expensive
  - The challenge is to perform the evaluation at low cost
- What to do with the duplicates?
  - The answer isn't always to throw them out; they may be useful for study
  - The content may require indexing for later comparison in legal issues, or for "snapshot"-ing the web at a point in time, as the Internet Archive does

Challenges
- Structure versus semantics
  - Documents that are structurally dissimilar may contain the exact same content
  - Think of <em> tags versus <b> tags for emphasis in HTML
  - Need to take this into account
- Online versus offline
  - Depends on crawling strategy, but offline detection typically provides more precision at the cost of the inability to react dynamically

Approaches for Deduplication
- SIMHASH and Hamming distance
  - Treat web documents as a set of features constituting an n-dimensional vector; transform this vector into an f-bit fingerprint of a small size, e.g., 64
  - Compare fingerprints and look for a difference of at most k bits
  - Manku et al., WWW 2007 (a sketch of the idea follows below)
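Below is a minimal sketch of simhash fingerprinting plus a Hamming-distance check, assuming whitespace tokens as the features and MD5 as the per-feature hash (neither choice comes from the slides, and the function names are illustrative). Manku et al. additionally describe an index structure for finding all fingerprints within k bits at web scale, which this sketch omits.

```python
import hashlib

def simhash(text, f=64):
    # Each feature (here: a whitespace token) is hashed to f bits.
    # A length-f tally vector gains +1 for every 1-bit and -1 for every
    # 0-bit; the sign of each tally becomes one bit of the fingerprint.
    v = [0] * f
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming_distance(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")

def near_duplicates(doc1, doc2, k=3):
    # Declare near-duplicates if the fingerprints differ in at most k bits.
    return hamming_distance(simhash(doc1), simhash(doc2)) <= k

print(hamming_distance(
    simhash("the quick brown fox jumps over the lazy dog"),
    simhash("the quick brown fox jumped over the lazy dog")))
```

The key property is that, unlike a cryptographic hash, simhash maps documents with mostly shared features to fingerprints that agree in most bit positions, so near-duplicates can be detected by a cheap bit comparison rather than a full-text diff.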
Approaches for Deduplication (cont'd)
- Syntactic similarity: shingling
  - Treat web documents as contiguous subsequences of words
  - Compute the w-shingling of each document
  - Broder et al., WWW 1997 (a sketch follows below)
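The following is a minimal sketch of w-shingling with Broder's resemblance measure, the Jaccard coefficient of the two shingle sets. Whitespace tokenization and w=4 are illustrative assumptions; Broder et al. actually estimate resemblance from small min-wise hash sketches rather than comparing full shingle sets, which this omits.

```python
def shingles(text, w=4):
    # The w-shingling of a document: the set of all contiguous
    # runs of w words (tokenized naively on whitespace).
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc1, doc2, w=4):
    # Broder's resemblance: the Jaccard coefficient of the shingle sets.
    # Near-identical pages score close to 1.0; unrelated pages near 0.0.
    s1, s2 = shingles(doc1, w), shingles(doc2, w)
    if not (s1 or s2):
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

print(resemblance(
    "a rose is a rose is a rose",
    "a rose is a flower which is a rose"))
```

Because shingles are overlapping word runs, a small local edit disturbs only the w shingles that cross it, so documents that share most of their text keep a high resemblance score.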
Approaches for Deduplication
- Link structure similarity
  - Identify similarity in the linkages between web collections
  - Cho et al.

Approaches for Deduplication
- Exploiting the structure and links between physical network hosts
- Look at:
  - Language
  - Geographical connection
  - Continuations and proxies
  - Zipfian function
- Bharat et al., ICDM 2001

Wrapup
- Need deduplication for conserving resources and ensuring the quality and accuracy of the resultant search indices
- Can assist in other areas like plagiarism, SPAM detection, fraud detection, etc.
- Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline
- Techniques look at page structure/content, page link structure, or physical web node structure