Web-based Inference Detection
Richard Chow, Philippe Golle, Jessica Staddon
PARC
Web 2.0 Security & Privacy, 5/24/2007

Declassified FBI Report
  - Web search on: “sibling saudi magnate”

Observations
  - Most web pages with terms “sibling saudi magnate” also contain terms “osama bin laden”
  - Hence, deduce the inference: sibling saudi magnate -> osama bin laden
  - Get most valid inferences, since the Web is a proxy for all human knowledge
      - Not complete though!
  - Idea: Deduce inferences from co-occurrence of terms on the Web

Conceptual Framework
  - Consider any Boolean formula of terms, e.g. (saudi AND magnate AND sibling), (osama AND bin AND laden)
      - Evaluates to TRUE or FALSE for each Web page
      - Or, for each paragraph in each Web page
  - Strength of inference: Conditional Probability
      - Given (PRECEDENT) is TRUE, what is the probability that (CONSEQUENT) is TRUE?
      - Write: (PRECEDENT) IMPLIES (CONSEQUENT)
  - From now on, restrict to special case: conjunction of terms implying another conjunction of terms
      - Other cases may be of interest as well: (xxx) IMPLIES (Person1 OR Person2 OR ...)

Traditional Association Rules
  - Problem: Find market items that are commonly purchased together
  - Rules are of the form (A) IMPLIES (B), where A and B are sets of items
      - Legendary example: (diapers) IMPLIES (beer)
  - Confidence of a rule: Pr(B|A)
      - Given that A is purchased, how likely is B to be purchased?
  - Support of a rule: Pr(A and B)
      - What portion of all purchases contain both A and B?
  - Apriori (Agrawal et al): well-known algorithm for this problem
      - Works for given confidence and support cutoffs

Web Association Rules
  - Our problem: Find terms that are commonly found together on web pages
  - Key differences from traditional association rules
      - Web is very large and unstructured
      - Natural Language Processing (NLP) may provide additional information, since we are mining terms from text
      - More complex rules are of interest
          - Boolean formulae such as (A) IMPLIES (B OR C)
          - Linguistic patterns such as (a followed by b) IMPLIES (C)
  - Note that for privacy applications, need to find rules with very low support
      - Apriori algorithm not directly useful

Using search engines to estimate probabilities

Another Way
  - Probability is about 81/234

HIV Precision: Top 60 Inferences
  - Precision: fraction of “correct” inferences produced
  - Analyzed top precedents appearing in at least 100K documents
  - Medical expert reviewed these inferences
      - 28 were “correct”
      - 3 not necessarily connected to HIV, but were related conditions
      - 29 unknown or did not indicate HIV
  - Medical expert appropriate for medical records; note that the appropriate reviewer depends on the application
      - “Montagnier” not considered “correct”, but was discoverer of the HIV virus
      - “Kwazulu” not considered “correct”, but this province of South Africa has one of the highest HIV infection rates in the world

Inference Problem
  - More and more publicly available data
      - Web 2.0 technologies becoming common
      - “long tail of the Internet”
  - How to control the release of data?
      - What does the data reveal?
      - Need automated techniques

Scenarios:
  - Individuals
      - Anonymous blogs or postings
      - Redaction of medical records
  - Corporations
      - News releases
      - Identification of content representing risk
  - Government
      - Declassification of government documents
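
The slides define the strength of an inference as the conditional probability Pr(CONSEQUENT | PRECEDENT) and suggest estimating it with a search engine. A minimal sketch of that ratio in Python follows, under stated assumptions: `hit_count` is a hypothetical callable standing in for whatever search backend returns a page count for a query, `conjunction_query` is an invented helper, and the pairing of the counts 234 and 81 with these particular queries is an assumption made only for illustration (the slides give the ratio 81/234 without saying which queries produced it).

```python
from typing import Callable, Iterable


def conjunction_query(terms: Iterable[str]) -> str:
    """Join terms into one conjunctive query string (hypothetical format)."""
    return " ".join(terms)


def estimate_confidence(
    precedent: Iterable[str],
    consequent: Iterable[str],
    hit_count: Callable[[str], int],
) -> float:
    """Estimate Pr(consequent | precedent) as a ratio of page counts.

    hit_count is an assumed interface: it maps a query string to the
    number of pages matching every term in that query.
    """
    precedent = list(precedent)
    consequent = list(consequent)
    n_precedent = hit_count(conjunction_query(precedent))
    if n_precedent == 0:
        # No page matches the precedent, so there is no evidence either way.
        return 0.0
    n_both = hit_count(conjunction_query(precedent + consequent))
    return n_both / n_precedent


# Stand-in for a search engine. The counts are illustrative only; their
# association with these queries is an assumption, not a reported result.
fake_counts = {
    "sibling saudi magnate": 234,
    "sibling saudi magnate osama bin laden": 81,
}

confidence = estimate_confidence(
    ["sibling", "saudi", "magnate"],
    ["osama", "bin", "laden"],
    fake_counts.__getitem__,
)
print(round(confidence, 3))
```

Passing the page-count source in as a callable keeps the ratio independent of any particular search service, so the same function can be exercised with canned counts, as here, or with a real backend.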