Python Machine Learning: Kaggle Case Studies in Practice (Python机器学习Kaggle案例实战.pdf)

Uploaded by: 曲****   Document ID: 228928   Uploaded: 2023-03-18   Format: PDF   Pages: 31   Size: 1.51 MB
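Python Machine Learning Kaggle Case Studies in Practice, Week 1
DATAGURU Professional Data Analysis Community
Instructors: Huang Zhihong (黄志洪), He Cuiyi (何翠仪)

Legal notice
This video and these slides are teaching materials for the Dataguru (炼数成金) online course. All materials may be used only within the course and may not be distributed outside it; violators may be held legally and financially liable. For course details, visit the Dataguru training website: http:

About Kaggle
https:
Home for Data Science. Kaggle helps you learn, work, and play. Create an account or host a competition.
- Competitions: Climb the world's most elite machine learning leaderboards. Want to host a competition?
- Datasets: Explore and analyze a collection of high quality public datasets.
- Kernels: Run code in the cloud and receive community feedback on your work.

Case background
Crowdflower Search Results Relevance https:
Small online businesses currently have no good way to evaluate the performance of their search algorithms, which makes it hard for them to provide an excellent customer experience. The goal of this competition is to create an open-source model that can be used to measure the relevance of search results. Such a model helps small business owners match the experience offered by better-resourced competitors, and it also gives more established businesses a model to test against. Given the queries and the resulting product descriptions from leading e-commerce sites, the competition asks you to evaluate the accuracy of their search algorithms.

Dataset
The competition dataset was built from query-result pairs enriched on the CrowdFlower platform. CrowdFlower sponsors the competition as an investment in the open data science community. Datasets collected, cleaned, and labeled by CrowdFlower can make your supervised machine learning dreams come true.

To evaluate search relevance, CrowdFlower had its crowd evaluate searches on a handful of e-commerce websites. A total of 261 search terms were generated, and CrowdFlower put together a list of products and their corresponding search terms. Each rater in the crowd was asked to score a product against its search term as 1, 2, 3, or 4, where 4 means the item completely satisfies the search query and 1 means the item does not match the search term.

The challenge is to predict the relevance score of each product for its search query from the product title and product description. To make sure your algorithm is robust enough to handle the noisy HTML snippets found in the real world, the data in the product description field is raw and contains information that is irrelevant to the product.

To discourage hand-labeling, CrowdFlower also added extra data to the test set that was not labeled by the crowd. This data is ignored when the score is calculated.

Dataset download: https://www.kaggle.com/c/crowdflower-search-relevance/data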
train.csv (training set)
- id: product id
- query: search term
- product_title: product title
- product_description: full text of the product description (some entries contain HTML tags)
- median_relevance: median of the relevance scores given by three raters; an integer from 1 to 4
- relevance_variance: variance of the raters' relevance scores

test.csv (test set)
- id: product id
- query: search term
- product_title: product title
- product_description: full text of the product description (some entries contain HTML tags)

Target variable: median_relevance

Training data sample

id: 1
query: bridal shower decorations
product_title: Accent Pillow with Heart Design - Red/Black
product_description: Red satin accent pillow embroidered with a heart in black thread. 8" x 8"
median_relevance: 1
relevance_variance: 0

id: 2
query: led Christmas lights
product_title: Set of 10 Battery Operated Multi LED Train Christmas Lights - Clear Wire
product_description: Set of 10 Battery Operated Train Christmas Lights. Item #X124210. Features: Color: multi-color bulbs with matching train light covers / clear wire. Multicolor consists of red, green, blue and yellow bulbs. Number of bulbs on string: 10. Bulb size: micro LED. Spacing between bulbs: 6 inches. Lighted length: 4.5 feet. Total length: 5.5 feet. 12 inch lead cord. Additional product features: LED lights use 90% less energy. Cool to the touch. If one bulb burns out, the rest will stay lit. Lights are equipped with Lamp Lock feature, which makes them replaceable, interchangeable and keeps them from falling out. Requires 3 "AA" batteries (not included). Convenient on/off/timer switch located on battery pack. Timer function on battery pack allows for 6 hours on and 18 hours off. Cannot connect multiple sets together. UL listed for indoor use only. Train dimensions: 1.5"H x 18"W x 5"D. Material(s): plastic/wire/acrylic
median_relevance: 4
relevance_variance: 0

id: 4
query: projector
product_title: ViewSonic Pro820C DLP Multimedia Projector
product_description: (empty)
median_relevance: 4
relevance_variance: 0.471

id: 5
query: wine rack
product_title: Concept Housewares WR-44526 Solid-Wood Ceiling/Wall-Mount Wine Rack, Charcoal Grey, 6 Bottle
product_description: Like a silent and sturdy tree, the Southern Enterprises Bird and Branch Coat Rack is an eyecatching addition to your home décor. This tree themed coat rack features strong branches with pinecone accents and a small bird perched at the top to give it a whimsical and welcoming appearance while still making it sturdy enough to hold your coats, hats, umbrellas and more. Whether it serves as a coat rack, a hat rack or a combination of the two, it will be a great space saver that gets appreciated for its graceful appearance. Number of Hooks: 10. Frame Material: Metal. Hardware Material: Metal
median_relevance: 4
relevance_variance: 0

Test data sample

id: 3
query: electric griddle
product_title: Star-Max 48 in Electric Griddle
product_description: (empty)

id: 6
query: Phillips coffee maker
product_title: Philips SENSEO HD7810 WHITE Single Serve Pod Coffee Maker Espresso Brew Machine
product_description: (empty)

id: 9
query: san francisco 49ers
product_title: 2013 San Francisco 49ers Clock
product_description: A 2013 San Francisco 49ers clock is the ultimate way for you to show off your team spirit. This clock would be a great conversation piece for any office or bedroom and the licensed photo features some of the teams best players

id: 11
query: aveeno shampoo
product_title: AVEENO 10.5FLOZ NRSH SHINE SH
product_description: Water, Ammonium Lauryl Sulfate, Dimethicone, Sodium Cumenesulfonate, Cocamide MEA, Cetyl Alcohol, Acrylates Copolymer, Cocamidopropyl Betaine, Fragrance, Phenoxyethanol, Caprylyl Glycol, Glycol Distearate, Tetrasodium EDTA, Guar Hydroxypropyltrimonium Chloride, Triticum Vulgare (Wheat) Germ Oil, Triticum Vulgare (Wheat) Gluten, Orbignya Speciosa Kernel Oil, Glycerin, Polyquaternium-10, Astrocaryum Murumuru Seed Butter, Mauritia Flexuosa Fruit Oil, Mica, Titanium Dioxide. May Also Contain: Citric Acid, Sodium Hydroxide.

id: 12
query: flea and tick control for dogs
product_title: Merial Frontline Plus Flea and Tick Control for Dogs and Puppies 45-88 pound
product_description: (empty)

id: 14
query: table clock
product_title: Classy Wood Table Clock
product_description: Watch out for this antique wood table clock which will surely give a diverse appeal to your home ambience. Made of quality wood material this table clock is durable and easy to maintain. This wood table clock is in round shape with the numbers in roman form. It has a small designer pattern around the borders. The strong base will help in keeping this wood table clock firm and steady. Keep this wood table clock in your living room, bedroom or study room to add a hint of vintage feel to the decor. It goes well with both modern and traditionally themed houses. This vintage wood table clock can be gifted to your near and dear ones who love similar kind of decor pieces. Hurry up and get this amazingly designed wood table

Submission format
The submission file has two columns, id and prediction. The sample file lists ids 3, 6, 9, 11, 12, 14, 15, 16, 18, 19, 21, 22, 23, 24, 25, 26, 27, 29, 30, 33, 34, 36, 37, 38, each with prediction 3.

Evaluation metric
https:
Submissions are scored with the quadratic weighted kappa. Let N be the number of possible ratings, O the N x N matrix in which O_{i,j} counts the samples with actual rating i and predicted rating j, and E the N x N matrix of expected counts (the outer product of the actual and predicted rating histograms, normalized so that E has the same total as O). Then

$w_{i,j} = \frac{(i-j)^2}{(N-1)^2}$

$\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}$

Ensemble learning
Ensemble learning is currently one of the most popular directions in machine learning. Put simply, it means using multiple classifiers to make predictions on a dataset so that the combined classifier generalizes better than its members.

Three common frameworks: bagging, boosting, stacking.
[Figure: bagging (training set)]
[Figure: boosting]
[Figure: stacking (training set)]

Ensemble learning: bias and variance
[Figure: high variance vs. low variance]
Let the ensemble be $F = \sum_{i=1}^{m} \gamma_i f_i$, built from m base models f_i with weights \gamma_i. In general

$E(F) = E\left(\sum_{i=1}^{m} \gamma_i f_i\right) = \sum_{i=1}^{m} \gamma_i E(f_i)$

$Var(F) = Var\left(\sum_{i=1}^{m} \gamma_i f_i\right) = \sum_{i=1}^{m} \gamma_i^2 Var(f_i) + \sum_{i \neq j} \gamma_i \gamma_j Cov(f_i, f_j)$

If every base model has the same weight \gamma, mean \mu, and variance \sigma^2, and any two base models have correlation \rho, this simplifies to

$E(F) = \gamma \sum_{i=1}^{m} E(f_i) = \gamma m \mu$

$Var(F) = m^2 \gamma^2 \sigma^2 \rho + m \gamma^2 \sigma^2 (1 - \rho)$

Bias and variance of bagging
With \gamma = 1/m:

$E(F) = \gamma \sum_{i=1}^{m} E(f_i) = \frac{1}{m} \cdot m \cdot \mu = \mu$

$Var(F) = m^2 \gamma^2 \sigma^2 \rho + m \gamma^2 \sigma^2 (1 - \rho) = m^2 \cdot \frac{1}{m^2} \cdot \sigma^2 \rho + m \cdot \frac{1}{m^2} \cdot \sigma^2 (1 - \rho) = \sigma^2 \rho + \frac{\sigma^2 (1 - \rho)}{m}$

The expectation of the ensemble equals that of a single base model, while the variance shrinks toward \sigma^2 \rho as m grows.

Bias and variance of boosting
The base models are strongly correlated, so \rho is approximately 1:

$E(F) = \gamma \sum_{i=1}^{m} E(f_i)$

$Var(F) = m^2 \gamma^2 \sigma^2 \rho + m \gamma^2 \sigma^2 (1 - \rho) = m^2 \gamma^2 \sigma^2 \cdot 1 + m \gamma^2 \sigma^2 (1 - 1) = m^2 \gamma^2 \sigma^2$

Base models
- XGBoost Linear Booster
- XGBoost Tree Booster
- GradientBoostingRegressor
- ExtraTreesRegressor
- RandomForestRegressor
- SVR
- Ridge
- Keras NN
- RGF Regression

Table 7: Model Library
Package   Model                                       Feature    Weighting
XGBoost   gblinear (MSE, COCR, Softmax, Softkappa)    High/Low   Yes
XGBoost   gbtree (MSE, COCR, Softmax, Softkappa)      Low        Yes
Sklearn   GradientBoostingRegressor                   Low        Yes
Sklearn   ExtraTreesRegressor                         Low        Yes
Sklearn   RandomForestRegressor                       Low        Yes
Sklearn   SVR                                         Low        Yes
Sklearn   Ridge                                       High/Low   Yes
Sklearn   Lasso                                       High/Low   No
Sklearn   LogisticRegression                          High/Low   No
Keras     NN Regression                               Low        No
RGF       Regression                                  Low        No

The winning solution
http:/
[Figure: solution pipeline]
- Preprocessing: Dropping HTML tags, Word Replacement, Stemming
- Feature Extraction: Counting Features, Distance Features, TF-IDF Features, Query Id
- Models: XGBoost Linear Booster, XGBoost Tree Booster, GradientBoostingRegressor, ExtraTreesRegressor, RandomForestRegressor, SVR, Ridge, Keras NN, RGF Regression
- Ensemble Selection
- Output: Submission

Data exploration

Preprocessing
- Dropping HTML tags: extract the text from the HTML with the bs4 library
- Word replacement: spelling correction, synonym replacement, other word replacements
- Stemming

Table 1: Spelling Correction
misspelling                  correction
refrigirator                 refrigerator
rechargable batteries        rechargeable batteries
adidas [...]                 adidas fragrance
assassinss creed             assassins creed
[...]                        rachael ray cookware
[...] k cups                 donut shop k cups
[...] hardisk 500gb          external hardisk 500 gb

Table 2: Synonym Replacement
synonyms                                         replacement
child, kid                                       kid
bicycle, bike                                    bike
refrigerator, fridge, freezer                    fridge
fragrance, perfume, cologne, eau de toilette     perfume

Table 3: Other Replacement
original            replacement
nutri system        nutrisystem
soda stream         sodastream
playstation         ps
ps 2                ps2
ps 3                ps3
ps 4                ps4
coffeemaker         coffee maker
k-cup               k cup
4-ounce             4 ounce
8-ounce             8 ounce
12-ounce            12 ounce
ounce               oz
hardisk             hard drive
hard disk           hard drive
harley-davidson     harley davidson
harleydavidson      harley davidson
doctor who          dr who
levi strauss        levis
mac book            macbook
micro-usb           micro usb
video games         videogames
game pad            gamepad
western digital     wd

Feature extraction: counting features
In what follows, q_i, t_i, and d_i denote the query, product title, and product description of sample i, and ngram(x, n) denotes the n-grams of text x.

Basic counting features
- Count of n-gram: count of ngram(q_i, n), ngram(t_i, n), and ngram(d_i, n).
- Count & Ratio of Digit: count and ratio of digits in q_i, t_i, and d_i.
- Count & Ratio of Unique n-gram: count and ratio of unique ngram(q_i, n), ngram(t_i, n), and ngram(d_i, n).
- Description Missing Indicator: binary indicator of whether d_i is empty.
- Count & Ratio of a's n-gram in b's n-gram: computed for all combinations of a \in {q_i, t_i, d_i} and b \in {q_i, t_i, d_i} with a \neq b.
- Statistics of Positions of a's n-gram in b's n-gram: for the intersecting n-grams, their positions were recorded and the following statistics computed as features: minimum value (0% quantile), median value (50% quantile), maximum value (100% quantile), mean value, standard deviation (std).
- Statistics of Normalized Positions of a's n-gram in b's n-gram: similar to the features above, but computed using positions normalized by the length of a.

Feature extraction: distance features
Two set distances are used, the Jaccard coefficient and the Dice distance:

$JaccardCoef(A, B) = \frac{|A \cap B|}{|A \cup B|}$

$DiceDist(A, B) = \frac{2 |A \cap B|}{|A| + |B|}$

Basic distance features, with D \in {JaccardCoef, DiceDist}:
- D(ngram(q_i, n), ngram(t_i, n))
- D(ngram(q_i, n), ngram(d_i, n))
- D(ngram(t_i, n), ngram(d_i, n))

Statistical distance features
1. Group the samples by median_relevance and by (query, median_relevance):

$G_r = \{ i \mid r_i = r \}$  (3)

$G_{q,r} = \{ i \mid q_i = q, r_i = r \}$  (4)

where q ranges over all the unique queries and r \in {1, 2, 3, 4}.

2. Compute the distance between each sample and all the samples in each median_relevance level. The current sample is excluded when computing the distance. For G_{q,r}, only the group with the same query as the current sample is considered.

$S_{i,r,n} = \{ D(ngram(t_i, n), ngram(t_j, n)) \mid j \in G_r, j \neq i \}$  (5)

$SQ_{i,r,n} = \{ D(ngram(t_i, n), ngram(t_j, n)) \mid j \in G_{q_i,r}, j \neq i \}$  (6)

where r \in {1, 2, 3, 4} and D(·,·) \in {JaccardCoef(·,·), DiceDist(·,·)}.

3. For S_{i,r,n} and SQ_{i,r,n} respectively, compute statistics such as:
- minimum value (0% quantile)
- median value (50% quantile)
- maximum value (100% quantile)
- mean value
- standard deviation (std)
- more can be added, e.g. moment features and other quantiles

Feature extraction: TF-IDF features
Basic TF-IDF features
- TF-IDF Features
- Basic Cosine Similarity
- Statistical Cosine Similarity
- SVD Reduced Features
- Basic Cosine Similarity Based on SVD Reduced Features
- Statistical Cosine Similarity Based on SVD Reduced Features

Cooccurrence terms are formed from these pairs:
- query unigram/bigram and product_title unigram/bigram
- query unigram/bigram and product_description unigram/bigram
- query id (qid) and product_title unigram/bigram
- query id (qid) and product_description unigram/bigram

Example: the cooccurrence terms for query unigram and product_title unigram are
silver fremada, silver sterling, silver silver, silver freeform, silver necklace, necklace fremada, necklace sterling, necklace silver, necklace freeform, necklace necklace
and the cooccurrence terms for query bigram and product_title unigram are
silver necklace fremada, silver necklace sterling, silver necklace silver, silver necklace freeform, silver necklace necklace

Feature extraction: other features
One-hot encoding of the query.
One-hot encoding, also called one-bit-effective encoding, uses an N-bit status register to encode N states. Each state has its own register bit, and at any time exactly one bit is set.
Natural state codes: 000, 001, 010, 011, 100, 101
One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000

Code
https: CrowdFlower
- Feature extraction
- Generate the best single model
- Generate the model library
- Produce the final predictions through ensemble selection
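The quadratic weighted kappa used as the evaluation metric can be sketched in plain Python as follows. This is a minimal illustration of the standard definition, not the competition's own scoring code; the function name `quadratic_weighted_kappa` and its parameters are assumptions made for this example.

```python
def quadratic_weighted_kappa(actual, predicted, min_rating=1, max_rating=4):
    """Quadratic weighted kappa between two equal-length lists of integer ratings.

    Hypothetical helper for illustration; ratings must lie in
    [min_rating, max_rating] and max_rating must exceed min_rating.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    total = len(actual)

    # Observed matrix O: O[i][j] counts samples with actual rating i, predicted rating j.
    observed = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        if not (min_rating <= a <= max_rating and min_rating <= p <= max_rating):
            raise ValueError("rating out of range")
        observed[a - min_rating][p - min_rating] += 1

    # Rating histograms of the actual and predicted labels.
    hist_actual = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count: outer product of histograms, normalized to the same total as O.
            expected = hist_actual[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters used one identical rating throughout: treat as perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator
```

Perfect agreement gives 1, a constant prediction gives 0 (no better than chance), and a complete reversal of the ratings gives a negative score.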
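The preprocessing steps (dropping HTML tags, then word replacement) might look roughly like the sketch below. The slides use the bs4 library for tag removal; to stay dependency-free this example swaps in the standard library's `html.parser` instead, and the replacement dictionary is only a small subset of Tables 2 and 3. The names `strip_html`, `replace_words`, `clean_text`, and `REPLACEMENTS` are hypothetical, and stemming is not shown.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Illustrative subset of the synonym and "other" replacement tables.
REPLACEMENTS = {
    "child": "kid",
    "bicycle": "bike",
    "refrigerator": "fridge",
    "freezer": "fridge",
    "fragrance": "perfume",
    "cologne": "perfume",
    "eau de toilette": "perfume",
    "nutri system": "nutrisystem",
    "soda stream": "sodastream",
    "coffeemaker": "coffee maker",
    "k-cup": "k cup",
    "hardisk": "hard drive",
    "hard disk": "hard drive",
    "doctor who": "dr who",
    "mac book": "macbook",
    "micro-usb": "micro usb",
    "western digital": "wd",
}

# Longer keys first so multi-word phrases win over any shorter overlapping key.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")(?!\w)"
)


def replace_words(text):
    """Lowercase the text and apply every replacement in a single pass."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text.lower())


def clean_text(raw):
    """Drop HTML tags, then normalize words."""
    return replace_words(strip_html(raw))
```

Applying all replacements in one regex pass avoids chained rewrites, where the output of one rule would be picked up again by another rule.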
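The basic distance features (Jaccard coefficient and Dice distance between the n-grams of query, title, and description) can be sketched as below. This is an illustration under simple assumptions: whitespace tokenization, n-grams treated as sets, and an empty result scored as 0. The function names and feature keys are hypothetical, and the sample title is inferred from the cooccurrence example rather than quoted from the data.

```python
def ngrams(text, n):
    """Return the word n-grams of a text, joined with underscores."""
    tokens = text.lower().split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_coef(a, b):
    """|A intersect B| / |A union B| over the sets of items in a and b."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_dist(a, b):
    """2 |A intersect B| / (|A| + |B|) over the sets of items in a and b."""
    a, b = set(a), set(b)
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0


def basic_distance_features(query, title, description, max_n=2):
    """Distance features for the pairs (q, t), (q, d), (t, d) and n = 1..max_n."""
    fields = {"q": query, "t": title, "d": description}
    pairs = [("q", "t"), ("q", "d"), ("t", "d")]
    features = {}
    for n in range(1, max_n + 1):
        grams = {name: ngrams(text, n) for name, text in fields.items()}
        for a, b in pairs:
            features[f"jaccard_{a}_{b}_{n}gram"] = jaccard_coef(grams[a], grams[b])
            features[f"dice_{a}_{b}_{n}gram"] = dice_dist(grams[a], grams[b])
    return features


if __name__ == "__main__":
    feats = basic_distance_features(
        "silver necklace",
        "fremada sterling silver freeform necklace",
        "",
    )
    for name, value in sorted(feats.items()):
        print(name, round(value, 4))
```

For the sample above, the query and title share both query unigrams but no bigram, so the unigram features are positive while the bigram features are zero.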