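Cross-Media Retrieval and Analysis

What is cross-media? The application-platform view
Television, computer, mobile phone, newspaper, iPad.

What is cross-media? The retrieval-research view
Searching text with text, images with images, images with text, and video with text.

What is cross-media?
In January 2010, the Nature paper "2020 Vision" pointed out that text, images, speech, video and their interaction attributes will be tightly mixed together, that is, "cross-media".
In February 2011, Science published the special issue "Dealing with Data": the organization and use of data reflect cross-media computing.
Trend: research is moving from "multimedia" toward "cross-media".

What is cross-media?
The cross-media property means that content crossing and semantic association exist among multimedia data, and between user interaction and multimedia data.
Wu Fei, Zhuang Yueting. Internet cross-media analysis and retrieval: theory and algorithms. Journal of Computer-Aided Design & Computer Graphics, Vol. 22, No. 1, pp. 1-9, 2010.

Main research areas of cross-media
Cross-media retrieval: the user submits one type of multimedia object as a query example, and the system automatically finds semantically similar multimedia objects of other types.
Cross-media reasoning: moving from one type of multimedia data to another type through problem solving (for example, OCR).
Cross-media storage: existing retrieval technology for massive data mainly targets text, as in search engines such as Google and Baidu. Cross-media storage studies efficient compression, indexing and sharding methods, as well as personalized indexing of user behavior.
Example on the slide: "stormy seas" (惊涛骇浪).

Challenges of cross-media analysis (from Fei Wu)
[Figure: Audio, Video, Webpage; Correlated multi-modal Data; Shared space; example event: Japan Earthquake]
How to bridge both the semantic gap and the heterogeneity gap?

The content gap in cross-media
[Figure: visual feature space, auditory feature space, and high-level semantic space (explosion, ocean, sky, bird); labels: semantic gap, content gap]

Subspace mapping algorithms based on linear transformation
[Figure: visual feature space and auditory feature space, projection, subspace]

Heterogeneous Metric Learning with Joint Graph Regularization for Cross-Media Retrieval
Xiaohua Zhai, Yuxin Peng and Jianguo Xiao
Institute of Computer Science & Technology, Peking University
AAAI 2013

Motivation
Existing metric learning methods were designed primarily for single-media data and cannot be applied directly to cross-media data.
The goal is to make full use of the structure information of the whole heterogeneous spaces.

Joint Graph Regularized Heterogeneous Metric

Heterogeneous metric learning
Two sets of heterogeneous pairwise constraints are given: S is the set of similarity constraints and D is the set of dissimilarity constraints.
Each pairwise constraint (x_i, y_j) indicates whether two heterogeneous media objects x_i and y_j are relevant or irrelevant, as inferred from the category label.
The authors propose to learn linear transformation matrices U and V that map the heterogeneous media data to a common output space.
The distance between x_i and y_j is then defined on their projections in this common space.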
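As a rough illustration of measuring the distance between two heterogeneous objects after mapping them with U and V, the sketch below assumes the squared Euclidean distance between the two projections, ||U^T x - V^T y||^2. That form, the helper names `project` and `hetero_distance`, and the toy matrices are assumptions made for illustration; the slide's own formula is not available here.

```python
# Sketch only: the squared-Euclidean form is an assumption,
# not a formula recovered from the slides.

def project(M, v):
    """Return M^T v for a d x c matrix M (list of rows) and a length-d vector v."""
    d, c = len(M), len(M[0])
    return [sum(M[r][k] * v[r] for r in range(d)) for k in range(c)]


def hetero_distance(U, V, x, y):
    """Squared Euclidean distance between the projections U^T x and V^T y."""
    px, py = project(U, x), project(V, y)
    return sum((a - b) ** 2 for a, b in zip(px, py))


# Toy data (hypothetical): 3-d "image" features, 2-d "text" features,
# both mapped into a 2-d common space.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # 3 x 2
V = [[0.5, 0.0], [0.0, 2.0]]              # 2 x 2
x = [1.0, 2.0, 3.0]
y = [2.0, 1.0]

print(hetero_distance(U, V, x, y))
```

Because x and y live in spaces of different dimension, they can only be compared after projection; U and V are what make the two feature spaces commensurable.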

Objective function
The general regularization framework for heterogeneous distance metric learning combines three terms. f(U, V) is the loss function defined on the sets of similarity and dissimilarity constraints S and D. g(U, V) and r(U, V) are regularizers defined on the target parameter matrices U and V. Balancing parameters weight the terms against one another.

Loss function
Minimizing the loss function minimizes the distances between media objects under similarity constraints and maximizes the distances between media objects under dissimilarity constraints.
The elements of Z are normalized column by column so that each column sums to zero, which balances the influence of the similarity constraints and the dissimilarity constraints.

Scale regularization
r(U, V) controls the scale of the parameter matrices and reduces overfitting.

Joint graph regularization
A joint undirected graph G = (V, W) is defined on the dataset (here V is the vertex set, not the transformation matrix V).
Each element w_ij of the (m+n) x (m+n) similarity matrix W = {w_ij} is the similarity between the i-th media object and the j-th media object.
The symmetric similarity matrix is constructed from label information.
w_ii = 0 is set for 1 <= i <= m+n to avoid self-reinforcement. The normalized graph Laplacian is defined as L = I - D^(-1/2) W D^(-1/2), where I is the (m+n) x (m+n) identity matrix and D is the (m+n) x (m+n) diagonal matrix with d_ii = sum_j w_ij.
L is symmetric and positive semidefinite, with eigenvalues in the interval [0, 2].
O denotes all media objects in the learned metric space, and L denotes the normalized graph Laplacian.
The formulation of g(U, V) is built from O and L. Minimizing g(U, V) encourages smoothness of the mapping over the joint data graph, which is constructed from the initial label information.
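The joint graph construction can be sketched in a few lines. The normalized Laplacian L = I - D^(-1/2) W D^(-1/2) is the standard definition; the label rule (w_ij = 1 when two objects share a label), the smoothness form tr(O L O^T), and all function names are assumptions for illustration, since the slide formulas are not available here.

```python
import math


def label_similarity(labels):
    """Assumed rule: w_ij = 1 if objects i and j share a label (i != j), else 0."""
    n = len(labels)
    return [[1.0 if i != j and labels[i] == labels[j] else 0.0
             for j in range(n)] for i in range(n)]


def normalized_laplacian(W):
    """L = I - D^(-1/2) W D^(-1/2), with d_ii = sum_j w_ij."""
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]


def smoothness(O, L):
    """Graph smoothness tr(O L O^T); here O is stored as one projected point per row."""
    n, c = len(O), len(O[0])
    return sum(O[i][k] * L[i][j] * O[j][k]
               for i in range(n) for j in range(n) for k in range(c))


# Four media objects from both modalities, two semantic categories (toy data).
labels = ["a", "a", "b", "b"]
W = label_similarity(labels)
L = normalized_laplacian(W)

O_same = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]   # same label, same point
O_mixed = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # same label, far apart

print(smoothness(O_same, L), smoothness(O_mixed, L))
```

The smoothness value is smaller when objects that share a label land close together in the learned space, which is the behavior the regularizer is meant to reward.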

Iterative optimization
Orthogonal transformation matrices U and V are obtained by minimizing an objective function in which X and Y represent two sets of coupled media objects from different media with the same labels. U and V define two orthogonal transformation spaces in which the media objects in X and Y are projected as close to each other as possible.
Maximizing tr(X^T U V^T Y) minimizes this function, and the solution follows from a singular value decomposition.

Fix V and update U
Differentiate Q(U, V) with respect to U and V and set the derivatives to zero, respectively, to obtain analytical solutions for U and V.
The updates alternate between U and V for several iterations to find a locally optimal solution. The iteration continues until the cross-validation performance on the training set decreases. In practice, only a few rounds are needed.

Experiments

Datasets
Wikipedia: 2866 image-text pairs, each labeled with one of 10 semantic categories. The dataset is randomly split into a training set of 2173 documents and a test set of 693 documents.
XMedia: 5000 texts, 5000 images, 1000 audio clips, 500 videos and 500 3D models. The dataset is randomly split into a training set of 9600 media objects and a test set of 2400 media objects.

Features
Images: bag-of-words model. Each image is represented as a histogram over a 128-codeword SIFT codebook.
Texts: each text is represented by a 10-topic latent Dirichlet allocation (LDA) model.
Audio: each audio clip is represented by 29-dimensional MFCC features.
Videos: each video clip is segmented into shots, and a 128-dimensional BoW histogram is extracted for each video keyframe. The final video similarity is the average of the similarities of its keyframes.
3D models: each 3D model is first represented as a concatenated 4700-dimensional vector of a set of Light-Field descriptors. Then the concatenated
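vector is reduced to 128 dimensions with Principal Component Analysis (PCA).

Baseline methods and evaluation metrics
CCA (canonical correlation analysis): learns the subspace that maximizes the correlation between two sets of heterogeneous data.
CFA (cross-modal factor analysis): minimizes the Frobenius norm between pairwise data in the transformed domain.
CCA+SMN: described as the current state of the art, since it considers not only correlation analysis but also semantic abstraction for different modalities.

[Slide: MAP scores]
[Slide: Precision-Recall curves]

Unified representation of multimedia data
The representation of multimedia data is the data structure used to represent a multimedia sample. For example, a four-tuple can represent an image in a web page, or low-level visual features can be extracted from an image and assembled into a multi-dimensional vector that represents the image in the database.
Cross-media retrieval belongs to content-based multimedia retrieval. It extends the retrieval targets from a single type of multimedia data to several different types and supports flexible crossing between them.
The performance of cross-media retrieval depends largely on the similarity-matching algorithm, and similarity matching in turn rests on the representation adopted for the different types of multimedia data. The design of the data representation model is therefore fundamental and important.

Unified representation of multimedia data: mapping algorithm
Given sets of unlabeled image and audio data as the training set, known to cover Z semantic categories, the mapping algorithm is as follows.
Step 1: Clustering
1) For each semantic category Z_i, extract the low-level content features of the images and audio it contains, and build the corresponding feature matrices S_I and S_A;
2) For each semantic category, randomly select m image examples I_i and label them semantically;
3) Compute the cluster centroid ICr_i of I_i in the low-level feature space;
4) Using ICr_i as the initial condition, run k-means clustering on all image data in the database;
5) Images that fall into the same cluster are given the same semantic label as I_i;
6) Repeat 1-4 for the audio dataset.

Correlation-preserving mapping
1) Analyze the canonical correlation between images and audio on their low-level content features, that is, compute the subspace basis vectors W_x and W_y corresponding to S_I and S_A;
2) Compute the vector representations of the visual and auditory feature vectors after mapping them into the subspace.

Cross-media correlation reasoning in the Web environment
A concrete application environment such as the Web often carries specific data features that hold more direct semantic information than the content features of the multimedia data itself. These features can assist the content features in cross-media retrieval and improve retrieval efficiency. For example, web links can serve as such an auxiliary feature.

Cross-media correlation graph
A graph model is a common way to express relationships among data. It can represent images in the Web environment together with the various features related to them. This representation not only describes the connections among data clearly but also helps discover complementary information among data.
Multiple types of multimedia data are connected by complex relationships, which fall into two main kinds: within a modality (intra-media correlation) and between modalities (cross-media correlation).

Link relationship analysis
Let V, I and A denote the video, image and audio datasets, and let m, n and k be the numbers of samples in V, I and A. Let x^V_i, x^I_i and x^A_i denote the feature vectors of the i-th video, the i-th image and the i-th audio object in the database.
Using the following two heuristic rules, the link relationships between the web pages that contain the multimedia data can be used to measure the correlation between different types of multimedia data (cross-media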
distance):
Rule 1: If two media objects a and b belong to the same web page, then a and b are semantically similar.
Rule 2: If web page A points to pages B and C, then the multimedia objects contained in B and those contained in C are semantically similar.

Based on these heuristic rules, cross-media correlation matrices L_VI, L_IA and L_AV are built for video-image, image-audio and audio-video. Taking L_IA as an example, its element r_ij is the correlation value between a pair of image and audio objects, computed as follows:
Input: image and audio data obtained from web pages
Output: cross-media correlation matrix L_IA
Steps 1 to 4: [not recoverable from the slide text]
5. Construct a symmetric matrix L_IA whose cell l_ij is the normalized value of r_ij.

Global correlation reasoning based on the graph model
[Figure: graph over image nodes Ia, Ib, Ic, Id and audio nodes Aa, Ab, Ac]

Recent research hotspots (from Fei Wu)
Cross-media Retrieval, Cross-media Ranking, Cross-media Hashing, Cross-collection Topic Modeling.

Cross-media Ranking
Mission: learn an appropriate metric for ranking multi-modal data so that the order of relevance is preserved. For example, the retrieved images are ranked listwise by their relevance to the query text document.
[Figure: Query Textual Document; Ranked Listwise Image Results]

Cross-media Hashing
Mission: learn hashing function(s) that faithfully preserve the intra-modality and inter-modality similarities and map high-dimensional multi-modal data to compact binary codes.
[Figure: Multi-modal Document (one image with its narrative text); Hashing Function; binary codes]

Cross-collection topic modeling
Mission: describe one topic or event with aspect-oriented (e.g., who-what-how) multi-modal data (e.g., representative images or topical words).
[Figure: Topic/Event with aspects Where, Who, What, How, Why, When, across Text, Video and Image]
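The two link heuristics from the link relationship analysis (Rule 1 and Rule 2) can be sketched as follows. Since steps 1 to 4 of the original procedure are not available, everything concrete here is an assumption: the weights 1.0 and 0.5, the dictionary layout for pages and links, the max-based normalization, and the names `link_correlation` and `normalize`.

```python
def link_correlation(images, audios, page_of, links, w_same=1.0, w_colinked=0.5):
    """Return r[i][j] for image i and audio j, scored with the two link heuristics.

    page_of maps a media id to the page that contains it.
    links maps a page to the set of pages it points to.
    The weights are illustrative assumptions, not values from the slides.
    """
    r = []
    for img in images:
        row = []
        for aud in audios:
            p, q = page_of[img], page_of[aud]
            if p == q:
                # Rule 1: both objects sit on the same web page.
                row.append(w_same)
            elif any(p in out and q in out for out in links.values()):
                # Rule 2: some page points to both containing pages.
                row.append(w_colinked)
            else:
                row.append(0.0)
        r.append(row)
    return r


def normalize(r):
    """Scale entries into [0, 1] by the largest entry (unchanged if all zero)."""
    top = max((v for row in r for v in row), default=0.0)
    if top == 0:
        return [row[:] for row in r]
    return [[v / top for v in row] for row in r]


# Toy web graph (hypothetical): "home" links to p1 and p2; p3 is isolated.
links = {"home": {"p1", "p2"}}
page_of = {"Ia": "p1", "Ib": "p2", "Ic": "p3", "Aa": "p1", "Ab": "p2"}
images = ["Ia", "Ib", "Ic"]
audios = ["Aa", "Ab"]

r = normalize(link_correlation(images, audios, page_of, links))
for name, row in zip(images, r):
    print(name, row)
```

Giving Rule 1 a larger weight than Rule 2 reflects the intuition that sharing a page is stronger evidence than being co-linked; any monotone choice of weights would fit the rules as stated.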