
Sora / OpenAI Technical Report Summary with Original Text (sora+openai技术文档总结+中英对照原稿.pdf)

Uploaded by: Stan****Shan | Document ID: 1188708 | Upload date: 2024-04-18 | Format: PDF | Pages: 8 | Size: 1.34 MB

OpenAI Sora Technical Report: Key-Point Summary and Original Text

Key points

Model approach:
1. The architecture is a diffusion model combined with a transformer.
2. During training, a pre-trained model first encodes a large corpus of source videos of varying sizes into a unified patch representation; the extracted spacetime elements serve as transformer tokens.
3. The model's strong results are closely tied to its extremely large dataset and the additional compute spent on training.

Strengths:
1. Consistency of characters and backgrounds: even when a character moves out of the camera's view and comes back, it keeps the same appearance.
2. Very strong natural-language understanding.
3. Can generate videos of different orientations (landscape or portrait) from the same seed, suiting different devices.
4. Can generate up to one minute of high-definition video.
5. Text, images, and video can all act as conditioning inputs that control the output.

Weaknesses:
1. Weak understanding of physical rules: for example, a candle does not go out when blown, left and right are confused, and glass does not shatter when dropped.
2. High compute requirements (speculation).

Demonstrated capabilities:
1. Text-to-video, image-to-video, image-plus-text-to-video, and video modification.
2. Video restyling, video extension, and video completion.

Future outlook:
1. Could reshuffle the AI video-generation industry.
2. The ceiling of diffusion models is higher than imagined.
3. Global consistency can be solved.
4. Text-to-3D may see a breakthrough.
5. Strong potential for new AR, VR, and Vision Pro applications.

Expert opinions:

Report original: https:/ ators

Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high-fidelity video. Our results suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks [1, 2, 3], generative adversarial networks [4, 5, 6, 7], autoregressive transformers [8, 9], and diffusion models [10, 11, 12]. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high-definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data [13, 14]. The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text: code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches.

Patches have previously been shown to be an effective representation for models of visual data [15, 16, 17, 18]. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space [19], and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data [20]. This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

Spacetime Latent Patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too, since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.

Scaling transformers for video generation

Sora is a diffusion model [21, 22, 23, 24, 25]; given input noisy patches (and conditioning information like text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer [26]. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling [13, 14], computer vision [15, 16, 17, 18], and image generation [27, 28, 29].
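The two steps just described, compressing video into a latent and decomposing that latent into spacetime patches that act as transformer tokens, can be sketched in a few lines. This is a minimal illustration only; the patch size (2x4x4) and latent shape below are made-up assumptions, since the report gives no implementation details:

```python
import numpy as np

def spacetime_patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch covers pt frames x ph x pw latent pixels and becomes one
    transformer token. Dimensions must divide evenly in this sketch.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Reorder to (t-blocks, h-blocks, w-blocks, pt, ph, pw, C) so that each
    # block's contents are contiguous before flattening.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(-1, pt * ph * pw * C)
    return tokens  # shape: (num_tokens, token_dim)

# A hypothetical latent: 8 frames of a 32x32 latent grid with 4 channels.
latent = np.random.randn(8, 32, 32, 4)
tokens = spacetime_patchify(latent)
print(tokens.shape)  # (256, 128): (8/2)*(32/4)*(32/4) tokens of dim 2*4*4*4
```

Because "images are just videos with a single frame," the same routine covers images by using a temporal patch size of 1.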

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size, e.g., 4-second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080 videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model.

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
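Sora's inference procedure, as described above, starts from randomly-initialized patches arranged in a grid whose shape sets the output size, and repeatedly applies a model trained to predict the clean patches. The toy loop below mimics that pattern; the linear schedule, step count, and zero-predicting stand-in denoiser are placeholder assumptions, not details from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(denoise, grid_tokens, token_dim, steps=10):
    """Toy diffusion sampling: start from pure noise tokens arranged in a
    grid of the caller's chosen size, then iteratively move toward the
    denoiser's prediction of the clean patches."""
    x = rng.standard_normal((grid_tokens, token_dim))  # randomly-initialized patches
    for t in np.linspace(1.0, 0.0, steps):
        pred_clean = denoise(x, t)          # in Sora this is the transformer's job
        x = t * x + (1.0 - t) * pred_clean  # crude step toward the prediction
    return x

# Stand-in "denoiser" that always predicts zeros; a real model would be a
# transformer conditioned on text prompts.
denoise = lambda x, t: np.zeros_like(x)

# A 16:9 landscape grid and a 9:16 portrait grid, sampled from the same model.
landscape = sample(denoise, grid_tokens=16 * 9, token_dim=128)
portrait = sample(denoise, grid_tokens=9 * 16, token_dim=128)
```

Changing the grid shape is all it takes to move between landscape, portrait, or anything in between, which is the flexibility the section above describes.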

Language understanding

Training text-to-video generation systems requires a large number of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 [30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos. Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.

Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks: creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 [31] and DALL·E 3 [30] images.
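The report states that Sora can animate a still image given the image and a prompt, but not how the conditioning is implemented. One common approach in diffusion video models, shown here purely as an assumption, is to encode the image into the latent space and clamp the first frame's latent to it during denoising:

```python
import numpy as np

rng = np.random.default_rng(0)

def animate_image(denoise, image_latent, num_frames, steps=10):
    """Hypothetical image-to-video conditioning: keep frame 0 fixed to the
    encoded input image while the remaining frames are denoised from noise.
    This is an illustrative guess, not Sora's published method."""
    H, W, C = image_latent.shape
    x = rng.standard_normal((num_frames, H, W, C))
    for t in np.linspace(1.0, 0.0, steps):
        x[0] = image_latent                 # clamp the conditioning frame
        pred = denoise(x, t)
        x = t * x + (1.0 - t) * pred
    x[0] = image_latent
    return x

denoise = lambda x, t: np.zeros_like(x)   # stand-in denoiser
img = rng.standard_normal((32, 32, 4))    # a hypothetical encoded input image
video = animate_image(denoise, img, num_frames=8)
```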

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts differently from the others, yet all four videos lead to the same ending. We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit [32], to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes, up to 2048x2048 resolution.
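The report likewise does not describe the mechanism behind connecting two videos. A simple latent-space crossfade, sketched below as an assumption, conveys the idea of a gradual interpolation between clips:

```python
import numpy as np

def connect_latents(a, b):
    """Crossfade two latent videos of equal shape (T, H, W, C): frame i is a
    weighted blend that shifts from clip `a` to clip `b` over time.
    An illustrative sketch only; Sora's actual interpolation is unpublished."""
    T = a.shape[0]
    w = np.linspace(0.0, 1.0, T).reshape(T, 1, 1, 1)
    return (1.0 - w) * a + w * b

a = np.zeros((8, 4, 4, 2))   # stand-in latent for the first clip
b = np.ones((8, 4, 4, 2))    # stand-in latent for the second clip
mid = connect_latents(a, b)
# The first frame matches `a`, the last matches `b`, and frames in between blend.
```

A real system presumably keeps each intermediate frame on the model's learned data manifold (for example, by re-denoising the blended latents) rather than blending latents directly, which would visibly ghost.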

Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.; they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes; one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning "Minecraft."

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly capable simulators of the physical and digital world, and the objects, animals and people that live within them.

Discussion

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model, such as incoherencies that develop in long-duration samples or spontaneous appearances of objects, in our landing page. We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.
