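PAI LLM

Contents
01  02  03  04  05

01
XLAB, XPS, PAI-Tensorflow, PAI-PyTorch, PAI-Studio
DLC, DSW, EAS
NLP / CV / hundred-billion-parameter scale
ODL, M6, OFA, Swin-Transformer
PAI, AI, 9 SLA
Data, training, inference, stability
PAI: a one-stop intelligent computing platform for the full LLM pipeline

02
-Data Deduplication from Google (2022/03)
-Text Deduplication from BigCode (2023/05)
-The RefinedWeb for Falcon LLM (2023/06)
Higher-quality text input yields better large language models.
jieba, MinHash, MinHashLSH
[Diagram: graph G with nodes A and B; items 1 and 2; Power law; 10]
Distributed union find
1. join
2. Graph connected-components algorithm example

Implementation  Samples      Duplicate rate  Time      Precision  Recall  F1
PAI             500 million  50%             1h 34min  87         99      93
Other           500 million  50%             4h 10min  85         92      90
PAI             1 billion    50%             3h 0min   82         99      90
Other           1 billion    50%             6h 54min  80         90      85

03
TorchAccelerator
-A general framework that helps dispatch operators to new backends (AICompiler) and meanwhile provides a new Tensor expression that swaps in eager mode.
-AI: a compiler that uses advanced optimization techniques to support high-performance codegen.
-Supports FSDP, TP, and other distributed strategies.
Based on the Kube Scheduler Framework
AI
ASW / DSW / PSW
Scheduling choices that match the network architecture can more fully unlock the potential of a high-performance network.

04
LLM, EAS
OPT / GPT / Bloom / GLM *
Model compression: weight quantization, activation quantization, KV Cache quantization
System optimization: compiler optimization, high-performance operator library
Distributed execution: tensor parallelism, pipeline parallelism
Nvidia GPU, AMD GPU
Modeling: high-performance implementations of mainstream models, fully compatible with open-source models
[Figure: OPT-66B, GPU (axis 0 to 4) on A100 (80GB), V100 (32GB), A10 (24GB), for fp16 / int8 / int4]
[Figure: OPT-66B perplexity (axis 0 to 12) on wikitext2, ptb, c4, for fp16 / int8 / int4]
Serving throughput improved 1.7 to 3.8x
First-token latency reduced 8.7 to 13.8x
LLM, BladeLLM
Model weights/config, Compression, Compiling, Serving; User, Platform

05
High-performance Lingjun clusters bring very challenging stability problems:
ECC Error, NCCL Timeout, NCCL Hang, PCIe speed degradation, NVLink Error
AIMaster, EasyCKPT
AIMaster: Hang, Checkpoint
EasyCKPT: multi-level storage and asynchronous parallel saving. Checkpoints can be saved in as little as a few seconds, greatly reducing wasted computation.
EasyCKPT, Serverless, PAI

THANKS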
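
Backup: section 02 names MinHash, MinHashLSH and a union find over connected components as the deduplication steps, but the deck does not show code. The following is a minimal single-machine sketch of that idea, not PAI's distributed implementation. All function names, the shingle size, `num_perm=64` and `bands=16` are assumptions chosen for illustration, and whitespace tokenization stands in for jieba.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of word k-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=64):
    """MinHash signature: the minimum of a seeded 64-bit hash per seed."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature


class UnionFind:
    """Union find with path halving; the root of a set is its smallest id."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def dedup_clusters(docs, num_perm=64, bands=16):
    """Group documents whose signatures collide in at least one LSH band."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc), num_perm)
        for b in range(bands):
            # Documents sharing all rows of a band land in the same bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)

    # Each bucket yields candidate edges; union find merges them into
    # connected components, i.e. clusters of near-duplicates.
    uf = UnionFind(len(docs))
    for ids in buckets.values():
        for other in ids[1:]:
            uf.union(ids[0], other)

    clusters = defaultdict(list)
    for doc_id in range(len(docs)):
        clusters[uf.find(doc_id)].append(doc_id)
    return sorted(clusters.values())
```

Keeping one document per returned cluster removes the duplicates. With 16 bands of 4 rows, pairs become likely candidates at a Jaccard similarity of roughly 0.5 (the usual (1/bands)^(1/rows) approximation); changing the band count shifts that threshold. In a distributed setting the bucket step corresponds to a join and the union find to a connected-components job, which is how the slide lists them.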