PAI-ChatLearn: [...] PAI [...] (RLHF) [...]

Contents
01 [...]
02 PAI-ChatLearn [...]
03 PAI-ChatLearn [...]

01 [...]
[...] Zero DP/FSDP [...]
[...] GPT3-175B [...]
MPMD [...]: Pathways [1], RLHF [2]
[1] https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[2] https://arxiv.org/abs/2204.05862

02 PAI-ChatLearn [...]
RLHF (Reinforcement Learning from Human Feedback) [...]
PAI-ChatLearn [...] API [...]
Configuration: RLHF Config [...], Model Config [...]
Components: model, Engine, DistActor [...], Backend (Megatron [...], vLLM [...])
PAI-ChatLearn workflow:
  1. Initialization
  2. Define the models
  3. Define the engine and the dataset
  4. Start training
Configuration files:
  Runtime environment configuration
  Model configuration
  RLHF training configuration

03 PAI-ChatLearn [...]
RLHFModule [...]: setup, forward_step, train_step
[...] Megatron, DeepSpeed, PyTorch, vLLM, TorchAcc [...]

Vicuna 13B+13B [...]
SFT
  Data fields: query, response
  Model: Vicuna-13B [...] transformers, Megatron [...]
  [...] ChatLearn [...] SFT
Reward Model
  Data fields: query, response (1, 2, ...), score (score1, score2, ...)
  Model: Reward Model, from the SFT model
  [...] ChatLearn [...] Reward Model
RLHF
  Data fields: prompt
  Models: Policy/Reference Model, from the SFT model; Reward/Value model, from the RM model
  [...] ChatLearn [...] RLHF
Inference
  [...] Megatron [...] ChatLearn [...] Transformer [...] Inference

Vicuna 13B+13B [...]
Models: 13B (Policy/Reference Model) + 13B (Reward/PPO Value Model)
Dataset: HH (helpful & harmless) [...] SFT, RLHF [...]
Evaluation: MT-Bench, using the GPT-4 API, SFT and RLHF models [...] 11%

PAI-ChatLearn [...]
7B+7B, 30B+30B: [...] DeepSpeed-Chat [...] 48%, 82%
66B+66B: DeepSpeed-Chat OOM
ChatLearn: 66B+66B, 175B+175B [...]
Qwen-14B: RLHF model [...] SFT model [...]
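As a rough illustration of the module interface named in section 03 (setup, forward_step, train_step), the sketch below shows how a model could be wrapped behind those three methods. Only the three method names come from the slides. The classes `ModuleSketch` and `ToyLinearModule`, the method signatures, the batch layout, and the learning rate are hypothetical and are not ChatLearn's actual API; the toy model is a single scalar weight so the example runs without any framework.

```python
from typing import Dict, List


class ModuleSketch:
    """Hypothetical stand-in for an RLHFModule-style base class.

    The method names follow the slides; the signatures are assumptions.
    """

    def setup(self) -> None:
        """Build the model and optimizer state once, before any step runs."""
        raise NotImplementedError

    def forward_step(self, batch: Dict[str, List[float]]) -> Dict[str, List[float]]:
        """Run inference on one batch and return the outputs."""
        raise NotImplementedError

    def train_step(self, batch: Dict[str, List[float]]) -> float:
        """Run one optimization step on one batch and return the loss."""
        raise NotImplementedError


class ToyLinearModule(ModuleSketch):
    """Toy model y = w * x, trained by plain gradient descent on squared error."""

    def __init__(self, lr: float = 0.1) -> None:
        self.lr = lr
        self.w = None  # created in setup()

    def setup(self) -> None:
        self.w = 0.0

    def forward_step(self, batch):
        return {"y": [self.w * x for x in batch["x"]]}

    def train_step(self, batch) -> float:
        xs, ys = batch["x"], batch["y"]
        errors = [self.w * x - y for x, y in zip(xs, ys)]
        # Mean squared error, computed before the weight update.
        loss = sum(e * e for e in errors) / len(errors)
        # d(loss)/dw = mean(2 * error * x)
        grad = sum(2.0 * e * x for e, x in zip(errors, xs)) / len(errors)
        self.w -= self.lr * grad
        return loss


def run_training(module: ModuleSketch, batch, num_steps: int) -> List[float]:
    """Call setup() once, then train_step() num_steps times; return the losses."""
    module.setup()
    return [module.train_step(batch) for _ in range(num_steps)]
```

A driver such as `run_training(ToyLinearModule(), {"x": [1.0, 2.0], "y": [2.0, 4.0]}, 2)` only ever touches the three interface methods, which is the point of this kind of design: the engine can schedule inference and training steps without knowing which backend implements them.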