Printing and Digital Media Technology Study  2024 No.2 (Serial No.229)  April 2024

RESEARCH PAPERS

Research and Implementation of Digital Virtual Human Interaction Technology Based on Unity3D

LI Guang-ya, SI Zhan-jun*
(College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China)

Abstract  Currently, digital virtual human interaction technology faces issues such as language understanding errors and limited emotional expression, resulting in a negative user experience. In this study, the current status of and challenges facing the technology were analyzed, a Unity3D-based interaction technology was explored, and a technique for generating emotional speech directly from text was proposed. The approach combined ChatGPT text comprehension and generation, text emotion analysis, and improved VITS speech synthesis. A digital virtual human interaction application capable of accurately understanding user input and modelling emotional responses was developed, with holographic interaction effects simulated using a Kinect 2.0 device. The experimental results demonstrated that the technology improves both the interaction and emotional expression abilities of digital virtual humans, providing significant value for their application and development.

Key words  Digital media; Artificial intelligence; Media interaction; Speech synthesis

CLC number  TP391.9    Document code  A    Article ID  2097-2474(2024)02-123-012
DOI  10.19370/10-1886/ts.2024.02.014
Received: 2023-06-21; Revised: 2023-09-29. *Corresponding author.
Citation: LI Guang-ya, SI Zhan-jun. Research and Implementation of Digital Virtual Human Interaction Technology Based on Unity3D [J]. Printing and Digital Media Technology Study, 2024, (2): 123-134.
0 Introduction

Digital virtual human interaction systems are advanced technological systems capable of simulating interactions between virtual entities and humans, creating a natural virtual communication experience. These systems not only have a wide range of applications in education, entertainment and healthcare, but also play pivotal roles in various other domains such as virtual conferences, training simulations, and virtual tours [1]. However, current digital virtual human interaction systems face two primary challenges [2]. Firstly, virtual characters struggle to accurately convey emotions, limiting emotional interaction with users. Secondly, language understanding errors result in inaccuracies when interpreting and responding to user voice and text inputs, decreasing the overall user experience.

This study aimed to explore a technical solution that enhances the understanding and emotional expression of digital virtual humans, and to develop a more precise and efficient digital virtual human interaction system [3] to address these issues. The goals include enhancing the emotional expression capabilities of virtual characters, enabling them to accurately and naturally convey a range of emotions for a more engaging emotional interaction with users, and reducing language understanding errors to better meet the needs of users. To achieve these goals, the study utilized a range of key technologies and methods. The Unity3D engine was employed, and an innovative speech synthesis technique capable of generating emotionally infused speech directly from text was introduced. Language understanding and expression were enhanced using ChatGPT [4]. Emotion analysis was carried out using BERT combined with the GoEmotions dataset, and through an improvement of the VITS speech synthesis model the speech output could be made emotionally rich. Finally, a Kinect 2.0 device was integrated to simulate holographic interaction effects, increasing the realism and interactivity of the virtual character.

1 Research Methods and Implementation
1.1 Unity3D

In digital virtual human interaction, ChatGPT plays a crucial role in language understanding and text generation. There is therefore a need for a highly scalable development tool to address the complexities of integrating advanced language models and creating interactive, immersive virtual environments. Unity3D, as a powerful development platform, provides developers with rich tools and resources. Building digital virtual human projects on Unity3D allows for interactive, visually appealing, highly scalable systems integrated with ChatGPT, expanding the application scope of digital virtual humans.

Leveraging the advantages of the Unity3D engine, the appearance and animation of digital virtual humans could be enhanced using the Unity3D URP (Universal Render Pipeline), which in turn could further improve user immersion. This enhancement had extensive application potential in education, entertainment, virtual tours, virtual live broadcasting, and other fields. The scalability of this enhanced technology allowed developers to continuously improve and extend the functionality of digital virtual humans. Combined with the ChatGPT model, the integrated system enabled the fusion of virtual human knowledge bases and personalized user experiences to meet the needs of various application domains.
1.2 Integration of the ChatGPT Model

ChatGPT is an artificial intelligence language model designed to generate natural language text. GPT is based on the Transformer architecture, a neural network designed specifically for processing sequential data such as text. The Transformer architecture consists of multiple encoder and decoder layers, each composed of self-attention and feedforward sub-layers. In GPT, the input passes through the encoder layers, and the decoder layers generate the output text based on the encoded input. GPT is trained on large text datasets and is capable of generating text that closely resembles human writing. The Transformer encoder-decoder model is shown in Fig.1.

Fig.1 Encoder-decoder model consisting of Transformer encoders
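For concreteness, the following minimal PyTorch sketch instantiates the encoder-decoder structure described above (stacked layers, each containing self-attention and feedforward sub-layers). The dimensions and layer counts are illustrative assumptions, not the configuration used by GPT or in this study.

```python
# A minimal sketch of a Transformer encoder-decoder (illustrative sizes only).
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 1, 512)   # input sequence: length 10, batch 1, embedding size 512
tgt = torch.rand(7, 1, 512)    # shifted-right output sequence fed to the decoder
out = model(src, tgt)          # decoder output conditioned on the encoded input
print(out.shape)               # torch.Size([7, 1, 512])
```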

Currently, several domestic and international research papers [5-6] have confirmed that ChatGPT outperforms previous language models such as XLNet and ELMo in language understanding and generation. XLNet improves upon previous architectures with a permutation-based training strategy, enabling it to understand context more comprehensively. ELMo is influential for its deep contextualized word representations, which help capture a word's meaning based on its surrounding context. ChatGPT builds upon these developments by incorporating an even larger dataset and more refined training techniques, leading to greater improvements in its linguistic capabilities. Therefore, combining digital virtual humans with the ChatGPT model can significantly enhance the interactive capabilities and feedback performance of digital virtual humans with users.

By using the OpenAI-trained ChatGPT model interface, ChatGPT can receive text input from the users and generate natural language responses related to that input. In the Unity3D environment, a "persona" is added in the Start() method to define the characteristics of the digital virtual human. When in use, the system receives input from the user, converts the input into text, packages it as JSON data, and sends a POST request to OpenAI's GPT-3.5 model using UnityWebRequest. The OpenAI API key is included in the request header to validate access permissions. When the model returns a response message, the reply text is extracted from the response and returned to the caller through a callback function.
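The actual implementation issues this request from C# with UnityWebRequest; the following Python sketch only illustrates the same request structure: the JSON payload with a persona system message, the Authorization header carrying the API key, and the extraction of the reply text. The persona text, the "gpt-3.5-turbo" model name, and the send_to_chatgpt helper are illustrative assumptions.

```python
import json
import urllib.request

API_KEY = "YOUR_OPENAI_API_KEY"  # placeholder; the key is sent in the request header
API_URL = "https://api.openai.com/v1/chat/completions"

def send_to_chatgpt(user_text, persona="You are a friendly digital virtual human guide."):
    """Package the user input as JSON and POST it to the GPT-3.5 model (illustrative sketch)."""
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": persona},   # the "persona" defined once at start-up
            {"role": "user", "content": user_text},   # text converted from the user's voice/text input
        ],
    }
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",     # API key validates access permissions
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Extract the reply text from the response, as the Unity callback does
    return body["choices"][0]["message"]["content"]
```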

1.3 Technical Solution for Generating Emotion-Infused Speech Directly from Text

In order to better simulate human emotions with digital virtual humans, a technological solution is needed to generate emotion-infused speech directly from text. BERT [7], introduced by Devlin et al., employs the Transformer architecture and underwent extensive pre-training on a vast corpus of text, including the entire English Wikipedia and BooksCorpus. This pre-training allowed it to learn rich representations of language, enabling it to understand context and nuance across different text domains. The BERT model, depicted in Fig.2, revolutionized many NLP tasks by providing a robust foundation for machines to interpret and generate human-like text, setting the stage for more emotionally responsive virtual human interactions.

Fig.2 Structure of the BERT model

By combining the GoEmotions dataset with the pre-trained BERT model, an appropriate text emotion analysis module can be trained, hereafter referred to as BERT-GoEmotions. In this method, the text emotion analysis module uses BERT-GoEmotions to predict the emotion of the text generated by ChatGPT. The emotion information, along with the text content, is then input into an improved emotion-aware VITS speech synthesis model to infer emotion-infused speech. This technical solution is illustrated in Fig.3.

Fig.3 A technique for directly generating text-to-speech with emotional characteristics
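As a rough illustration of how such a module could be fine-tuned, the sketch below combines the Hugging Face go_emotions dataset with a pre-trained BERT checkpoint. The checkpoint name, hyperparameters, output directory, and the simplification of keeping only the first emotion label per example (GoEmotions is in fact multi-label) are assumptions made for illustration, not details reported here.

```python
# A minimal fine-tuning sketch for a BERT-GoEmotions style text emotion classifier.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("go_emotions", "simplified")            # 27 emotion labels + neutral
label_names = dataset["train"].features["labels"].feature.names
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)
    enc["labels"] = [labels[0] for labels in batch["labels"]]  # keep first label only (simplification)
    return enc

encoded = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-goemotions", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
trainer.save_model("bert-goemotions")        # the module referred to above as BERT-GoEmotions
tokenizer.save_pretrained("bert-goemotions")
```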

In Unity3D, the text or speech content obtained from the client is sent to ChatGPT for text generation. Once the text has been generated, it is sent to the server's BERT-GoEmotions emotion prediction and emotion-aware VITS speech synthesis interface. This process retrieves the emotion parameters and the emotion-infused speech, allowing the digital virtual human to simulate the corresponding emotions. This method effectively addresses the issue of insufficient emotional expression in virtual human interaction projects [8], enabling digital virtual humans to mimic human emotions more authentically.
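A minimal sketch of what such a server interface could look like is given below, assuming Flask, the locally fine-tuned "bert-goemotions" checkpoint from the previous sketch, and a placeholder synthesize_speech() standing in for the emotion-aware VITS inference code; none of these names come from the original implementation.

```python
# Illustrative server endpoint: text in, emotion label and emotion-infused speech out.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
emotion_classifier = pipeline("text-classification", model="bert-goemotions")  # fine-tuned above

def synthesize_speech(text: str, emotion: str) -> str:
    """Placeholder for emotion-aware VITS inference; returns a path to the generated audio."""
    raise NotImplementedError("call the improved VITS model here")

@app.route("/speak", methods=["POST"])
def speak():
    text = request.get_json()["text"]                  # reply text generated by ChatGPT
    emotion = emotion_classifier(text)[0]["label"]     # BERT-GoEmotions emotion prediction
    audio_path = synthesize_speech(text, emotion)      # emotion-infused speech
    return jsonify({"emotion": emotion, "audio": audio_path})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```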

1.4 Generating Emotion-Infused Speech Based on an Improved VITS Method

VITS is a conditional Variational AutoEncoder (VAE) that integrates adversarial learning to enhance the audio quality of text-to-speech models [9]. The model introduced a stochastic duration predictor for synthesizing speech with varied rhythms from textual input. This innovation allows VITS to capture the natural diversity in speech, enabling text to be pronounced with multiple pitches and rhythms through probabilistic modeling. The core of VITS training involves maximizing a variational lower bound, which essentially balances the likelihood of generating accurate speech outputs against the complexity of the model's predictions. This is formalized as Formula (1):

$\log p_{\theta}(x) \ge \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - D_{\mathrm{KL}}\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z)\right)$  (1)

In Formula (1), $p_{\theta}(z)$ and $q_{\phi}(z|x)$ represent the prior distribution and the approximate posterior distribution of the latent variable $z$, respectively, and $\log p_{\theta}(x|z)$ represents the likelihood function of data point $x$. The negative of this bound, i.e. the sum of the reconstruction loss and the KL divergence, can be used as the training loss. The VITS model is illustrated in Fig.4.
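To make Formula (1) concrete, the following PyTorch sketch shows the corresponding training loss for a Gaussian posterior and a standard normal prior. This is a generic VAE-style loss given as an illustration only; the full VITS objective additionally conditions the prior on the text and adds adversarial and duration losses.

```python
import torch
import torch.nn.functional as F

def vae_training_loss(x_hat, x, mu, logvar):
    """Negative ELBO: reconstruction loss plus KL divergence (cf. Formula (1)).

    x_hat: reconstruction of x; mu, logvar: mean and log-variance of the Gaussian
    approximate posterior q(z|x); the prior p(z) is taken to be N(0, I).
    """
    # -E_q[log p(x|z)] approximated by a reconstruction term (L1 is common for spectrograms)
    recon_loss = F.l1_loss(x_hat, x, reduction="mean")
    # KL(q(z|x) || p(z)) in closed form for a diagonal Gaussian vs. a standard normal prior
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```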

To incorporate emotional features, emotional content must first be added to the dataset. Wav2Vec 2.0 was developed by Facebook AI as a speech-processing model that can be used to extract emotional features from speech. It converts raw speech into high-level representations through pre-training and fine-tuning, which can be used for tasks such as emotion classification, and it has shown good performance in capturing emotional information from speech. All the speech in the training set is processed with Wav2Vec 2.0 to extract emotional features. Feature fusion is achieved by linearly projecting these emotional features onto the text features in the VITS text encoder. The advantage of this approach is that it does not require annotation of the dataset, which allows for more diverse emotional content in speech. However, one of the limitations is that a reference audio clip has to be specified as the emotional feature during speech inference to produce speech with the corresponding emotion; the audio of a speaker who is not in the training set cannot be used as the reference emotion audio. The improved VITS text encoder model structure is shown in Fig.5.
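A rough sketch of this feature extraction and linear projection is shown below, assuming the Hugging Face Wav2Vec2Model with the facebook/wav2vec2-base checkpoint, mean pooling over time, and an assumed text-feature width of 192. The actual feature extractor (which Fig.5 suggests yields arousal, dominance and valence style characteristics) and the exact fusion point inside the text encoder may differ.

```python
# Illustrative emotional-feature extraction and fusion onto text encoder features.
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2Model

wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
project = nn.Linear(768, 192)   # project emotional features to the text-feature width (192 assumed)

def emotion_feature(wav_path: str) -> torch.Tensor:
    """Extract an utterance-level emotional feature from a speech clip (illustrative)."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # Wav2Vec 2.0 expects 16 kHz
    with torch.no_grad():
        hidden = wav2vec(waveform).last_hidden_state                 # (1, frames, 768)
    return project(hidden.mean(dim=1))                               # pool over time -> (1, 192)

def fuse(text_features: torch.Tensor, wav_path: str) -> torch.Tensor:
    """Add the projected emotional feature to every position of the text encoder output."""
    return text_features + emotion_feature(wav_path).unsqueeze(1)    # broadcast over phonemes
```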

The improved VITS model can be combined with the technique proposed in Section 1.3 for generating emotion-infused speech directly from text. The emotion analysis module uses BERT-GoEmotions to predict the emotional content of the text and then inputs that emotion, along with the text content, into the improved VITS. To verify the method of generating emotion-infused speech directly from text as outlined in Section 1.3, a segment of reference audio corresponding to the emotion obtained from BERT-GoEmotions (using clustering or manual selection, e.g., "sad", "angry", "surprised") was chosen as the reference emotional audio. This reference audio, along with the text, was fed into the improved VITS model. Following these steps, the system could output emotionally rich speech corresponding to the specified speaker based on the text content.

Fig.4 Schematic of the VITS model
Fig.5 Improved VITS text encoder model diagram
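As a small illustration of this verification flow, the sketch below maps a predicted emotion label (GoEmotions label names are used as keys) to a manually selected reference clip and hands the text plus reference audio to the improved VITS model. The reference_clips paths and the vits_infer callable are illustrative assumptions, not code from the original implementation.

```python
# Hypothetical per-emotion reference clips, selected manually from the training set.
reference_clips = {
    "sadness":  "refs/sad.wav",
    "anger":    "refs/angry.wav",
    "surprise": "refs/surprised.wav",
    "neutral":  "refs/neutral.wav",
}

def speak_with_emotion(text: str, emotion: str, vits_infer) -> str:
    """vits_infer(text, reference_wav) -> output wav path (improved VITS inference, assumed)."""
    reference_wav = reference_clips.get(emotion, reference_clips["neutral"])
    return vits_infer(text, reference_wav)
```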

1.5 Simulating Interactive Holographic Effects Using a Kinect 2.0 Device

Kinect 2.0 [10] is a depth camera developed by Microsoft, primarily used for human-computer interaction applications such as body tracking.
