
Combined Architecture of Adaptive Beamforming and Blind Source Separation for Speech Recognition of Intelligent Service Robots.doc

Uploaded by: 仙人****88 | Document ID: 7595553 | Uploaded: 2025-01-10 | Format: DOC | Pages: 20 | Size: 1.73 MB
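Combined Architecture of Adaptive Beamforming and Blind Source Separation for Speech Recognition of Intelligent Service Robots

Abstract

Successful speech recognition for intelligent robots in noisy environments depends on the performance of the preprocessing elements employed. Although acoustic signals are often corrupted at high noise levels, speech recognition systems such as the widely used HTK do not handle signal distortion. We propose an architecture that effectively combines adaptive beamforming (ABF) and blind source separation (BSS) algorithms in the spatial domain. To avoid permutation ambiguity and heavy computational complexity in the BSS system, an adaptive generalized sidelobe canceller is placed in front of it. For fast processing in hardware, we slightly modify the conventional convolutive mixture model of the BSS. Unlike conventional BSS, the system does not suffer from permutation ambiguity: because the target angle of the front-end beamformer is fixed, it always supplies the enhanced signal and the reference noise signal to two predefined BSS inputs. The proposed system also reduces the heavy computation that BSS requires when it has more than two inputs. The proposed time-domain approach is easy to implement in real-time hardware. We evaluated the structure and its performance with a DSP module. Speech recognition experiments show that the combined system achieves a higher recognition rate in noisy environments, and better performance, than the ABF or BSS system alone.

1. Introduction

Intelligent robots are attracting growing attention. Media such as television and newspapers have shown humanoid robots that walk on two legs. With advances in hardware and control technology, robot motion has come to resemble human motion. However, error-free speech recognition, the first step toward human-robot interaction, has not yet reached the level needed for humanoids that approach human intelligence. A high recognition rate is guaranteed only in quiet conditions and when the speaker is close to the robot. If the noise or interference level is too high, or if the robot is more than 1 meter from a user who is not wearing a headset, the recognition rate drops sharply.

For these reasons, the speech recognition rate in the real world is far lower than in quiet conditions. This makes it necessary to take human auditory perception as a benchmark, in particular our ability to attend only to a specific sound arriving from a specific direction. We model this ability with adaptive beamforming (ABF) for sound localization and blind source separation (BSS) for separating the sound from noise.

2. Background

2.1 Beamforming

Beamforming is a signal localization scheme that controls directionality and gives high sensitivity to the desired signal. It has traditionally been used for military purposes, especially in radar and sonar. More recently it has been applied in engineering fields such as signal processing with microphone arrays and wireless communication with smart antennas. Depending on whether the filter weights can be updated, the technique is divided into conventional and adaptive beamforming. A conventional beamformer has a simple structure and low computational complexity. A classic example is the delay-and-sum beamformer, in which sound sources are assumed to be far from the microphones so that the sound propagates as plane waves. The direction of arrival, the number of microphones, and the spacing between microphones are known in advance. The input speech signals are delayed according to their direction of arrival (DOA), so that only the signal from the desired direction is emphasized. However, the delay-and-sum beamformer has no mechanism for dealing with interference, and it needs a large number of microphones to form a sharp beam.

An adaptive beamformer has adaptive filters, so its weights can change during operation. Because it continually optimizes performance by adaptively placing nulls in the directions of interference, it can handle unknown interference. The best-known adaptive beamformer for wideband beamforming is the constrained minimum power adaptive beamformer proposed by Frost. It satisfies a desired frequency response in the look direction while minimizing the output noise power through constrained minimization of the total output power. Griffiths and Jim reformulated it and introduced the generalized sidelobe canceller (GSC). The GSC consists of a fixed beamformer, which produces a beam satisfying the desired constraint; a blocking matrix, which blocks the desired signal to produce a noise/interference-only reference signal; and a multi-input canceller, which attempts to further cancel the noise/interference in the fixed beamformer output. The GSC has a simple computational structure, but it fails when the desired signal is much stronger than the interference: the stronger the desired signal, the more it contributes to the total output power, so the desired signal itself is cancelled. Many methods have been proposed to improve the GSC by reducing steering vector error, sensor position error, array channel mismatch, and so on. However, these methods have limitations in reverberant conditions, where the pure-delay channel assumption does not hold.

2.2 Blind Source Separation

Blind source separation (BSS) recovers source signals from linear mixtures without channel information. In acoustics, the mixtures are superpositions of time-delayed sources and reflections, and are described by a convolutive mixture model. The convolutive mixture model is expressed as follows.

where {h_ij,p} is the room impulse response between the j-th source and the i-th microphone, and x_i(t) is the signal at the i-th microphone at time t.

Solving the convolutive mixture model involves more parameters than separating instantaneous mixtures. Scaling and permutation problems remain in the recovery process. With more than two sources and two microphones, the permutation ambiguity is difficult to resolve.

2.3 Geometrical Source Separation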
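As shown in the previous section, convolutive blind source separation and adaptive beamforming have similar goals: both try to recover source signals from sensor mixtures using arrays of filters. They differ, however, in the information they use. BSS uses both spatial and frequency-domain information, whereas adaptive beamforming uses only spatial information about the sources and the sensor array. As a result, BSS suffers from scaling and permutation ambiguity, while ABF uses only second-order statistics and has a cross-talk problem. Schemes that combine BSS and ABF let the two algorithms complement each other. They fall into three approaches:

1. Incorporating geometric information into the convolutive BSS algorithm
2. Formulating convolutive BSS as multiple sets of adaptive beamformers to resolve the ambiguities in BSS
3. Using the beamforming structure together with an ICA cost function

3. Proposed Algorithm

3.1 Combining Method

In this section we combine convolutive BSS and ABF in the time domain to ease hardware implementation. We aim to solve the cross-talk problem of ABF, which is especially severe in reverberant conditions, and the permutation ambiguity of BSS. Earlier papers combined the two algorithms in the frequency domain, which limits their use in real-time hardware. To overcome this, we propose an algorithm that combines time-domain convolutive BSS with the generalized sidelobe canceller. Following human auditory perception, the scheme first applies the ABF algorithm (AGSC) to focus on the predefined direction from which the target speech arrives, and then applies the BSS algorithm to separate the observed signals into source and interference signals. Figure 1 shows the structure.

Figure 1. Proposed combined structure

Convolutive BSS with more than two inputs has not been used in real environments because of its high complexity. Placing a beamformer (AGSC) in front of the BSS not only sharply reduces the number of sources to be separated but also improves the performance of the convolutive BSS algorithm by turning distributed sources into point sources.

3.2 Applied Adaptive Beamforming Algorithm

We describe the adaptive generalized sidelobe canceller proposed by Changkyu Choi, shown in Figure 2. Whereas the adaptive generalized sidelobe canceller (AGSC) connects the adaptive blocking filters and the adaptive cancelling filters in a feed-forward arrangement, the proposed adaptive beamformer connects them in feedback. The advantage of the proposed structure is that it needs fewer filter taps to achieve the same result as an AGSC with a large number of taps.

Figure 2. Adaptive generalized sidelobe canceller

The input to the four-channel signal separation block consists of the delay-and-sum beamformer output b(k) and the time-aligned signals x'(k) of the four microphone inputs. The four-microphone array provides four time-delayed inputs. The enhanced signal is obtained by subtracting the adaptive reference noise signal from the four time-delayed input signals. In the adaptive blocking filters, the enhanced signal is removed from each time-delayed input in each channel. The adaptive reference noise signal is formed by summing the four adaptively filtered noise signals in the adaptive cancelling filters. The m-th adaptive blocking filter (ABF in this section) and the m-th adaptive cancelling filter (ACF) are denoted h_m(k) and g_m(k), respectively. The operation of the proposed structure is given by the following equations:

where z_m(k) is the input to the m-th ACF, m = 1, ..., M; M is the number of microphones in the array; k is the current time index; x'_m(k) is the time-aligned signal of the m-th microphone; h_m(k) is the coefficient vector of the m-th ABF; h_m,l(k) is the l-th coefficient of h_m(k), l = 1, ..., L; L is the number of ABF taps; the vector u(k) collects the L most recent values of u(k); v(k) is the sum of the ACF outputs; g_m(k) is the coefficient vector of the m-th ACF; g_m,n(k) is the n-th coefficient of g_m(k), n = 1, ..., N; N is the number of ACF taps; the vector z_m(k) collects the N most recent values of z_m(k); and u(k) is the final output speech produced by the proposed adaptive filter. The information-maximization learning rule is used to update the ABF and ACF coefficients.

3.3 Applied Convolutive BSS Algorithm

For the time-domain convolutive BSS algorithm, we adopt the parallel feedback network architecture proposed in our laboratory. The algorithm can be packaged as a single module. Figure 3 shows the structure.

Figure 3. IMCL parallel feedback network BSS

As Equation (1) shows, the convolutive mixture model includes the p = 0 term.

Because Equation (10) requires heavy computation at time t, y_i(t) and y_j(t) must be grouped on the right-hand side of the equation.

However, if we ignore the p = 0 term, that is, if we assume that one output does not affect the others within a single sample period, Equation (10) simplifies as follows.

This modification of the original algorithm makes the computation fast on a DSP because it reduces the number of costly multiplications and inversions. Note that Equation (10) requires multiplying [I - W_0]^-1 by the sum over p = 0, ..., L of W_p(t) y(t - p), whereas Equation (12) does not.

In PC and DSP simulations, the difference between the output waveforms of the original and modified algorithms was negligible, while the modified algorithm produced its output about 30 percent faster. We expect further performance gains from an FPGA or ASIC implementation because the circuit is simpler.

3.4 Proposed Combined Structure

We propose a combined structure for the preprocessing stage of service robots operating in noisy environments. It has two stages: an adaptive generalized sidelobe canceller followed by a convolutive BSS system.

As described above, the enhanced speech output of the AGSC is connected to one input of the BSS system, and the reference signal from the other AGSC output is connected to the other BSS input. Connecting the two systems in series avoids the permutation ambiguity and heavy computation of the original convolutive BSS system, and gives directional sensitivity and robustness against the reverberation and cross-talk problems of conventional adaptive beamformers.

Figure 4. Proposed combined structure

4. Experiments and Evaluation

We evaluated the proposed ABF/BSS combined structure with a DSP module under noisy conditions and compared it with ABF and BSS alone using speech recognition tests. In this experiment, the adaptive blocking filters and adaptive cancelling filters of the AGSC used at most 64 taps. The parallel feedback network BSS system developed in our laboratory served as the BSS stage connected to the AGSC outputs. We ported the combined structure to the DSP. The DSP used was a TMS320C6713 from Texas Instruments running at 225 MHz. All experimental data were recorded at 16 kHz. For speech recognition we used a triphone-based recognizer built with the Hidden Markov Model Toolkit (HTK). The vocabulary consisted of Korean words of at least three syllables. 512 words were trained, but only 100 of them were used. Figure 5 shows the experimental setup.

Figure 5. Layout of the room and the positions of the microphones, noise source, and speaker

A linear array of four microphones was placed at the center of the room. The microphones were omnidirectional and spaced 0.1 m apart. The speaker was 1 m from the center microphone, and the noise source was 1 m away at an angle of 60 degrees to the speaker. For the BSS test we used two microphones; for ABF and the proposed algorithm we used four. Before testing, we measured the noise level in quiet conditions at 26 dB and treated that condition as silent. The level of the artificial noise was compared with the recorded level of the speaker.

Factory noise was used in the experiments. Each speaker uttered the 100 words three times for ABF, BSS, and our combined structure. Five male speakers took part. Six noise levels were used. For each noise condition we computed the average recognition rate over the five speakers. Figure 6 shows the performance of each algorithm.

Figure 6. Performance of ABF, BSS, and the combined system

In quiet conditions, the recognition rates of ABF, BSS, and the combined algorithm were similar, which suggests that the preprocessing stage has little influence on the overall recognition system in that condition. As the noise level increased, however, the recognition rates of the systems diverged. Overall, the combined algorithm showed a higher recognition rate than ABF (AGSC) and BSS. At an SNR of 16 dB, the proposed algorithm achieved a recognition rate of 84%, whereas ABF and BSS achieved only 79% and 78%, respectively. At an SNR of 11 dB, HTK failed because the high noise prevented it from detecting the endpoints of the speech. We therefore used a voice activity detector (VAD) for ABF, BSS, and the proposed algorithm at the 11 dB SNR level only. The VAD threshold was tuned between the speech and noise levels so that HTK could operate.

5. Conclusion

The combined architecture we propose is suitable for the preprocessing stage of speech recognition in intelligent service robots. In noisy real-world environments, a user's spoken commands are often corrupted before they reach the robot, so the robot cannot understand them. To address this, we introduced a combined ABF and BSS algorithm that imitates the way a person concentrates on the sound he or she wants to hear. We first apply ABF to capture the signal from the desired direction and then apply BSS to separate the sounds. Although the proposed structure includes a BSS algorithm, unlike conventional BSS it resolves the permutation ambiguity and reduces the computational load when there are more than two inputs. The structure also makes the system directionally sensitive.

Combined Architecture of Adaptive Beamforming and Blind Source Separation for Speech Recognition of Intelligent Service Robots

Abstract

Successful speech recognition in noisy environments for intelligent robots depends on the performance of the preprocessing elements employed. Even though acoustic signals are often corrupted in high-noise environments, speech recognition systems such as the widely used HTK do not deal with signal distortion problems. We propose an architecture that effectively combines adaptive beamforming (ABF) and blind source separation (BSS) algorithms in the spatial domain.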
To avoid permutation ambiguity and heavy computational complexity in the BSS system, the adaptive generalized sidelobe canceller is employed in front of the BSS system. We slightly modified the conventional convolutive mixture model of the BSS for fast processing in hardware implementations. Unlike conventional BSS, the proposed system does not suffer from permutation ambiguity: because the target angle of the front-end beamformer is fixed, it always provides the enhanced signal and the reference noise signal to the two predefined inputs of the BSS. The proposed system also reduces the heavy computation in the BSS when the BSS has more than two inputs. The proposed time-domain approach can easily be implemented in real-time hardware. We evaluated the structure and assessed its performance with a DSP module. The experimental results of the speech recognition test show that the proposed combined system achieves a high speech recognition rate in noisy environments and better performance than the ABF and BSS systems.

1. Introduction

Intelligent robots are attracting growing interest. Media such as TV and newspapers have shown humanoid robots that can walk on two legs. Thanks to developments in hardware and control technologies, a robot's motion has come to resemble a human's. However, errorless speech recognition, a first step toward human-robot interaction, is not yet good enough to be implemented in real humanoids that approach human intelligence. A high speech recognition rate is guaranteed only in silent conditions and when the speaker is close to the robot. If the noise or interfering sound level is too high, or if the robot is more than 1 meter away from a user who is not using a headset, the speech recognition rate degrades drastically. For these reasons, the speech recognition rate in the real world is much lower than in silent conditions.
This makes it necessary to benchmark the human auditory recognition sense, the capability with which we focus only on a particular sound originating from a specific direction. We describe this auditory capability with adaptive beamforming (ABF) and blind source separation (BSS) techniques for sound localization and separation from noise, respectively.

2. Background

2.1 Beamforming

Beamforming is a signal localization scheme that controls directionality and gives high sensitivity to desired signals. Beamforming has traditionally been used for military purposes, especially in radar and sonar. Recently it has been used in engineering areas such as signal processing with a microphone array and wireless communication with a smart antenna. The technique is divided into conventional and adaptive beamforming according to whether the filter weights can be updated. A conventional beamformer has a simple structure and low computational complexity. One of the classic conventional beamformers is the delay-and-sum beamformer. In the delay-and-sum beamformer, sound sources are assumed to be distant from the microphones, so that sound is transmitted as plane waves. The direction of sound arrival, the number of microphones, and the distance between array microphones are prior information. Input speech signals are delayed according to their direction of arrival (DOA), so that only the signal in the desired direction is emphasized. However, the delay-and-sum beamformer includes no mechanism for dealing with interference, and it needs a large number of microphones to form an exact beam.

An adaptive beamformer has adaptive filters, so the beamformer's weights can be changed adaptively during processing. Since it constantly optimizes performance by adaptively placing nulls in the direction of interference, it can deal with unknown interference. The most famous adaptive beamformer for wideband beamforming is the constrained minimum power adaptive beamformer proposed by Frost.
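The delay-and-sum beamformer described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a uniform linear array, far-field plane waves, a sound speed of 343 m/s, and delays rounded to whole samples. The function names `steering_delays` and `delay_and_sum` are hypothetical. The example values (four microphones, 0.1 m spacing, 16 kHz sampling, a 60-degree direction) are borrowed from the experimental setup this document reports.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(num_mics, spacing_m, doa_deg, fs_hz):
    """Arrival delay of a far-field plane wave at each microphone, in samples.

    doa_deg is measured from broadside (0 means straight ahead of the array).
    Delays are shifted so the earliest microphone has delay 0, then rounded.
    """
    step = spacing_m * math.sin(math.radians(doa_deg)) / SPEED_OF_SOUND * fs_hz
    raw = [m * step for m in range(num_mics)]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay, then average the channels."""
    usable = len(channels[0]) - max(delays)
    return [
        sum(ch[k + d] for ch, d in zip(channels, delays)) / len(channels)
        for k in range(usable)
    ]


# Simulate a source that reaches microphone m delayed by delays[m] samples.
delays = steering_delays(num_mics=4, spacing_m=0.1, doa_deg=60.0, fs_hz=16000)
source = [math.sin(2 * math.pi * 440 * k / 16000) for k in range(400)]
channels = [[0.0] * d + source[: len(source) - d] for d in delays]
output = delay_and_sum(channels, delays)
```

A signal arriving from the steered direction is realigned across channels and adds coherently, so `output` reproduces `source`. Signals from other directions stay misaligned and are only partly attenuated by the averaging.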
Frost's beamformer can satisfy a desired frequency response in the look direction while minimizing the output noise power through constrained minimization of the total output power. Griffiths and Jim reconsidered it and introduced the generalized sidelobe canceller (GSC). The GSC consists of a fixed beamformer, which satisfies the desired constraint; a blocking matrix, which produces a noise/interference-only reference signal by blocking the desired signal; and a multi-input canceller, which attempts to further cancel the noise/interference signal in the fixed beamformer output. The GSC has a simple calculation structure, but it is not effective when the desired signal is much stronger than the interfering signals. In that case the desired signal is eliminated, because the stronger the desired signal is, the more it contributes to the total output power. To improve the GSC, numerous methods have been proposed to reduce steering vector error, sensor location error, array channel mismatch, etc. However, in reverberant conditions, where the pure-delay channel assumption does not hold, these approaches have limitations.

2.2 Blind Source Separation

Blind source separation (BSS) refers to an approach that recovers source signals from linear mixtures without channel information. Because of the time-delayed superposition of the sources and reflections from the walls, mixtures in acoustics are described by a convolutive mixture model. The convolutive mixture model is expressed in the following form.

where {h_ij,p} is the room impulse response between the j-th source and the i-th microphone, and x_i(t) is the signal present at the i-th microphone at time instant t.

Solving the convolutive mixture model involves more parameters than separating instantaneous mixtures. The scaling and permutation problems remain in the recovery process. If there are more than two sources and two microphones, it is difficult to resolve the permutation ambiguity.
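To make the convolutive mixture model concrete, the sketch below mixes sources through short FIR impulse responses. The equation itself is not reproduced in this copy, so the code assumes the standard form x_i(t) = sum over j and p of h_ij,p * s_j(t - p), which is consistent with the symbol definitions above. The function name `convolutive_mix` and the toy filter values are illustrative assumptions.

```python
def convolutive_mix(sources, h):
    """Mix sources through FIR room responses.

    sources[j][t] is source j at time t; h[i][j][p] is tap p of the
    impulse response from source j to microphone i.
    Returns x with x[i][t] = sum_j sum_p h[i][j][p] * sources[j][t - p].
    """
    length = len(sources[0])
    mixed = []
    for mic_filters in h:
        x_i = [0.0] * length
        for s_j, taps in zip(sources, mic_filters):
            for p, tap in enumerate(taps):
                for t in range(p, length):
                    x_i[t] += tap * s_j[t - p]
        mixed.append(x_i)
    return mixed


# Two sources, two microphones: a direct path plus a one-sample-delayed leak.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
h = [
    [[1.0], [0.0, 0.5]],   # microphone 0: source 0 direct, source 1 delayed
    [[0.0, 0.25], [1.0]],  # microphone 1: source 0 delayed, source 1 direct
]
mixed = convolutive_mix(sources, h)

# Relabeling the sources (and the matching filters) leaves the mixtures unchanged.
swapped = convolutive_mix(sources[::-1], [row[::-1] for row in h])
```

The last line illustrates the permutation ambiguity mentioned above: the microphone signals are identical under either labeling of the sources, so a separation algorithm working from the mixtures alone may return the sources in either order.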
2.3 Geometrical Source Separation

As shown in the previous section, convolutive blind source separation and adaptive beamforming algorithms have a similar goal. That is to say, both algorithms attempt to obtain source signals from sensor mixtures using filter arrays. However, ABF and BSS algorithms differ in the information they employ. While BSS uses both spatial and frequency-domain information, adaptive beamforming uses only spatial information about the source signals and the sensor array. Therefore, the BSS algorithm suffers from scaling and permutation ambiguity. ABF uses only second-order statistics and has a cross-talk problem. Schemes combining BSS and ABF algorithms have shown that the two complement each other. These are divided into the following three approaches.

1. Incorporation of geometrical information into the convolutive BSS algorithm
2. Formulation of convolutive BSS as multiple sets of adaptive beamforming to resolve ambiguities in BSS
3. Utilization of the beamforming structure and an ICA cost function

3. Proposed Algorithm

3.1 Combining Method

In this section, we combine convolutive BSS and ABF algorithms in the time domain to facilitate implementation in hardware. We attempt to solve the drawback of ABF, a cross-talk problem that is especially severe in reverberant conditions, and the permutation ambiguity of BSS. Previous papers proposing a method that combines the two algorithms in the frequency domain have limited application to real-time hardware implementation. To overcome this problem, we propose an algorithm that combines time-domain convolutive BSS and the generalized sidelobe canceller. With reference to the human auditory recognition sense, the combining scheme first employs the ABF algorithm (AGSC) to focus on the predefined direction from which the source speech signal is delivered, and then applies the BSS algorithm to separate the observed signals into source and jammer signals. Fig. 1 shows the structure.

Figure 1. Proposed Combining Structure

The convolutive BSS algorithm with more than two inputs has not been used in real-world situations because of its high order of complexity. With a beamformer (AGSC) connected in front of the BSS, we not only drastically reduce the number of sources to be separated, but also improve the performance of the convolutive BSS algorithm by turning distributed sources into point sources.

3.2 Applied Adaptive Beamforming Algorithm

We describe the adaptive generalized sidelobe canceller proposed by Changkyu Choi. The structure is shown in Fig. 2. While the adaptive generalized sidelobe canceller (AGSC) connects the adaptive blocking filters and the adaptive cancelling filters in feed-forward, the proposed adaptive beamformer has them in feedback. The advantage of the proposed structure is that it needs fewer filter taps to give the same speech quality as the AGSC with a larger number of filter taps.

Figure 2. Adaptive Generalized Sidelobe Canceller

The input of the four-channel signal separation block consists of the delay-and-sum beamformer output b(k) and the time-aligned signals x'(k) of the four microphone inputs. The four-input microphone array has four time-delayed inputs. The enhanced signal is acquired by subtracting the adaptive reference noise sig