Chongqing University Undergraduate Graduation Project (Thesis) Appendix
Appendix D:

FPGA IMPLEMENTATION OF DIGITAL FILTERS

Chi-Jui Chou, Satish Mohanakrishnan, Joseph B. Evans
Telecommunications & Information Sciences Laboratory
Department of Electrical & Computer Engineering
University of Kansas
Lawrence, KS 66045-2228

ABSTRACT

Digital filtering algorithms are most
commonly implemented using general purpose digital signal processing chips for audio applications, or special purpose digital filtering chips and application-specific integrated circuits (ASICs) for higher rates. This paper describes an approach to the implementation of digital filter algorithms based on field programmable gate arrays (FPGAs). The advantages of the FPGA approach to digital filter implementation include higher sampling rates than are available from traditional DSP chips, lower costs than an ASIC for moderate volume applications, and more flexibility than the alternate approaches. Since many current FPGA architectures are in-system programmable, the configuration of the device may be changed to implement different functionality if required. Our examples illustrate that the FPGA approach is both flexible and provides performance comparable or superior to traditional approaches.

1. INTRODUCTION

The most common approaches to the implementation of digital filtering algorithms are general purpose digital signal processing chips for audio applications, or special purpose digital filtering chips and application-specific integrated circuits (ASICs) for higher rates [9, 14]. This paper describes an approach to the implementation of digital filter algorithms on field programmable gate arrays (FPGAs).

Recent advances in FPGA technology have enabled these devices to be applied to a variety of applications traditionally reserved for ASICs. FPGAs are well suited to datapath designs, such
as those encountered in digital filtering applications. The density of the new programmable devices is such that a nontrivial number of arithmetic operations such as those encountered in digital filtering may be implemented on a single device. The advantages of the FPGA approach to digital filter implementation include higher sampling rates than are available from traditional DSP chips, lower costs than an ASIC for moderate volume applications, and more flexibility than the alternate approaches. In particular, multiple multiply-accumulate (MAC) units may be implemented on a single FPGA, which provides comparable performance to general-purpose architectures which have a single MAC unit. Further, since many current FPGA architectures are in-system programmable, the configuration of the device may be changed to implement alternate filtering operations, such as lattice filters and gradient-based adaptive filters, or entirely different functionality.

2. BACKGROUND

Research on digital filter implementation has concentrated on custom implementation using various VLSI technologies. The architecture of these filters has been largely determined by the target applications of the particular implementations. Several widely used digital signal processors such as the Texas Instruments TMS320, Motorola 56000, and Analog Devices ADSP-2100 families have been designed to efficiently implement filtering operations at audio rates. These devices are extremely flexible, but are limited in performance. High performance designs for filtering at sampling rates above 100 MHz have also been demonstrated using CMOS [3, 4, 6, 8, 9, 14, 17, 19, 20, 21] and BiCMOS [8, 20, 22] technologies, using approaches ranging from full custom to traditional factory-configured gate arrays. These efforts have produced high-performance
designs for specific application domains.

There are several potential shortcomings of the custom VLSI approach, although it does promise the best performance and efficiency for the specific application for which a particular design is intended. The most obvious problem is the lack of flexibility in
the custom approach. Custom devices are often suited only for use in a particular application, and cannot be easily reconfigured for other operations even within that same domain. Another problem which the custom VLSI approach often imposes is a lack of adaptability once a device is in use within a
system. Typical custom approaches do not allow the function of a device to be modified within the system, for purposes such as correcting faults. Although these problems can be overcome with sufficient forethought, the costs in performance, implementation complexity, and additional design time often preclude flexible solutions. Lack of flexibility can forestall the cost-effective evaluation of exotic algorithms in a high performance real-time environment. Only high-volume applications or extremely critical low-volume applications can justify the expense of developing a full custom
solution. There are a variety of algorithms which are not within the performance envelope of general purpose processors, and which are not sufficiently commonplace or well-understood to justify implementation in a full custom design. These algorithms cannot be evaluated with the traditional approaches, thus limiting innovation.

Field programmable gate arrays (FPGAs) can be used to alleviate some of the problems with the custom approach. FPGAs are programmable logic devices which bear a significant resemblance to traditional custom gate arrays. While there are a variety of approaches to FPGA implementation, some of the more popular series consist of an array of arbitrarily programmable function blocks, with configurable routing resources which are used to interconnect these blocks. Many of the most popular FPGAs are in-system programmable, which allows the modification of the operation of the
device through simple reprogramming.

The primary limitations of FPGAs are related to the overhead imposed by programmability. In particular, the density of the devices is only now reaching the level necessary to implement complete modules of reasonable complexity. Other difficulties associated with the devices result from the constraints imposed by the architecture, such as limitations on the logic functions which may be implemented in each logic block, and routing delays in the array. Many of these difficulties can be overcome by careful design.

Due to ever-increasing integrated circuit fabrication capabilities, the future of FPGA technology promises both higher densities and higher speeds. Many FPGA families are based on memory technology, so the improvements in those areas should correlate with FPGA evolution. The expanded use of FPGAs in a variety of challenging application domains is thus likely.

FPGAs are well suited for the implementation of fixed-point digital signal processing algorithms. The advantages of DSP on FPGAs are primarily related to the additional flexibility provided by FPGA reconfigurability. Not only can high-performance systems be implemented relatively inexpensively, but the design and test cycle can be completed rapidly due to the elimination of the integrated circuit fabrication delays. The new approach also allows adapting the functions to account for unforeseen requirements.

The problems of DSP on FPGAs are related to the density and routing constraints imposed by the FPGA architectures. In particular, the number of logic gates which may be implemented on a single device, and hence the number of arithmetic units, is still limited, and the routing between modules on an array imposes the critical delay limitations.

Because of the constraints imposed
by FPGAs, implementation of digital filter algorithms through this medium must initially focus on efficient structures which possess low complexity [2]. Concurrent design of efficient digital filter algorithms and FPGA implementations is necessary to take full advantage of the new capabilities.

In this
particular work, Xilinx XC4000-series FPGAs were used to implement various digital filter algorithms and evaluate their performance. A Xilinx XC4000 consists of an array of configurable logic blocks (CLBs), each of which has several inputs (F1-F4, G1-G4) and outputs (X, Y and XQ, YQ). Each CLB can contain both random logic and synchronous elements. In addition to the general-purpose logic functions, each CLB also contains special fast carry logic for addition operations. The XC4000-series contains both local and global routing resources. The local resources allow extremely low delay interconnection of CLBs within the same neighborhood, as well as more extended connection through the use of switching matrices. The global resources provide for the low-delay distribution of signals that are used at widely-spaced points in the array. The speed of a particular application is highly dependent on
routing in the Xilinx FPGAs. The XC4000 family includes parts ranging from 8 by 8 CLB arrays to 24 by 24 CLB arrays. All of these devices are in-system programmable. Low power versions of many of these parts are also available.

3. MULTIPLY-ACCUMULATE UNITS

Several authors [1, 11, 12, 13] have identified
the multiply-accumulate (MAC) operation as the kernel of various digital signal processing algorithms. A variety of approaches to the implementation of the multiplication and addition portions of the MAC function are possible [7, 10]. This work will focus on the realization of multiplication using an array approach and addition using ripple carry methods, although other methods are equally applicable to the FPGA domain.

The structure of a MAC unit is illustrated in Figure 1. The MAC unit presented in this section consists of an 8-bit by 8-bit combinatorial array multiplier and a 16-bit accumulator.
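
As a rough behavioural sketch of such a MAC unit (a software model in Python, not the authors' hardware design), the code below multiplies two 8-bit two's complement values and accumulates into a 16-bit register. It follows the datapath outlined in Section 3.3, where the most significant 8 bits of the product are sign-extended before being added. The function names `to_signed`, `mac_step`, and `fir_output` are illustrative, and wrap-around on accumulator overflow is an assumption.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def mac_step(acc, x, a):
    """One multiply-accumulate step with 8-bit operands and a 16-bit accumulator."""
    product = to_signed(x, 8) * to_signed(a, 8)  # signed product, fits in 16 bits
    top8 = product >> 8                          # most significant 8 bits, sign kept
    return to_signed(acc + top8, 16)             # assumed wrap-around at 16 bits


def fir_output(samples, coeffs):
    """Accumulate sample/coefficient products for one filter output."""
    acc = 0
    for x, a in zip(samples, coeffs):
        acc = mac_step(acc, x, a)
    return acc
```

Calling `fir_output` with equal-length lists of samples and coefficients gives one filter output. Dropping the low 8 bits of each product trades numerical precision for a narrower adder input, which is the size-versus-precision balance that drives the choice of word sizes.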
These word sizes were chosen to balance the size of the implementation, which is limited by the FPGA density, against the numerical precision. Larger word sizes are possible if the number of MAC units per chip is reduced. The increase in density of FPGAs in the future will certainly expand the design space available to the designer, and make such constraints less severe.

3.1. Implementation of Multiplier

The combinatorial multiplier uses one CLB per partial product bit. A 2-input AND gate generates each partial product, but additional circuitry is required to add together all partial products of
equal weight. The total number of CLBs used for the multiplier in this case is 64, and the basic cell structure is illustrated in Figure 2. Each cell is configured as a full adder (except for the type A cell). This full adder accepts a sum and a carry from a previous operation of equal weight, as shown in Figure 2, and the logical AND of the inputs xi and ai. The sum and carry generated by the adder are then sent to the CLBs of proper weight as shown in Figure 3. The multiplier has been configured to perform multiplication of signed numbers in two's complement notation. The small circles in the figure indicate negative inputs or outputs; such bits have to be subtracted rather than added. The cells in the leftmost column of the array only AND their two inputs and generate the product. If one of the two inputs has a negative weight, then the output will have a negative weight.

The conventional 1-bit full adder assumes positive weights on all of its 3 inputs and 2 outputs. Such an adder can be generalized to four types of adder cells by attaching positive and negative weights to the input/output pins, as discussed in [7]. Figure 4 lists the logic symbols for the four types of generalized full adders. The Boolean equations governing the Type 0 and 3 full adders are

[equations missing in source]

and those for the Type 1 and 2 adders are

[equations missing in source]

Type 0 and Type 3 full adders are characterized by the same pair of logic equations, identical to that of the conventional 1-bit full adder (Type 0). This is because a Type 3 full adder can be obtained from a Type 0 full adder by negating all of the input and output values and vice versa. A similar relationship can be established between Type 1 and Type 2 full adders. For Type 0, 1, 2, and 3 full adders, the two independent 4-input functions of the CLB were used to generate the sum and carry outputs. We can easily include the AND gate in the CLB by replacing, for example, X with (xi AND ai) when configuring the CLB. The horizontal inputs (xi, ai) can use the horizontal longlines which are associated with each row for distribution of the signal with a very short routing delay. Other
interconnections can be made using the single-length or double-length lines via Programmable Interconnection Points (PIPs) or switching matrices.

3.2. Adder Implementation

In the XC4000 series, each CLB includes high-speed carry logic that can be activated by configuration. The two 4-input function generators may be configured as a 2-bit adder with built-in hidden carry that can be expanded to any length. The 16-bit adder in our MAC unit, which uses the dedicated carry logic, requires nine CLBs. The middle 14 bits use 7 CLBs, one CLB is used for the MSB, and one is used for the LSB of the adder. For
each CLB in the middle section, the F function is used for the lower-order bit and the G function is used for the higher-order bit. Consequently, the G function is used for the LSB and the F function for the MSB. In the case of the LSB CLB, two values must be input on the G1 and G4 pins. The carry
signal enters on the F1 pin, propagates through the G carry logic, and exits on the COUT pin. The F function of this CLB is not used by the adder and remains available for other purposes. For the middle CLBs, the logic is configured to perform a 2-bit addition of A+B in both the F and G functions, with the lower-order A
and B inputs on the F1 and F2 pins, and the higher-order A and B inputs on the G1 and G4 pins. The carry signal enters on the CIN pin, propagates through the F and G carry logic, and exits on the COUT pin. For the MSB CLB, the two values must be input on the F1 and F2 pins. The carry signal enters on the CIN pin, propagates through the F carry logic, and exits on the COUT pin. The G function generator of this CLB is used to access the carry out signal or calculate a two's complement overflow.

The limitation of using this built-in carry logic is that the carry out (COUT) pin of a CLB can only be connected to the carry in (CIN) pin of the CLBs above or below. Thus the adder using fast carry logic can only be configured vertically in the array.

The dedicated carry circuitry greatly increases the efficiency and performance of adders. Conventional methods for improving performance such as carry generate/propagate are not useful even at the 16-bit level, and are of marginal benefit at longer wordlengths. In our case, the 16-bit adder has a combinatorial delay of only 20.5 ns.

3.3. MAC Implementation

We use the most significant 8 output bits of the multiplier as the input to the low order bits of the adder. The 8-bit input of the adder is sign-extended and added with prev