CHAPTER 3

Hardware Platforms for Real-Time Image and Video Processing

3.1 INTRODUCTION

A great deal of the present growth in the field of image/video processing is primarily due to the ever-increasing performance available on standard desktop PCs, which has allowed rapid development and prototyping of image/video processing algorithms. The desktop PC development environment has provided a flexible platform in terms of computation resources including memory and processing power. In many cases, this platform performs quite satisfactorily for algorithm development. The situation changes once an algorithm is required to run in real time.

Meeting that requirement involves first applying algorithmic simplifications as discussed in Chapter 2, then writing the algorithm in a standard compiled language such as C, and finally porting it over to a target hardware platform. After the algorithmic simplification process, there are different possible hardware implementation platforms that one can consider for the real-time implementation. The selection of an appropriate hardware platform depends on the answers to the following questions:

What are the important features of an image/video processing hardware platform?

What are the advantages and disadvantages associated with different hardware platforms?

What hardware platforms have previously been used for the real-time application under consideration?

What kind of hardware platform is best suited for the real-time application under consideration?

These questions will be examined in the sections that follow in this chapter.

3.2 ESSENTIAL HARDWARE ARCHITECTURE FEATURES

As discussed in Chapter 1, practical image/video processing systems include a diverse set of operations from structured, high-bandwidth, data-intensive, low-level and intermediate-level operations such as filtering and feature extraction, to irregular, low-bandwidth, control-intensive, high-level operations such as classification. Since the most resource-demanding operations in terms of required computations and memory bandwidth involve low-level and intermediate-level operations, considerable research has been devoted to developing hardware architectural features for eliminating bottlenecks within the image/video processing chain, freeing up more time for performing high-level interpretation operations. While the major focus has been on speeding up low-level and intermediate-level operations, there have also been architectural developments to speed up high-level operations.

From the literature, one can see that there are three major architectural features essential to any image/video processing system, namely single instruction multiple data (SIMD), very long instruction word (VLIW), and an efficient memory subsystem. The concept of SIMD processing is a key architectural feature found in one way or another in most modern real-time image/video processing systems [20, 35, 65]. It embodies broadcasting a single instruction to multiple processors, which simultaneously execute the instruction on different portions of data in parallel, thus allowing more computations to be performed in a shorter time [65]. This mode of processing fits low-level and intermediate-level operations well, as they require applying the same operation to different pixel data. Naturally, SIMD can also be used to speed up matrix–vector operations.

The SIMD concept has been used extensively since the 1980s, as evident from its widespread use in vision accelerator boards, instruction set extensions for general-purpose processors (GPPs), and packed data processing of digital signal or media processors. In fact, the most common instantiation of the SIMD concept in today’s GPPs and digital signal and media processors is in the form of the packed data processing extension, also known as subword parallelism or wordwide data optimization [32, 65, 81, 124]. These extensions have primarily been developed to help speed up the processing of multimedia data. Since pixel data are usually represented by 8 bits or 16 bits, and since most modern processors have 32-bit registers, packed data processing allows packing four 8-bit pixels or two 16-bit pixels into a 32-bit register and then issuing an instruction that operates on the individual 8-bit or 16-bit pixels at the same time. These types of packed data instructions not only alleviate the computation burden of low-level and intermediate-level operations, but also help to reduce memory access bottlenecks because multiple pixel data can be read using one instruction. Packed data processing is a basic form of SIMD. In general, SIMD is a useful tool for speeding up low-level, intermediate-level, and matrix–vector operations on modern processors. Thus, one can think of SIMD as a tool for exploiting data level parallelism (DLP).
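As a concrete illustration of packed data processing, consider the following minimal C sketch, which is ours rather than drawn from the cited systems; it uses the x86 SSE2 intrinsics to add two rows of 8-bit pixels with saturation, sixteen pixels per instruction, and assumes the row width is a multiple of 16:

    #include <emmintrin.h>  /* SSE2 packed data intrinsics */
    #include <stddef.h>

    /* Add two 8-bit grayscale rows with saturation, processing 16
       packed pixels per instruction rather than one pixel at a time. */
    void add_rows_packed(const unsigned char *a, const unsigned char *b,
                         unsigned char *out, size_t width)
    {
        for (size_t i = 0; i < width; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            __m128i vs = _mm_adds_epu8(va, vb); /* saturating add, 16 pixels */
            _mm_storeu_si128((__m128i *)(out + i), vs);
        }
    }

A scalar version of this loop would issue one add per pixel; the packed version performs the same work with one sixteenth of the instructions, which is precisely the DLP benefit described above.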

While SIMD can be used for exploiting DLP, VLIW can be used for exploiting instruction level parallelism (ILP) [65], and thus for speeding up high-level operations [20]. VLIW furnishes the ability to execute multiple instructions within one processor clock cycle, all running in parallel, hence allowing software-oriented pipelining of instructions by the programmer. For VLIW to work properly, there must be no dependencies among the data being operated on; given that, the ability to execute more than one instruction per clock cycle is essential for image/video processing applications that require operations on the order of giga operations per second [20].
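Although ILP is ultimately exploited by the compiler and the VLIW hardware, the programmer can expose it by writing dependency-free loop bodies. The sketch below is a generic illustration of ours, not tied to any particular DSP: it unrolls a pixel-scaling loop into four independent operations that a VLIW compiler can schedule into parallel functional units, assuming n is a multiple of 4.

    /* The four statements in the loop body have no data dependencies,
       so a VLIW compiler can pack them into long instruction words and
       issue them in parallel. */
    void scale_pixels(short *dst, const short *src, short gain, int n)
    {
        for (int i = 0; i < n; i += 4) {
            dst[i]     = (short)((src[i]     * gain) >> 8);
            dst[i + 1] = (short)((src[i + 1] * gain) >> 8);
            dst[i + 2] = (short)((src[i + 2] * gain) >> 8);
            dst[i + 3] = (short)((src[i + 3] * gain) >> 8);
        }
    }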

Of course, while SIMD and VLIW can help speed up the processing of diverse image/video operations, the time saved through such mechanisms would be completely wasted if there did not exist an efficient way to transfer data throughout the system [35]. Thus, an efficient memory subsystem is considered a crucial component of a real-time image/video processing system, especially for low-level and intermediate-level operations that require massive amounts of data transfer bandwidth as well as high-performance computation power. Concepts such as direct memory access (DMA) and internal versus external memory are important here. DMA allows data to be transferred within a system without burdening the CPU with the transfers, and is a well-known tool for hiding memory access latencies, especially for image data.
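The following sketch illustrates the double-buffering (ping-pong) pattern that DMA enables. It is a minimal example under stated assumptions: dma_start_read(), dma_wait(), and process_line() are hypothetical stand-ins for a vendor-specific DMA API and the user's processing kernel, since the actual calls differ from device to device.

    #include <stddef.h>

    #define MAX_COLS 1024  /* assumes cols <= MAX_COLS */

    /* Hypothetical vendor DMA API and user kernel (names are ours). */
    extern void dma_start_read(void *dst, const void *src, size_t bytes);
    extern void dma_wait(void);
    extern void process_line(unsigned char *line, int cols);

    /* While the CPU processes one image line held on-chip, the DMA
       engine fetches the next line from external memory in the
       background. */
    void process_image(const unsigned char *ext_img, int rows, int cols)
    {
        static unsigned char line[2][MAX_COLS]; /* on-chip ping-pong buffers */
        int cur = 0;

        dma_start_read(line[cur], ext_img, cols);        /* prime the pipe */
        for (int r = 0; r < rows; ++r) {
            dma_wait();                                  /* current line ready */
            if (r + 1 < rows)                            /* prefetch next line */
                dma_start_read(line[cur ^ 1],
                               ext_img + (size_t)(r + 1) * cols, cols);
            process_line(line[cur], cols);               /* compute overlaps I/O */
            cur ^= 1;                                    /* swap buffers */
        }
    }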

Efficient use of any available on-chip memory is also critical, since such memory can be accessed at a faster rate than external memory. Memory usage optimization techniques are discussed further in Chapter 4.
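As a simple software illustration of this point, the generic sketch below (ours) performs an image transpose in small tiles sized to fit in fast memory. A naive transpose strides a full image row apart on nearly every access; the tiled version keeps its working set resident, which is the same locality idea that the custom addressing schemes discussed next exploit in hardware.

    #define TILE 32  /* tile edge, chosen so one tile fits in fast memory */

    /* Transpose a row-major image in TILE x TILE blocks.  The inner
       loops touch only a small 2D neighborhood at a time, preserving
       the spatial locality that flat 1D addressing would destroy. */
    void transpose_tiled(unsigned char *dst, const unsigned char *src,
                         int rows, int cols)
    {
        for (int tr = 0; tr < rows; tr += TILE)
            for (int tc = 0; tc < cols; tc += TILE)
                for (int r = tr; r < tr + TILE && r < rows; ++r)
                    for (int c = tc; c < tc + TILE && c < cols; ++c)
                        dst[c * rows + r] = src[r * cols + c];
    }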

One key problem with current memory subsystems is that they were originally designed for one-dimensional data access and thus cannot properly exploit the spatial locality present in two-dimensional or three-dimensional image data. Some researchers have dealt with this problem by designing custom memory addressing schemes that allow more efficient memory access of image data, as will be seen in the examples section of this chapter. Before presenting these examples, it is useful to give an overview of the standard processor architectures and their advantages/disadvantages for real-time image/video processing.

3.3 OVERVIEW OF CURRENTLY AVAILABLE PROCESSORS

3.3.1 Digital Signal Processors

Digital signal processors are well known for their high performance, low power consumption, and relatively small size, which enable them to accelerate computationally intensive tasks on embedded devices. While it may have been true in the past that digital signal processors (DSPs) were not suitable for processing image/video data, in that they could not meet real-time requirements for video rate processing, this is no longer the case with newly available high-performance DSPs that contain specific architectural enhancements addressing the data/computation throughput barrier.

DSPs have been optimized for repetitive computation kernels with special addressing modes for signal processing such as circular or modulo addressing. This helps to accelerate the critical core routines within the inner loops of low-level and intermediate-level image/video processing operations. In many DSP implementations, it is observed that a large percentage of the execution time is due to a very small percentage of the code, which simply emphasizes the fact that DSPs are best for accelerating critical loops with little branching and control logic, the latter being best handled by a GPP [1]. DSPs also provide saturated arithmetic operations, which are useful in image/video processing to avoid pixel wraparound from a maximum intensity level to a minimum level or vice versa [124]. DSPs possess either a fixed-point or a floating-point CPU, depending on the required accuracy for a given application. In most cases, a fixed-point CPU is more than adequate for the computations involved in image/video processing.
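As a simple illustration of why saturation matters, the short C sketch below (ours) clamps rather than wraps; a DSP performs this clamp in hardware in a single instruction, whereas a processor without saturating arithmetic pays for an explicit compare per pixel:

    /* Saturating 8-bit add: without saturation, 250 + 10 wraps around
       to 4 (near black) instead of clamping at 255 (full white). */
    static unsigned char sat_add_u8(unsigned char a, unsigned char b)
    {
        unsigned int s = (unsigned int)a + (unsigned int)b;
        return (unsigned char)(s > 255u ? 255u : s);
    }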

DSPs also have predictable, deterministic execution times, a critical feature for ensuring that real-time deadlines are met. In addition, DSPs have highly parallel architectures with multiple functional units and VLIW/SIMD features, further proving their suitability for image/video processing. DSPs have been designed with high memory bandwidth in mind, featuring on-chip DMA controllers, multilevel caches, buses, and peripherals that allow efficient movement of data on- and off-chip between memories and other devices. DSPs support the use of real-time operating systems (RTOSs), which again help in guaranteeing that critical system-level hard real-time deadlines are met. Of course, DSPs are fully programmable, which gives them the flexibility to accommodate algorithm updates. Modern development tools such as efficient C code compilers and hardware-specific intrinsic functions have supplanted the need to generate hand-coded assembly for all but the most critical core loops, leading to more efficient development cycles and faster time-to-market.

Indeed, DSPs contain specific architectural features that help speed up repetitive, compute-intensive signal processing routines, making them a viable option for inclusion in a real-time image/video processing system. That is why DSPs have been used in many real-time image/video processing systems. More recently, DSPs have been included as a core in dual-core system-on-chips for consumer electronics devices such as PDAs, cell phones, digital cameras, portable media players, etc.

3.3.2 Field Programmable Gate Arrays

Field programmable gate arrays (FPGAs) are arrays of reconfigurable complex logic blocks with a network of programmable interconnects [148]. The number of gates and the capabilities of FPGAs are expected to continue to grow in future generations. FPGAs allow fully application-specific custom circuits to be designed by means of a hardware description language (HDL). They provide precise execution times, helping to meet hard real-time deadlines, and can be configured to interface with various external devices. Since they are reprogrammable devices, they are flexible in the sense that they can be reconfigured to form a completely different circuit. Current generation FPGAs can be either fully or partially reconfigured, with reconfiguration times of less than 1 ms, making dynamic run-time reconfiguration possible. This capability is useful for reducing the system size of embedded devices. The interested reader can refer to [2, 14, 22, 83, 140] for more information on run-time reconfiguration of FPGAs for image/video processing applications.

Due to their programmable nature, FPGAs can be programmed to exploit different types of parallelism inherent in an image/video processing algorithm. This in turn leads to highly efficient real-time image/video processing for low-level, intermediate-level, or high-level operations, enabling an entire imaging system to be implemented on a single FPGA. In general, FPGAs have extremely high memory bandwidth. As a result, one can use custom memory configurations and/or addressing techniques to exploit data locality in high-dimensional data.

In many cases, FPGAs have the potential to meet or exceed the performance of a single DSP or multiple DSPs. FPGAs can be thought of as combining the flexibility of software programmability with the speed of an application-specific integrated circuit (ASIC), within a shorter design cycle or time-to-market. Often an FPGA implementation is the first step toward transitioning to an ASIC, or in some cases it is the final product. The main disadvantage associated with FPGAs has been their relatively poor power efficiency, although low-power FPGAs are becoming more available.

In essence, FPGAs have high computational and memory bandwidth capabilities that are essential to real-time image/video processing systems. Because of such features, there has been an increasing interest in using FPGAs to solve real-time image/video processing problems [38]. FPGAs have already been used to solve many practical real-world, real-time image/video processing problems, from a preprocessing component to the entire processing chain.

FPGAs have also been used in conjunction with DSPs. A current trend in FPGAs is to include a GPP core on the same chip as the FPGA for a customizable system-on-chip (SoC) solution.

3.3.3 Multicore Embedded System-on-Chip

In the consumer electronics market, there has been a drive toward single-chip solutions or SoCs for portable embedded devices, which require high-performance computation and memory throughput coupled with low power consumption in order to meet the real-time image/video processing constraints of battery-powered products such as digital cameras, digital video camcorders, camera-equipped cell phones, etc. These systems exhibit elegant designs from which one can learn how industry has approached the battery-powered embedded real-time image/video processing problem.

For example, consider the TMS320DM320 “digital media processor” manufactured by Texas Instruments [159]. This is a multiprocessor chip that couples a reduced instruction set computer (RISC) microprocessor with a low-power fixed-point DSP. The RISC microprocessor serves as the master, handling system control, running an RTOS, and providing the necessary processing power for complex control-intensive operations. The DSP, acting as a slave to the RISC, is a low-power component for performing computationally intensive signal processing operations.

The presence of a memory traffic controller allows high-throughput access to memory. In this device, the RISC and DSP are accompanied by a set of parameter-customizable application-specific processors that provide a “boost,” that is, the extra computational horsepower necessary to perform functions such as real-time LCD preview (Preview Engine) and real-time computation of the low-level statistics needed for autoexposure, auto-white balance, and autofocus (H3A Engine). The DSP, along with its accelerators and dedicated image processing memory buffers, provides the high computation throughput and memory bandwidth needed for performing various image/video processing functions such as rendering the final captured image through the image pipeline and running image/video compression routines.

By examining this architecture, one can see that this SoC has been designed with a DSP plus dedicated hardware accelerators for low-level and intermediate-level operations, along with a GPP for more complex high-level operations. This is an illustrative example showing that a complete real-time image/video processing system can be characterized as a heterogeneous architecture with a computation-oriented front end coupled with a general-purpose processing back end. Of course, the TMS320DM320 is just one good example of many currently available multiprocessor embedded SoCs. In fact, as will be seen in the examples section, low-power, moderate-performance DSPs plus accelerators have been widely used by many research groups in the form of DSP/FPGA hybrid systems, most likely due to the cost issues associated with ASIC development.

An interesting recent hardware development for digital imaging is the Texas Instruments DaVinci technology, which couples an Advanced RISC Machines (ARM) processor with a high-performance C64x DSP core [160]. This technology provides the necessary processing and memory bandwidth to achieve a complete imaging SoC. Examples of research on multicore embedded SoCs for digital camera applications can be found in [52, 78, 79, 108, 115, 116], which cover the development and implementation of the automatic white balancing, automatic focusing, and zoom tracking algorithms encountered in today’s digital camera systems.

3.3.4 General-Purpose Processors

There are two types of GPPs on the market today, one geared toward nonembedded applications such as desktop PCs and the other geared toward embedded applications. Today’s desktop GPPs are extremely high-performance processors with highly parallel architectures, containing features that help to exploit ILP in control-intensive, high-level image/video operations.

SIMD extensions have also been incorporated in their instruction sets, allowing such processors to exploit DLP and enabling moderate acceleration of multimedia operations corresponding to low-level and intermediate-level image/video processing operations. GPPs are also outfitted with multilevel caches, which provide the potential for low-latency memory accesses to frequently used data. These processors require an RTOS in order to guarantee real-time execution. Desktop GPPs are characterized by their large size, requiring a separate chip set for proper operation and communication with external memory and peripherals.

Although GPPs have massive general-purpose processing power, they are extremely high-powered devices requiring hundreds of watts of power. Clearly, such processors are not suitable for embedded applications. Despite this fact, advances in desktop GPPs have allowed standard commercial off-the-shelf desktop PCs to be used for implementing nonembedded real-time image/video processing systems. In [100], it is even claimed that the desktop PC is the de facto standard for industrial machine vision applications, where there is usually enough space and power available to handle a workstation. It should be noted that such industrial inspection systems usually augment the processing power of the desktop GPP with vision accelerator boards.

These boards often furnish a dedicated SIMD image/video processor for high-performance real-time processing beyond what the SIMD extensions of the desktop GPP can normally provide. Recently, a paradigm shift toward multicore processor designs for desktop PCs has occurred in order to continue making gains in processor performance.

On the embedded front, there are also several GPPs available on the market today with high-performance general-purpose processing capability suitable for exploiting ILP, coupled with low power consumption and SIMD-type extensions for moderately accelerating multimedia operations, enabling the exploitation of DLP for low-level and intermediate-level image/video processing operations. Embedded GPPs have been used in multicore embedded SoCs, providing the horsepower to cope with control- and branch-intensive instructions. Both embedded and desktop GPPs are supported by mature development tools and efficient compilers, allowing quick development cycles. While GPPs are quite powerful, they are neither created nor specialized to accelerate massively data parallel computations.

3.3.5 Graphics Processing Unit

The early 2000s witnessed the introduction of a new type of processor, the graphics processing unit (GPU). The primary function of such processors is the real-time rendering of three-dimensional (3D) computer graphics, enabling the fast frame rates and higher levels of realism required for state-of-the-art 3D graphics in modern computer games. While the original GPUs were fixed-function accelerators, current generation GPUs incorporate more flexibility through ever-increasing amounts of programmability, with programmable vertex and texture/fragment units that are useful for customizing the rendering of 3D computer graphics. GPUs can also be used for accelerating computations with inherent DLP. In terms of performance, for example, an Intel 3.0-GHz Pentium 4 GPP provides 12 GFLOPS peak floating-point computational performance and 5.96-GB/s memory throughput, while the ATI Radeon X1800XT GPU provides 120 GFLOPS peak floating-point performance with 42-GB/s memory throughput [64]. This shows that GPUs can provide huge increases in GFLOPS performance and memory throughput over those of a high-performance desktop GPP.

Due to their floating-point calculation capabilities, the increased levels of programmability, and the fact that GPUs can be found in almost every desktop PC today, many researchers have been looking into ways to exploit GPUs for applications other than the real-time rendering of 3D computer graphics, an area of research referred to as general-purpose processing on the graphics processing unit (GPGPU). GPUs have already been deployed to solve real-time image/video processing problems including complete computer vision systems [21, 50], medical image reconstruction in magnetic resonance imaging (MRI) and ultrasonic imaging requiring FFT [136], stereo depth map computation [153], and subpixel accurate motion estimation at video rates [82]. A recent survey paper on the state-of-the-art in GPGPU [110] also presents several examples of how the power of GPUs has been applied to calculation-intensive problems in signal and image processing, including identifying a 3D surface embedded in an MRI volume image (considered a difficult medical image segmentation problem), image registration, real-time simultaneous computation and visualization of motion, and tomography reconstruction.

To understand how a GPU can be used to perform image processing, let us take a look at its graphics pipeline [82]. The pipeline begins with a set of vertices, or points in a 3D space, that define graphics primitives; vertex shaders are then applied to allow programmable control over vertex transformations. Once the graphics primitives have been defined, textures are mapped onto them to add scene details. Texture shaders, known as fragment programs or kernel fragments, can also be applied to the textures. In the final stage, the pixels are rendered either to a display frame buffer or to an offscreen rendering buffer, known as a pixel buffer, for further texture processing through the graphics pipeline. Thus, image processing on a GPU can be performed by downloading an image to the GPU as a texture, rendering a rectangle the size of the image with the image mapped onto it as a texture, and then applying a kernel fragment program to process the image, taking advantage of the massive computation power of the GPU.
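A minimal C sketch of this flow, using legacy OpenGL 2.0 calls, is given below; context creation, viewport setup, shader compilation, and error checking are omitted, and prog is assumed to hold an already-compiled fragment program (the per-pixel kernel):

    #include <GL/gl.h>  /* assumes a GL 2.0 context and loader are set up */

    void gpu_process_image(GLuint prog, const void *pixels, void *result,
                           int width, int height)
    {
        GLuint tex;

        /* 1. download the image to the GPU as a texture */
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);

        /* 2. bind the kernel fragment program */
        glUseProgram(prog);

        /* 3. render an image-sized rectangle; each covered pixel
              invokes the fragment program once, i.e., one kernel
              call per pixel */
        glBegin(GL_QUADS);
        glTexCoord2f(0, 0); glVertex2f(-1, -1);
        glTexCoord2f(1, 0); glVertex2f( 1, -1);
        glTexCoord2f(1, 1); glVertex2f( 1,  1);
        glTexCoord2f(0, 1); glVertex2f(-1,  1);
        glEnd();

        /* 4. read the processed pixels back (historically the bottleneck) */
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, result);
    }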

One key drawback of GPUs has been the data read-back throughput over the Peripheral Component Interconnect (PCI) bus, but this is expected to be mitigated with the introduction of the PCI Express bus standard. One important item to note is that, just like desktop GPPs, GPUs are also high-powered devices drawing hundreds of watts of power.

Although low-power GPUs for embedded applications are becoming more available, it is currently not known how well these embedded GPUs will fare in GPGPU applications. The reader is referred to [158] for more information on GPGPU.

3.4 EXAMPLE SYSTEMS

3.4.1 DSP-Based Systems

Due to their high computation performance coupled with their low power consumption, DSPs have been used extensively in embedded devices for accelerating the computation-heavy components of image/video processing algorithms in many applications such as image filtering, video surveillance, and object recognition.

3.4.1.1 Image Filtering Operations

Image filtering operations are well suited for implementation on DSP platforms, as their regular, repetitive looping computation structure fits the DSP architecture well. Because of this, many attempts have been made at implementing image filtering operations on DSPs. One of the more computationally challenging filtering problems involves the use of nonlinear filters. Such filters are often employed to remove impulsive noise while at the same time maintaining the integrity of edges. One research group has consistently shown the usefulness of a single-chip, high-performance DSP for performing real-time nonlinear image filtering.

Several examples of their implementations of nonlinear filtering algorithms can be found in [51, 117–120]. Due to the high computational complexity of the algorithms, high-performance, floating-point VLIW DSPs were chosen for the implementations. While the TMS320C6701 DSP running at 167 MHz was used in [51, 117, 118], the TMS320C6711 DSP running at 150 MHz was used in [119, 120]. In [51, 118, 119], it was shown that by using the DSP platform, real-time, video rate, edge-preserving nonlinear filtering could be achieved for Quarter Common Intermediate Format (QCIF) sized video sequences. In [120], the extension of the nonlinear filtering algorithms to 3D was demonstrated using a single-chip DSP platform.

3.4.1.2 Computationally Complex Operations

Another computationally complex algorithm for which a single-chip, high-performance DSP has been utilized is automatic color reduction in the CbCr chrominance color space using the two-dimensional (2D) version of a multiscale clustering algorithm [107]. For this application, the fixed-point VLIW TMS320C6201 DSP running at 133 MHz was used as the implementation platform. The performance achieved was 20 s per 256 × 256 image, showing just how computationally demanding an algorithm can be even on a high-performance DSP.

3.4.1.3 Entire Image Processing Chains

In contrast to single-chip, high-performance DSP solutions, multichip, low-performance DSP solutions have also been popular implementation platforms, especially for low-cost implementation of a complete image/video processing chain. One example of such a chain can be seen in [69], where a video surveillance system for detecting people in complex outdoor environments subject to lighting and background changes was implemented using a total of nine low-performance TMS320C40 DSPs connected via their data bus and communication ports. This setup was able to achieve a satisfactory real-time performance of 15 fps for 1024 × 256 × 32-bit video.

Depending on the complexity of the algorithm involved, sometimes a single-chip DSP platform is adequate for implementing a complete system. A good example of such a case can be seen in [98], where the problem of real-time recognition by a small autonomous soccer-playing robot was presented. Due to the dynamic nature of the environment in which the system was to operate, robust operation against changing lighting conditions and partial occlusions at 60 fps was required. To meet these requirements, a single-chip, high-performance DSP platform was chosen over an FPGA platform in order to obtain lower power consumption, easier development, and lower total cost. The Analog Devices ADSP-BF533 Blackfin DSP was chosen, furnishing 1200 million multiply-and-accumulate (MAC) operations per second at 600 MHz while consuming 280 mW. This met the requirements of a power consumption below 500 mW and an estimated required performance of 800 million MAC operations per second.

From the above examples, it can be seen that both single high-performance DSPs and multiple low-performance DSPs have been used to implement image/video processing algorithms. As mentioned earlier, only recently have DSPs been equipped with the ability to process image/video data at video rates [160]. It is expected that the use of DSPs will continue to grow well into the future.

3.4.2 FPGA-Based Systems

Due to their flexibility in implementing custom hardware solutions, FPGAs have been used extensively for implementing a single component of an image or video processing system all the way up to the entire system. The main reason often cited for using FPGAs over other platforms is that they provide a low-cost, flexible development of high-performance, custom parallel processors, suitable for transitioning almost any kind of image/video processing algorithm from a development environment to a real-time implementation.

3.4.2.1 Image Filtering Operations

FPGAs have been used to implement various types of image filtering. An example of 2D nonlinear image filtering implemented on a field programmable logic device (FPLD) appears in [131], where the problem of real-time mammogram contrast enhancement was addressed.

The slow execution time in software motivated the search for a hardware solution. The FPLD was chosen as the implementation platform for its ability to achieve higher processing performance than GPPs and DSPs, and for its flexible development characteristics.

The reported results showed that this implementation allowed the filtering to be performed within 98 ms with an 8-MHz clock and 23 ms with a 33.3-MHz clock for 512 × 512 × 8-bit mammograms. Due to the flexibility of the FPLD, the modifications needed for 12-bit accuracy were easy to incorporate into the system, requiring only changes to the filter data path. Another example of 2D image filtering can be found in [59], where the real-time feasibility of using fuzzy morphological filters for processing image/video sequences was discussed. An implementation of these filters was performed using the Xilinx Virtex XCV300 FPGA, achieving a performance of 179 fps for 512 × 512 images. The reported results showed that the fuzzy morphological filters outperformed other filters.

Recently, 3D image filtering operations have been implemented on FPGA platforms. In these implementations, custom memory access schemes combined with high-performance memory subsystems have enabled the real-time throughput essential for 3D image processing.

One example of such an implementation can be found in [24], where real-time anisotropic diffusion filtering was performed on 3D ultrasonic images for the removal of speckle noise. Due to the complexity of anisotropic diffusion filtering and the large amount of data in 3D images, the software implementation could not generate a real-time throughput. A hardware implementation based on a single FPGA was chosen to meet the real-time requirement. The key aspects of the architecture included a custom, efficient 3D image data access scheme called “brick buffering,” which allowed an optimized, high-throughput access to 3D image data. The implementation was performed on an Altera Stratix II EP260F484C3ES FPGA running at a 200-MHz clock rate with two parallel 100-MHz, 32-bit external SDRAMs for input/output.

A real-time performance of 24 iterations per second for 128 × 128 × 128 images was achieved. Another example showing the power of an FPGA for accelerating 3D image preprocessing tasks can be found in [145], where the problem of 3D median and convolution filtering for 3D medical image processing was considered. Since software implementations could not meet the real-time constraints, a hardware-based solution using an FPGA was considered.

The FPGA platform was chosen due to its high-performance capabilities and its flexibility. This platform made it possible to implement the median filtering and convolution operations using fast multipliers. The implementation was performed on the Xilinx Virtex II Pro 2VP125FF1696-6 FPGA, achieving a performance of 95 fps for 128 × 128 × 128 images and 12 fps for 256 × 256 × 256 images.

3.4.2.2 Low-Level Operations

Normally, a single-chip FPGA solution is used specifically for accelerating low-level operations, passing the results on to a GPP for high-level interpretation operations. For example, in [13], the problem of controlling the exposure time of a charge-coupled device camera in real time was considered using a low-level histogram-based measure. Due to the large amount of data that needed to be processed, a hardware-based solution was deemed necessary to meet the real-time requirement. The developed solution was a combination of an FPGA and software running on a host PC. The FPGA was used for the histogram and noise-level calculations, the results of which were then sent to the GPP for further processing.

Another example exhibiting the use of an FPGA for accelerating low-level operations can be seen in [53], where the problem of estimating the position and velocity of objects with a three-camera stereo vision system was addressed using area-based features. A key requirement of the system was that it had to process images at video rates, posing the need for dedicated hardware to meet the real-time requirement. FPGA technology was chosen as the hardware part of a software/hardware-based solution. The tracking problem was broken up into several subtasks including segmentation, correspondence, and motion estimation. The low-level image processing operations of segmentation, noise filtering, and area measurement were implemented on the FPGA, while the higher level operations consisting of extended Kalman filtering and prediction were implemented in software. The utilized FPGA was an Altera Flex 10K100, which met the real-time requirement of the PAL video rate.

3.4.2.3 Standard Image Processing Operations

FPGAs have also been used for implementing basic, but computationally expensive tasks encountered in many image/video processing applications such as edge detection, moment calculation, and Hough transform.

Regarding edge detection, in [66], a subpixel edge detection algorithm for an industrial inspection application was presented. A single-FPGA platform was chosen as the implementation vehicle over analog processing, a custom VLSI solution, and a hybrid FPGA/DSP solution. A single FPGA, combined with a computationally simple algorithm, provided the required detection rate while reducing computation and system cost. A Xilinx XC4005E FPGA was used along with the Xilinx Foundation ISE tools to synthesize and implement the developed VHDL code. The implementation was able to process high-resolution 1024-pixel line-scan images at 2000 fps using a 200-MHz clock.

Geometric moments are used extensively as key image features in many image/video processing applications, but due to the computational complexity involved in their calculation, their real-time computation is not easily achieved. The problem of computing geometric moments in real time was considered in [88]. Since real-time performance could not be achieved using standard processors, an FPGA solution was considered. The developed algorithm was implemented on an Altera EP1K50TC144-1 FPGA, a member of the ACEX 1K FPGA family, using the MAX+PLUS II environment. The results showed that ten moments of a 1-megapixel image could be computed within 25 ms.
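For reference, the geometric moment of order (p + q) of an image I is m_pq = Σ_x Σ_y x^p y^q I(x, y). The brute-force C sketch below (a generic illustration of ours, not the algorithm of [88]) makes the cost explicit, since every pixel contributes a multiply-accumulate chain for every requested moment:

    /* Brute-force geometric moment m_pq of a grayscale image. */
    double geometric_moment(const unsigned char *img, int rows, int cols,
                            int p, int q)
    {
        double m = 0.0;
        for (int y = 0; y < rows; ++y)
            for (int x = 0; x < cols; ++x) {
                double t = img[y * cols + x];
                for (int k = 0; k < p; ++k) t *= x;  /* x^p */
                for (int k = 0; k < q; ++k) t *= y;  /* y^q */
                m += t;
            }
        return m;
    }

For ten moments of a 1-megapixel image, this amounts to tens of millions of multiply-accumulates per frame, which clarifies why a hardware implementation was pursued.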

FPGAs have also been used to accelerate the computationally complex Hough transform, which forms the basis of many image/video processing algorithms. In [39], an efficient Hough transform implementation was presented. Noting that the Hough transform in general requires a great deal of computational horsepower, a hardware approach was desired, and an FPGA implementation was chosen over an ASIC for its flexibility.

The Hough transform algorithm was also simplified before the implementation on the FPGA by reducing the use of lookup tables and increasing the parallelism of the calculations.

The Xilinx Virtex II XC250-5FG456C FPGA was chosen, which at a clock rate of 606 MHz generated four line values in parallel every 12 ns.
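For context, the classical line Hough transform is sketched below in C (a generic illustration of ours, not the implementation of [39]): every edge pixel votes for all (ρ, θ) lines passing through it, with ρ = x cos θ + y sin θ. The per-pixel votes are mutually independent and the trigonometric terms come from lookup tables, which are exactly the aspects that an FPGA implementation can simplify and parallelize.

    #include <math.h>

    #define N_THETA 180
    #define N_RHO   512

    /* Accumulate line votes: acc[t][r] counts the edge pixels
       consistent with the line (rho(r), theta(t)).  The accumulator
       must be zero-initialized by the caller. */
    void hough_lines(const unsigned char *edges, int rows, int cols,
                     unsigned int acc[N_THETA][N_RHO])
    {
        const double pi = 3.14159265358979323846;
        double sin_t[N_THETA], cos_t[N_THETA];
        double rho_max = sqrt((double)rows * rows + (double)cols * cols);

        for (int t = 0; t < N_THETA; ++t) { /* precompute lookup tables */
            sin_t[t] = sin(t * pi / N_THETA);
            cos_t[t] = cos(t * pi / N_THETA);
        }
        for (int y = 0; y < rows; ++y)
            for (int x = 0; x < cols; ++x)
                if (edges[y * cols + x])    /* each edge pixel votes */
                    for (int t = 0; t < N_THETA; ++t) {
                        double rho = x * cos_t[t] + y * sin_t[t];
                        int bin = (int)((rho + rho_max) *
                                        (N_RHO - 1) / (2.0 * rho_max));
                        acc[t][bin]++;
                    }
    }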

3.4.2.4 Compression Operations

When dealing with the compression of a huge amount of data such as HDTV data, FPGAs can help to cope with the massive data throughput, enabling an effective means of achieving real-time compression.

For example, in [144], the problem of computing the 2D discrete, biorthogonal wavelet transform (DBWT) for HDTV video compression was discussed. It was noted that the computation of the DBWT was the most time-consuming part of the video compression algorithm and that software implementations were not able to meet the real-time requirement, thus requiring a hardware-based solution. An FPGA was chosen as an alternative to an ASIC solution and for its flexible architecture that allowed it to be quickly reconfigured using user-defined adjustable compression parameters. The implementation and verification were performed on the Celoxica RC1000-PP PCI-based FPGA development board containing a Xilinx Virtex 2000E FPGA. The implementation was able to achieve a real-time performance of 286, 139, and 121 fps, respectively, for the three wavelet decomposition levels on 1280 × 720 resolution input images and 127, 62, and 54 fps, respectively, for the three wavelet decomposition levels on 1080 × 1920 resolution input images.

Another example of a real-time compression application enabled by the use of an FPGA can be seen in [106], where the problem of real-time image compression for high-speed cameras was addressed. High-speed cameras are characterized by extremely high frame rates on the order of thousands of frames per second. Standard image compression methods are simply not capable of keeping up with the high input data rate of these cameras, hence the need for a hardware solution. A compression engine was proposed using 32 parallel image compression circuits. In total, seven FPGAs were employed in the design: one for an input buffer, one for an output buffer, one for a master controller, and four for implementing the 32 parallel image compression circuits. A frame rate of 2000 fps was achieved for 512 × 512 images.

3.4.2.5 Entire Image Processing Chains

Entire image processing systems have also been implemented using FPGAs, including face detection and tracking, inspection, stereo vision, and 3D image registration systems.

Recently, much research has been done on practical systems for face detection and tracking. One such system has been developed and shown to work in real-time on an FPGA device in [112], where face detection and tracking was considered in the resource constrained environment of an embedded mobile device. It was stated that many methods for face detection were too resource demanding for an embedded device, requiring extremely high-performance computation power, large amounts of memory, and floating-point operations. A simplified algorithm was developed and implemented on an Altera EP20K1000EBC652-1 FPGA. The FPGA was chosen as the implementation vehicle for rapid prototyping, enabling a proof of concept test before a full VLSI implementation. The achieved performance was 434 fps at a clock rate of 33 MHz.

Inspection systems often require a high-performance processing subsystem in order to cope with the hard real-time constraints involved. An FPGA is well suited for such applications. As discussed in [67], a low-cost, high-performance system for checking multiple-choice question sheets was developed by using a high-speed optical mark reader (OMR). OMRs are used for processing large amounts of data in a relatively short amount of time. However, the cost of such systems is often excessively high, limiting their widespread use. Thus, it was desired to implement the OMR system on a single FPGA to lower the system cost while retaining a customizable parallel processing capability. The system was implemented on a Xilinx Spartan-IIE XC2S300E FPGA using VHDL, and the implementation was able to process the 3456-pixel line sensor images at a rate of 5000 fps with a 20-MHz clock. The real-time operation at 5000 fps eliminated the need for large memory storage, thus reducing the system cost.

Another computationally intensive operation involves computing the depth information of a scene utilizing stereo image processing techniques. As shown in [7], an FPGA was used to achieve a real-time implementation of a dense disparity map computation using a correlation-based approach. This implementation was designed to minimize external memory accesses and to perform parallel processing of different correlation windows. The utilized FPGA was a Xilinx Virtex XCV800HQ240-6, which was able to produce a real-time performance of 60 disparity computations per second for 320 × 240 images.

In medical imaging, practical deployment of computationally demanding 3D image processing is of much interest. In [23], the calculation-intensive problem of 3D multimodality image registration, which is essential for practical deployment of image-guided medical procedures, was considered. Past solutions to this problem involved using a supercomputer implementation, which was not practical in a hospital setting. Thus, a custom hardware solution based on an FPGA was used, achieving a speedup comparable to that of a 64-processor parallel supercomputer.

The utilization of parallel memory accesses and a parallel calculation pipeline was the key to obtaining a considerable speedup. An FPGA running at 200 MHz with 100-MHz memory buses was used in conjunction with high-speed SDRAM and SRAMs. In addition, the required lookup table was implemented in one 512-K memory block, and all the calculations were implemented using 32-bit fixed-point numbers. The developed architecture was able to process 50 million voxels per second, providing the real-time throughput necessary for a practically useful image-guided medical treatment.

As one can see from the above diverse examples, FPGAs have been extensively deployed as flexible, custom processors for solving real-time image/video processing problems, primarily due to their ability to exploit different types of parallelism inherent in an image/video processing algorithm. However, it must be noted that, in general, most of these FPGA solutions are meant to be accelerators hosted by a PC rather than solutions for embedded devices.

3.4.3 Hybrid Systems

There have also been many examples in the literature regarding hybrid systems, which include some combination of DSP and FPGA processors. In these systems, an FPGA is often used as a preprocessor performing the function of a parallel pixel processor for low-level and intermediate-level operations, while a DSP is used for handling intermediate-level and high-level operations or other computationally simpler matrix–vector operations. Such systems have been shown to be capable of supporting the real-time demands of an entire image/video processing chain. Eight system examples are mentioned below to further illustrate the usefulness of a hybrid solution.

3.4.3.1 Image Segmentation Systems

An image segmentation system consists of a diverse set of operations. To meet the real-time requirements of an image segmentation system, a hybrid FPGA and DSP solution was used in [3] to implement the diverse set of operations involved. These operations included leveling, regularization, and reconstruction. The low-level operations of minimum extraction and lower/upper regularization were performed on an FPGA, while a TMS320C44 DSP was used for the implementation of the high-level operations of reconstruction and fusion of minima.

Another hybrid platform for image segmentation was reported in [4] and [5], where the segmentation problem based on thinning and crest restoration was considered. The only difference between the two references was a change in hardware: in [4], a Mirotech Arix board with a Xilinx Virtex XCV300 FPGA and a TMS320C44 DSP was used, while in [5], an XA10 Excalibur SoC with a 32-bit RISC ARM922T microprocessor core and an Apex 20KE PLD was used. In [4], the entire segmentation chain was implemented on the FPGA, which allowed processing of 512 × 512 images at 125 fps. In [5], the crest restoration was implemented in software running on the ARM processor to provide more flexibility in the implementation, but at the expense of a considerable reduction in performance, causing the entire chain to take 6 s to process one 512 × 512 image.

3.4.3.2 Industrial Inspection Systems

Industrial inspection systems are characterized by a diverse processing chain. For example, in [135], an automatic quality control system for textile fabrics was designed. The system was implemented by using an FPGA for the synchronization, a DSP for the high-level texture feature extraction and neural network classification, and a host PC for the defect detection and geometric feature extraction.

Another inspection application that was successfully implemented using a hybrid platform was reported in [125], where the problem of locating the intersection of horizontal and vertical crossbars with subpixel accuracy was addressed. The real-time requirement of the embedded system was subpixel-accurate location detection on 1024 × 1024 images at a rate of 50 fps, leaving an upper bound of 20 ms for the processing. To meet this requirement, a hybrid architecture was employed using three Xilinx XC4000 FPGAs for high data rate, low-level operations (area location, horizontal and vertical center-of-mass calculations) and two TMS320C44 DSPs for low data rate, high-level operations (linear regression for line calculation). The FPGAs were chosen for their fast arithmetic and internal RAMs and ROMs, while the DSPs were chosen for their four communication ports, floating-point arithmetic, DMA coprocessor, memory buses, and 2K-word RAM and cache. To synchronize the communication between the DSPs and the FPGAs, the DSP’s communication protocol was implemented on the FPGAs.

3.4.3.3 Video Compression Systems

Video encoding systems utilizing wavelet transform coding techniques can also benefit from a hybrid platform. Such an approach was considered in [31], where the entire Motion-JPEG2000 video encoder was implemented on the high-performance TMS320C6416 VLIW DSP, achieving encoding speeds at full video rates of 30 fps. In this application, an FPGA was used for merging the digitized image data fields into one frame, for transferring image data, and for providing overall system control, while the encoding was implemented entirely on the DSP.

3.4.3.4 Smart Camera Systems

Emerging smart cameras consist of a diverse set of image/video processing algorithms and are often implemented using hybrid platforms. One example of such platforms can be seen in [17], where a smart digital video camera surveillance system was introduced consisting of a combination of an FPGA and a multiprocessor DSP configuration. Two TMS320C6415T DSPs were chosen, each providing up to 8000 MIPS at 1 GHz and 1 MB of internal memory. The large amount of internal memory was necessary to achieve an efficient implementation. The FPGA provided the necessary glue logic for interfacing the DSP units to the image sensors. Multiple DSPs were used since no single-chip DSP solution existed to satisfy the processing requirements, which included processing of 720 × 576 color video streams as opposed to the small-resolution CIF and QCIF images commonly utilized in camera-equipped cell phones.

Similar to smart cameras, the image processing chain of autonomous navigation systems is also a good candidate for the utilization of a hybrid platform. For example, in [10], the problem of real-time underwater imaging for autonomous vehicle navigation at video rates was discussed. To address the needs of such a system, a 2D array of FPGA and DSP processors was constructed for pipelined, parallel processing. Each processing element consisted of a ping-pong style memory buffer, a TMS320C51 DSP for computation, and an FPGA for communication and low-level image processing operations. For this application, the FPGA performed the image processing tasks, while the DSP computed the angular displacement and distance parameters.

These examples have illustrated that various combinations of FPGAs and DSPs can be used to solve real-time image/video processing problems. In such systems, an FPGA usually performs low-level to intermediate-level operations, while a DSP handles intermediate-level to computation-oriented high-level operations. In general, hybrid platforms are suitable for real-time implementation of those image/video processing chains that incorporate a diverse set of operations.

3.4.4 GPU-Based Systems

GPU-based developments in the field of real-time image/video processing are fairly new. Therefore, only two examples from the literature are presented here: stereo depth map computation and subpixel motion estimation.

3.4.4.1 Stereo Vision System

The problem of computing a complete depth map for a stereo vision system was considered in [153]. It was pointed out that while the real-time calculation of stereo depth maps was possible with standard desktop GPPs, primarily due to advancements in clock speeds and SIMD instruction set extensions, such an implementation taxed the system to the extent that there were no resources left to perform high-level, control-intensive interpretation tasks. It was thus decided to make use of the computation power of the GPU by off-loading the depth map computation to it. This freed up resources to execute the high-level interpretation operations on the GPP.

The power of the GPU also allowed the use of advanced features, including multiresolution matching, adaptive windowing, and cross-checking, not found in standard implementations. The results indicated that 289 million disparity evaluations per second could be achieved on the ATI Radeon 9800 GPU for 512 × 512 images and a 94-pixel disparity range.

3.4.4.2 Motion Estimation System

The problem of subpixel accurate motion estimation for improving the quality and efficiency of the standard video compression schemes was considered in [82]. It was pointed out that most of the standards for video coding recommend using subpixel accurate motion estimation for the highest quality, the only caveat being that the interpolation operation presents a huge computational burden. The performance goal of the system was to perform subpixel accurate motion estimation for 720 × 576 images at 25 fps using the full search algorithm, which was not feasible without some hardware assistance. To meet this real-time requirement, a GPU was used to perform the interpolation and the block matching motion estimation algorithm.

The interpolation was performed using the GPU’s built-in bilinear interpolation function, and the motion estimation algorithm was restructured to make better use of the available resources on the GPU. In all, a fourfold speedup over a GPP implementation was achieved, the primary bottleneck being the data read-back bandwidth between the GPU and the PC over an Accelerated Graphics Port (AGP) bus. It was stated that better performance gains could be achieved with the newer PCI Express bus.
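To make the computational burden concrete, the following C sketch (ours, not the implementation of [82]) shows integer-pixel full-search block matching for a single 16 × 16 block; subpixel refinement would then interpolate around the best integer match, as done on the GPU above. The search window is assumed to lie entirely inside the reference frame.

    #include <limits.h>
    #include <stdlib.h>

    /* Exhaustively test every displacement in a +/-R window and keep
       the one minimizing the sum of absolute differences (SAD).  The
       cost is (2R+1)^2 candidate blocks of 256 pixels each, per block,
       per frame. */
    void full_search(const unsigned char *cur, const unsigned char *ref,
                     int cols, int bx, int by, int R,
                     int *best_dx, int *best_dy)
    {
        unsigned int best = UINT_MAX;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx) {
                unsigned int sad = 0;
                for (int r = 0; r < 16; ++r)
                    for (int c = 0; c < 16; ++c)
                        sad += (unsigned int)abs(
                            cur[(by + r) * cols + (bx + c)] -
                            ref[(by + dy + r) * cols + (bx + dx + c)]);
                if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
            }
    }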

As observed from these two examples, GPUs have the potential to solve computationally intensive, data parallel real-time image/video processing problems. The standard use of a GPU is to accelerate computationally intensive operations, leaving the GPP of its host free to handle other tasks. With GPU performance growing at an ever-increasing rate and the introduction of faster bus architectures, such as the PCI Express, the popularity of using GPUs for solving real-time image/video processing problems is expected to increase.

3.4.5 PC-Based Systems

PC-based systems have also been widely used for solving real-time image/video processing problems. Such systems are usually equipped with a camera and a frame grabber, using the PC as a host. Four examples of such systems are mentioned next.

3.4.5.1 Object Detection System

Object detection is a computationally complex problem, requiring a high-performance processor for practical implementations. In [152], the problem of object detection in real time was discussed. A point was made that while VLSI, ASIC, or FPGA solutions can be used to meet the real-time constraint of video rate object detection, such solutions require low-level hardware design that is often difficult for image processing developers unfamiliar with hardware design techniques. Thus, it was decided to use the Datacube MaxPCI vision accelerator board, which provided the necessary parallel computation power and high data throughput to process 1000 × 1000 images at 30 fps.

3.4.5.2 Computer Vision System

A computer vision system involves many diverse operations that map well to vision accelerator boards. For example, in [100], a generalized, scalable and modular architecture for a real-time computer vision application based on desktop PCs was presented. The architecture consisted of an image acquisition module and a PC-based processing module, where both modules could be scaled to handle more cameras and higher processing demands. The system was applied to an industrial inspection application involving quality control of TV screen manufacturing.

The implemented system made use of eight JAI CV-M10BX CCIR cameras and four Matrox Meteor II/MC frame grabbers, with the PCs equipped with dual Pentium III processors running at 600 MHz.

3.4.5.3 Video Segmentation System

Another computationally complex problem involves real-time segmentation of video data. It has been shown in [154] that such a system can be implemented using off-the-shelf components without the need for high-end and expensive frame grabbers. In this reference, the problem of image sequence segmentation based on global camera motion compensation, robust frame differencing, and curve evolution was discussed. A computationally efficient algorithm was developed and implemented on a PC with a 400-MHz Pentium processor. Video acquisition was done using a 3Com HomeConnect USB Web camera, which eliminated the need for a relatively expensive frame grabber; of course, such a camera is not a suitable replacement when higher resolutions and frame rates are required. The segmentation performance achieved was 5 fps for 160 × 120 images, keeping in mind that the implementation was done on a rather slow GPP.

3.4.5.4 Image Fusion System

Another example involving the successful use of a vision accelerator board is reported in [132], where an adaptive image fusion algorithm was implemented to aid helicopter pilots. The real-time requirement of processing 256 × 256 images at 25 fps for image registration and a three-level pyramid decomposition was met using a hybrid hardware and software approach.

The system consisted of two cameras, each connected to its own Datacube MaxPCI vision accelerator for preprocessing, a 96-MB buffer for storing images from the frame grabber, and a separate accelerator card for image registration.

As revealed by these examples, standard desktop PCs equipped with frame grabbers can be used to solve real-time image/video processing problems. Due to their large size and high power consumption, however, such systems are usually used in industrial inspection settings or in applications where size and power consumption are not critical design issues.

3.5 REVOLUTIONARY TECHNOLOGIES

Around the late 1990s, a fundamentally different approach to real-time image/video processing systems was being developed, namely the idea of fusing the image sensor with the necessary circuitry required for image processing. This was made possible through Complementary Metal-Oxide Semiconductor (CMOS) imaging technology that allows image processing circuitry to be placed on the same die as the image sensor. One of the recent developments along this line is the SIMD pixel (SIMPil) processor [54], which is considered to be a portable multimedia supercomputer, combining the high-performance requirement of multimedia applications with the low power consumption demanded by embedded devices.

The SIMPil processor was used to implement the image processing pipeline found in digital cameras. The simulation results showed that the processing for the entire pipeline for a 1-megapixel Bayer pattern image could be executed in 1 ms on a 500-MHz SIMPil array processor, requiring only 2.8 W of power. In addition, the utilized SIMPil configuration had an estimated peak throughput on the order of 1.5 tera operations per second. Given that current digital cameras use a simplified hardwired image pipeline or a preview engine operating at a lower resolution than the full sensor resolution to allow real-time preview on an LCD, the SIMPil processor could easily do away with the need for a preview engine and thus allow higher resolution LCD previews. With such on-sensor image processing, the need for large image memory buffers is eliminated, leading to a lower system cost. Of course, such a chip would need to be paired with a GPP for high-level operations and system-level control in order to form a complete system.

Technologies like the above impart a radical change in design and performance from current technologies, possessing the capability to usher in a new age for achieving real-time image/video processing.

3.6 SUMMARY

In this chapter, many topics were covered, including key architectural features such as SIMD and VLIW, an overview of DSP, FPGA, multiprocessor SoC, GPP, and GPU platforms, representative example systems from the literature, and future technologies.

It should be noted that each real-time image/video processing application has its own unique needs and requirements including speed, memory bandwidth, power consumption, cost, size, development tools, etc. [95]. Thus, to go from research to reality, it is important to first understand the needs of the system of interest and then pair them up with the appropriate technologies.