CHAPTER 4
Software Methods for Real-Time Image and Video Processing
4.1 INTRODUCTION
Software methods make up another aspect of transitioning an image/video processing algorithm from a research development environment to a real-time environment running on a target hardware platform. More often than not, an algorithm that generates an acceptable performance within a development environment will not generate the same acceptable performance on the target hardware platform without modifications. This is especially true for portable embedded devices, which are resource limited with generally lower clock rates, smaller amounts of memory, and lower power consumption than, say, a modern desktop PC with a high-performance general-purpose processor (GPP). Because of such limitations, the algorithm must be carefully and properly modified to fit the structure of the underlying hardware, making efficient and effective use of available resources. Items of interest include software architecture design, memory management, and software optimization. While Chapters 2 and 3, respectively, addressed the algorithmic and hardware platform options for implementing real-time image/video processing algorithms, this chapter covers the equally important software side of their real-time implementations. Here, software means the interface between algorithms and hardware. A crucial aspect of having a real-time image/video processing system is
the development of efficient software that maximizes the resources associated with the available hardware.
4.2 ELEMENTS OF SOFTWARE PLATFORM
Just as there are many components that make up a hardware platform for real-time image/video processing, there are also many components that make up a software platform. Programming languages, software architecture design principles, and real-time operating systems are some of the key components, which are discussed next.
4.2.1 Programming Languages
For designing image/video processing systems, two types of programming languages are employed, those used for rapid prototyping and development and those used for actual deployment in a stand-alone product. One of the difficulties of software design for real-time image/video processing systems lies in transitioning a source code from the programming language used for development, such as MATLAB® or LabVIEW™, to a source code used for deployment, such as general-purpose C/C++. This section briefly covers the real-time implementation aspects of different programming languages.
4.2.1.1 Research Environment Programming Languages
The programming languages and programming styles used for production-level real-time image/video processing systems are usually quite different from those used merely for research and development during the prototyping design phase. The most common development languages used in the research phase include some combination of MATLAB, LabVIEW, or C/C++.
MATLAB is an interpreted, high-level programming language as opposed to a compiled one. This means that in a MATLAB “.m” source file, each instruction, be it a command or a variable assignment, has to be interpreted each time it is encountered during the execution of the source code. Another programming-related feature of MATLAB is that it is not a strongly typed language, meaning that variables can be declared without any data-type specification since such specifications are interpreted during run-time. Of course, the main attribute of MATLAB lies in its easily accessible matrix–vector processing capabilities as it can handle linear algebra operations on matrix data structures without explicitly using loop constructs.
MATLAB can even be outfitted with many function libraries for virtually any type of processing via various toolboxes. Some of the relevant toolboxes include the Image Processing Toolbox, Signal Processing Toolbox, and Fixed-Point Toolbox.
MATLAB is mostly a text-based programming environment using scripts and functions written in “.m” files, but with the Simulink® add-on, it can be turned into a model-based programming environment for graphical-oriented, block-based programming. Simulink, combined with the Image and Video Processing Blockset, further eases the development of an entire image or video processing system as compared to traditional text-based programming by promoting hierarchical, modular design via functional blocks that pass data between them through wires. Simulink also has built-in support for fixed-point data-types, which allows one to explore the numerical aspects of an entire system. Simulink can also be used for hardware-in-the-loop type development, where the models can be compiled to run on a target platform and data passed from MATLAB to the target for assessing the real-time capabilities of an algorithm and for fast, on-the-fly tuning of variables and parameters. All the aforementioned features combine to make MATLAB a powerful and flexible programming environment, providing rapid prototyping of image or video processing systems and easing their analysis, including numerical issues. While graphical modeling of a system design is quite useful for quick prototyping, the constituent algorithms ultimately have to be converted back to text-based source code for actual deployment. Thus, algorithms can be coded in “.m” files, but the process does not end there since MATLAB scripts are not, in general, suitable for real-time deployment.
Since MATLAB provides a rich text-based programming environment, it has the advantage that algorithms can be rapidly coded and their functionality verified. However, the features that enable the rapid development of algorithms also hinder their real-time implementation.
The “interpretation of each instruction on-the-fly” characteristic of MATLAB in fact leads to a slow execution of “.m” source files, especially in loops where, even though an instruction has already been interpreted in one iteration, it must be reinterpreted on every subsequent iteration. Since many image/video processing algorithms involve multiple nested loops, the overhead of interpreting the source makes MATLAB unsuitable for real-time implementation. Although eliminating loops by modifying a MATLAB source code to take advantage of its built-in vector processing capabilities would help to speed up simulations, such modifications might pose difficulties when directly porting the source code over to other general-purpose languages such as C/C++. Other strategies for speeding up MATLAB-based simulations using “.m” files would be to either compile the source code using the MATLAB compiler or to code time critical portions in C and link them to MATLAB using its MEX functionalities.
When interfacing MATLAB with C, it is important to note that MATLAB accesses image data in column-major ordering as opposed to the row-major ordering commonly used in stand-alone C-based image processing routines. Hence, any C code developed to interface with MATLAB must carefully follow this data access convention. This inconsistency complicates interfacing C code with MATLAB.
It should be noted that such modifications are only meant to speed up simulations within the MATLAB environment and are not meant for real-time implementation. Also, since MATLAB is not strongly typed, transitioning from a MATLAB source to a general-purpose language for real-time deployment requires careful consideration of data-types. Using the Fixed-Point Toolbox can help ease the transition from a floating-point design to a fixed-point design, but despite these features, the path from MATLAB “.m” source files to standard general-purpose languages such as C/C++ is still quite involved considering that efficient MATLAB vectorized code usually has to be unvectorized or C-callable signal/image processing libraries have to be employed. It should be noted that tools have been developed to ease the transition from MATLAB to C, which could essentially automate the translation of MATLAB “.m” source files to C source files [157].
Unlike MATLAB, which was originally a text-only programming environment until Simulink was developed, LabVIEW was designed as a graphical-oriented programming environment from the start. LabVIEW provides powerful block-based system development. Combined with toolkits such as the IMAQ Vision and Advanced Signal Processing, LabVIEW can be used for rapid prototyping of a complete image/video processing system. The power of LabVIEW comes from its graphical-based programming environment that allows hierarchical and modular system design through so-called virtual instruments (VIs) and sub-VIs. LabVIEW can also be interfaced to use image processing algorithms written in general-purpose high-level languages such as C/C++ using the standard row-major array access convention. This feature helps ease the transition to a text-based source code for real-time deployment. The advantage of LabVIEW lies in the ease with which a graphical user interface can be used to adjust parameters in a simulation of an image or video processing system and to visualize intermediate and final results. LabVIEW can also be used for real-time assessment of an algorithm through hardware-in-the-loop type of development. Thus, similar to Simulink, LabVIEW can also be used to gain an understanding of the algorithms forming a complete system, but it is not generally meant for real-time use in a stand-alone product.
It should be noted that in some cases, C/C++ can be used for prototyping image and video processing algorithms, although it is usually used in combination with MATLAB, LabVIEW, or an image processing library to ease development. In such cases, the data-types used for variable declarations are usually not appropriate for real-time use; in many situations, floating-point data-types must be changed to appropriate integer data-types. While coding only in C/C++ from the start is beneficial from the standpoint of not having to translate the source code from MATLAB or LabVIEW to a general-purpose, high-level language, it hinders data visualization in that data almost always have to be exported and visualized by another program. Another problem with coding in C/C++ from the start is foregoing the benefits of programming environments, such as MATLAB and LabVIEW, which can help to rapidly develop an algorithm.
Therefore, in practice, it is appropriate to utilize whatever combination of MATLAB, LabVIEW, and C/C++ is suitable for rapid development of an image or video processing system. It must be stressed that while MATLAB and LabVIEW are great tools for research and development, they are not meant to be used for real-time deployment in a stand-alone product. As such, it is necessary to translate a source code developed in MATLAB or LabVIEW into a standard, general-purpose, high-level programming language.
4.2.1.2 Real-Time Programming Languages
Depending on the underlying hardware platform, three types of programming languages can be employed for real-time image/video processing algorithms. They include general-purpose, high-level programming languages, hardware-description languages, and low-level assembly languages.
By far, the most widely used language for implementing real-time image/video processing algorithms on existing processors is the high-level C/C++ programming language. C is popular simply because almost every processor has a compiler that can compile C source code into native machine code; that is to say, C is portable across a wide variety of machines. Due to this portability, many signal and image processing libraries exist that are optimized for different target platforms, easing the development of image/video processing algorithms. One general-purpose library, not optimized for any particular system, is the library provided in Numerical Recipes in C [123], a standard reference that can be useful in porting MATLAB codes to C, especially for matrix-based computations [111]. C also has an advantage in its bit-level manipulation operations and its ability to handle standard arrays, or arrays of objects in the case of C++, all of which are beneficial to developing efficient image/video processing algorithms [1]. That is why, after the prototyping stage in MATLAB and/or LabVIEW, it is usually recommended to code an algorithm in C as a reference or baseline implementation, which is then ported and specialized to a specific target platform.
Of course, if the hardware platform of interest is an FPGA, many hardware description languages (HDLs) exist for programming FPGAs, the most popular being VHDL. However, VHDL programming is often arcane to image/video processing algorithm developers, requiring a fundamentally different style of programming. New tools have been developed through which one can successfully go from a system designed in MATLAB straight to an FPGA implementation [156]. Such tools have the potential to extend the benefits of an FPGA-based solution within a faster development cycle.
While high-level software optimization techniques can be applied to the time critical or bottleneck portions of a C/C++ code to extract extra performance, if such optimizations yield unsatisfactory results, the low-level assembly language of the target processor must be used to obtain maximum performance. Proper use of assembly language requires an in-depth knowledge of the architecture of the processor, and is often time consuming to perform. Consequently, it is recommended to use assembly language only when absolutely necessary and only on those portions of the code that are the most time critical.
4.2.2 Software Architecture Design
While translating a source code from a research development environment to a real-time environment is an involved task, it is beneficial if the entire software system is well thought out ahead of time. Considering that real-time image/video processing systems usually consist of thousands of lines of code, proper design principles should be practiced from the start in order to ensure maintainability, extensibility, and flexibility in response to changes in the hardware or the algorithm [127]. Without a proper underlying structure, the entire system ends up becoming an unmanageable collection of source codes. Thus, it is critical to utilize good software engineering practices when developing a software-based real-time image/video processing system on programmable processors.
One key method of dealing with this problem is to make the software design modular from the start, which involves abstracting out algorithmic details and creating standard interfaces or application programming interfaces (APIs) to provide easy switching among different specific implementations of an algorithm [96]. Also beneficial is to create a hierarchical, layered architecture where standard interfaces exist between the upper layers and the hardware layer to allow ease in switching out different types of hardware so that if a hardware component is changed, only minor modifications to the upper layers will be needed.
Recently, there has been interest in applying the principles of object-oriented design patterns to aid in the development of real-time image/video processing systems. These methods help to promote software reuse and improve the ease with which new functionalities can be added to a system with minimal effort. Noting that research on such methods is still being performed, a proper design is expected to create a more efficient, compact, and easy to understand software architecture while not adversely affecting the performance of the system.
4.2.3 Real-time Operating System
In a real-time image/video processing system, certain tasks or procedures have strict real-time deadlines, while other tasks have firm or soft real-time deadlines. In order to be able to manage the deadlines and ensure a smoothly running system, it is useful to utilize a real-time operating system. Real-time operating systems allow the assignment of different levels of priorities to different tasks. With such an assignment capability, it becomes possible to assign higher priorities to hard real-time deadline tasks and lower priorities to other firm or soft real-time tasks. For portable embedded devices such as digital cameras, a real-time operating system can be used to free the upper layer application from managing the timing and scheduling of tasks, and handling file input/output operations [77]. Therefore, a real-time operating system is an important key component of the software of any practical real-time image/video processing system since it can be used to guarantee meeting real-time deadlines and thus ensure deterministic behavior to a certain extent.
4.3 MEMORY MANAGEMENT
It is widely recognized by hardware designers that yearly increases in memory performance lag behind those in computing performance. Since this trend shows no signs of stopping in the near future, it becomes important to carefully consider the management of memory resources in a real-time image/video processing system, especially when a vast amount of data must be dealt with. Due to the overwhelming importance of proper memory management, this section covers key concepts such as the basics of computer memory architecture, how image data is stored in memory, and several memory management optimization strategies.
4.3.1 Memory Performance Gap
Due to the ever-increasing gap between computation performance and memory performance, memory optimizations are becoming more critical than computation optimizations, considering that most algorithms when first ported are more memory-limited than compute-limited [26].
Important items of interest include methods to reduce cache-miss rates, to fetch into on-chip memory only that portion of the entire image data that needs to be processed, to partition data, and so on. Of course, another tool used for efficient movement of data across a system without CPU intervention is the direct memory access (DMA) feature found in modern processor architectures.
4.3.2 Memory Hierarchy
Most modern hardware architectures are outfitted with a hierarchical memory in which each level is placed at an increasing distance from the processor. As the distance from the processor increases, the memory size increases while the memory access speed decreases. The structure of a memory hierarchy is designed to provide maximum memory access performance, or throughput, for frequently used program or data sections. Usually the hierarchy involves two cache levels, known as L1 cache and L2 cache, followed by external memories of varying sizes and speeds. The L1 cache is placed closest to the CPU core and is usually smaller than the L2 cache, which is the next closest memory level to the CPU. Some architectures place these two levels directly on the processor, while in other architectures caches are physically separated from the processor. For instance, in some digital signal processors (DSPs), e.g., the TMS320C64x, there are separate L1 program and data caches and an L2 cache. The memory available on-chip normally provides the fastest access among various types of memories. As far as external memory is concerned, there are mainly two types that are used for real-time embedded systems: SRAMs and SDRAMs, where SRAMs provide lower memory access latencies and are thus “faster” than SDRAMs.
In image/video processing applications, it is beneficial to place the image being operated on within the on-chip memory to enable the processor to quickly access the necessary data with minimum latencies, reducing the overhead of external memory accesses. Since it is often the case that an entire image or video frame cannot fit within the available on-chip memory, the processing has to be reorganized or restructured to enable an efficient implementation on the target hardware. These issues are covered in Subsection 4.3.5 on memory optimization strategies.
4.3.3 Organization of Image Data in Memory
Considering that the handling of image data is one of the main difficulties for real-time implementations, it is important to understand how image data is represented in C/C++ and some commonly used methods for accessing it. In C/C++, image data is stored in row-major format, which simply means that it is stored as an array of rows; this is different from MATLAB, which stores image data in column-major format as an array of columns.
Also, C/C++ uses zero-based indexing and square brackets to access array data, as opposed to MATLAB, which uses one-based indexing and parentheses. Due to performance issues, images are usually not declared or accessed as two-dimensional (2D) arrays, but are declared and accessed using pointers that point to an address in memory where image data is stored in row-major format. Since a pointer points to image data in memory, proper pointer dereferencing must be performed to access an image. Hence, the pixels of an M × N image Img2D should be accessed as Img2D[N∗i + j] instead of Img2D[i][j]. This principle can be extended to higher dimensional data as well. For instance, elements of an M × N × L three-dimensional (3D) image Img3D can be accessed as Img3D[N∗L∗i + L∗j + k] instead of Img3D[i][j][k].
4.3.4 Spatial Locality and Cache Hits/Misses
One restriction of the row-major storage format is the lack of 2D spatial locality, meaning that pixels that are spatially close to each other in 2D are not necessarily stored close to each other in memory [19, 26]. For instance, the horizontally adjacent pixels Img2D[i][j] and Img2D[i][j + 1] are separated by only a few bytes in memory, whereas the vertically adjacent pixels Img2D[i][j] and Img2D[i + 1][j] are separated by an entire row of pixel data [19]. Consequently, the row-major storage of image data favors horizontal or row-wise memory accesses over vertical or column-wise memory accesses. As a result, there is usually no performance degradation when accessing image data along rows. On the other hand, accessing image data along columns poses serious performance degradations, creating ample opportunities for frequent cache misses, especially if image data reside in external memory.
A cache miss occurs when the processor attempts a memory access and the desired data is not located in the cache, forcing the processor to obtain the data from the slower, external memory. As mentioned earlier, image data being processed should be placed in the cache for increased performance, since cache misses pose barriers to real-time implementations. Because many low-level and intermediate-level operations require access to neighboring pixels, column-wise accesses can be a grave source of performance loss. Mechanisms should be put into place to increase the number of cache hits over cache misses, where a cache hit occurs when the processor attempts a memory access and the desired data is already located within the cache.
4.3.5 Memory Optimization Strategies
While memory management optimization strategies could be regarded as software optimization strategies, a distinction is made here between the two because of the overwhelming importance of memory performance bottlenecks as opposed to computation bottlenecks. Memory optimizations are meant to alleviate memory performance bottlenecks, while software optimizations are meant to alleviate computation bottlenecks.
4.3.5.1 Internal Memory Versus External Memory
As mentioned previously, due to the faster access times afforded by on-chip memories, it is best to place frequently used items within internal memory to overcome the overhead of external memory accesses. Although it is desired to place an entire image into internal memory, it is often the case that the entire image does not fit into the on-chip memory. In such cases, it would be detrimental to just leave the image in external memory. An important strategy to deal with this issue involves allocating a buffer section in the available internal memory, partitioning the image data into blocks the size of the allocated buffer, and performing processing on the smaller data blocks. Some important image data partitioning schemes include row-stripe partitioning and block partitioning. The most commonly used scheme is row-stripe partitioning, where a few lines or rows of image data are prefetched to a buffer within the on-chip memory to enable faster data accesses. Fetching a few lines to internal memory before any processing commences also has the benefit of reducing cache misses for operations that require 2D spatial locality, since vertically adjacent pixels would now be located in the cache. Another partitioning scheme is to divide an image into either overlapping or nonoverlapping blocks, depending on the type of processing being performed.
In addition to placing image data in internal memory, other frequently used items should also be placed in internal memory [70]. Since many embedded processors have internal program and data on-chip memories, critical portions of the code and other frequently used data items such as tables should also be considered for inclusion into on-chip memory as space permits.
The benefits of on-chip memory over that of external memory cannot be stressed enough as efficient use and handling of image data and program code portions within on-chip memory is often critical to achieving a real-time performance.
4.3.5.2 Efficient Movement of Data
While making efficient use of available internal memory for storing image data is important for obtaining real-time performance, using precious CPU resources to perform the movement of data, for instance using the memcpy function, is not recommended [113, 114]. A key peripheral available in most modern processor architectures is the DMA controller, which can manage the movement of data without CPU assistance, leaving it free to focus on time critical computations rather than becoming engaged in data management. A DMA controller can usually manage multiple DMA channels simultaneously so that multiple data transfers can occur at the same time.
With the availability of DMA, efficient multibuffering strategies have been developed that allow concurrent processing and movement of data. As the name implies, multibuffering strategies make use of multiple buffers usually placed within on-chip memory to allow performing concurrent processing and movement of data. Depending on the type of processing being performed, usually three buffers are employed including buffer 1 and buffer 2 operating in the so-called “ping-pong” manner and buffer 3 operating as an output buffer. The scheme usually takes the form where a DMA channel is used to store a block of data in buffer 1, while processing proceeds on data in buffer 2 and results are placed in buffer 3. After processing on buffer 2 has been completed, the results in buffer 3 are sent out to external memory through a DMA channel, while processing proceeds with data in buffer 1 and another DMA channel is used to bring in another block of image data into buffer 2. Many variations of this scheme have been used, and some of them are detailed further in the examples section. An important and often overlooked issue regarding memory accesses is the alignment of data. DMA transfers can benefit from proper data alignment by maximizing the data bus throughput [91, 99].
4.3.5.3 Increasing Memory Access Efficiency via Spatial and Temporal Locality
Another method of reducing slow external memory accesses is to move from an image-based processing scheme to a pixel-based processing scheme when multiple operations have to be performed on image data and there are no data dependencies between the operations [16].
An image-based processing scheme involves applying one operation to all the pixels and then applying another operation to all the pixels, etc. This is quite similar to the way MATLAB performs processing on images. A pixel-based processing scheme on the other hand is one that applies all the operations to one pixel, and the same is repeated for all the pixels.
The problem with an image-based processing scheme is that it does not make an efficient use of the cache memory scheme, since the same pixel would have to be read many times to complete the entire processing. In the pixel-based processing scheme, the pixel is read only once and all the operations are performed while the pixel resides in the internal on-chip memory.
Thus, not only does pixel-based processing improve the spatial and temporal locality of memory accesses, but it also increases the computational intensity of the implementation, a measure commonly used to gauge whether an implementation is memory limited [143]. Computational intensity is defined as the ratio of the number of instructions executed to the number of memory accesses. If many instructions are executed per memory access, the coded routine is said to have a high computational intensity; if only a few instructions are executed per memory access, the routine is said to have a low computational intensity. In other words, a low computational intensity means that the coded routine is memory inefficient. Therefore, since more operations are performed per memory access in a pixel-based processing scheme, the use of such a scheme is beneficial whenever it is applicable.
4.3.5.4 Other Memory Optimization Methods
There are many other strategies that could be employed to achieve an efficient use of memory, and most of these indeed depend on the application, the available resources, and the critical bottlenecks. Various steps, such as using global variables instead of local variables to reduce the size of the stack, allocating enough memory to the heap for dynamic allocation, and taking advantage of packed data transfers to maximize memory bandwidth, should be considered when transitioning to a real-time implementation.
4.4 SOFTWARE OPTIMIZATION
After having ported a code to standard C as a reference for implementation on the target hardware, and after applying/exploring compiler-level optimizations, software profiling should be performed to see which portions of the code pose significant bottlenecks to meeting the real-time requirements. Once the bottlenecks are determined, steps need to be taken to optimize those portions of the code in order to bring the execution time within an acceptable real-time range. Use of image/video processing libraries, fixed-point arithmetic, software pipelining, and subword parallelism are popular techniques that are mentioned here.
4.4.1 Profiling
There is an established method of performing software optimizations to improve the efficiency of a source code; one should not proceed blindly in applying software optimization techniques but should be guided by accepted practices. The very first stage in the software optimization process is to profile the code, that is to say, to gather the expended processor clock cycles or actual execution times of every function and subfunction in the code. Most modern integrated development environments include the capability to profile code. Profiling reveals the portions of the code that are the most time consuming. The goal of software optimization is to apply code transformations to those time critical portions in order to extract maximum performance from the underlying hardware architecture. In image/video processing, such time critical portions mostly involve nested loop statements. After applying a single transformation, the code should first be verified for functional accuracy, and then profiled once again.
If after profiling, the code still does not achieve a satisfactory performance, the process needs to be repeated again; otherwise, the software optimization phase should be stopped, as there would be no benefit in seeking a faster performance than what is actually necessary.
4.4.2 Compiler Optimization Levels
Modern compilers are usually equipped with automatic software optimizers, which gather information from the code and attempt transformations to make it more efficient. Often there are several levels of optimization, each targeting a different aspect of optimization, with the highest level producing the most efficient code. After applying compiler-level optimizations, the code should be profiled to see if any gains in performance were achieved. In some cases, applying compiler optimization levels can actually degrade the performance of some functions. Thus, if possible, compiler optimizations should be applied only to those functions that experience an increase in performance. If the performance after applying compiler optimization levels is satisfactory, then no further optimization is necessary, bearing in mind that this is usually not the case. Normally, it is best to proceed with applying the strategies discussed next to the remaining time-critical portions of the code. An important item to note is the use of the volatile keyword to make sure that the compiler does not remove variables it deems useless when they are actually needed for proper functionality. An example of this is a flag variable set within a hardware-interrupt-triggered interrupt service routine [124].
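As an illustration of the volatile issue, consider a flag shared between an interrupt service routine and the main loop. In the sketch below an ordinary C function stands in for a hardware ISR; without the volatile qualifier, an optimizing compiler could legally cache the flag in a register and never observe the update:

```c
#include <stdint.h>

/* Hypothetical frame-complete flag shared between an interrupt service
   routine and the main loop. Declaring it volatile prevents the
   compiler from caching it in a register or deleting the polling test. */
static volatile uint8_t frame_ready = 0;

/* In a real system this would be registered as the hardware ISR for
   the capture device; here it is an ordinary function for illustration. */
void frame_capture_isr(void)
{
    frame_ready = 1;   /* signal the main loop that a new frame arrived */
}

/* Main-loop side: consume the flag once a frame is available.
   Returns 1 if a frame was pending, 0 otherwise. */
int poll_frame(void)
{
    if (frame_ready) {
        frame_ready = 0;   /* acknowledge the frame */
        return 1;
    }
    return 0;
}
```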
4.4.3 Fixed-Point Versus Floating-Point Computations and Numerical Issues
When porting a standard C reference code over to the target architecture, floating-point-based computations are usually kept intact at the first stages of optimization to verify the functionality. More often than not, floating-point computations pose a major performance bottleneck in real-time implementations, especially on the fixed-point processors that are often used in embedded devices. Performing floating-point calculations on fixed-point processors is regarded as a waste of resources since floating-point calculations have to be emulated in software to properly handle the exponent and mantissa parts of the numbers. Considering that heavy computations are usually performed in loops, this performance degradation is magnified by the number of loop iterations. In addition, compiler optimizations usually cannot be performed on loops with floating-point calculations on a fixed-point processor, since the compiler has to insert a call to a library function emulating floating-point calculations, and a compiler will not optimize a loop containing a function call [113, 114]. Therefore, performing floating-point calculations on fixed-point processors is not a good idea.
Fixed-point arithmetic is the preferred format of arithmetic in real-time image/video processing systems, especially in embedded systems. The reason behind the choice of fixed-point over floating-point is that fixed-point calculations are usually faster to perform, since they can be accomplished using integers without the overhead imposed by floating-point calculations, which have to take into consideration the mantissa and the exponent of a floating-point number representation. Also, noting that most embedded devices utilize fixed-point processors, it is quite inefficient to emulate floating-point arithmetic on such processors, as this causes a considerable reduction in performance.
The speedup gained when transitioning from floating-point-based computations to fixed-point-based computations can be dramatic. Therefore, a transition to fixed-point arithmetic is essential toward achieving the highest levels of performance. However, in practice this is easier said than done, because the developer has to decide what the appropriate fixed-point format for the calculations should be. One of the most commonly used methods of coding fixed-point representations of numbers is the mixed integer-fractional Q format, Qm.n, denoting an m + n + 1 bit number, where the m + 1 bits to the left of the radix point constitute the 2's complement signed integer portion and the n bits to the right of the radix point constitute the fractional portion [16, 124]. This format has the flexibility of allowing differing levels of accuracy to mitigate the effects of quantization on the numerical calculations. As a result, it becomes important to determine the necessary level of accuracy for each calculation and thus to use an appropriate fixed-point representation.
Numerical issues such as these and others, including quantization effects, rounding, truncation, overflow, proper scaling, and the order of operations, need to be carefully considered when implementing fixed-point computations. Multiplying two fractional numbers yields another fraction, but adding/subtracting two fractional numbers might generate numbers outside the fractional range, causing overflow, hence the need for proper scaling [81]. The order of operations might also be critical in reducing the propagation of errors throughout a particular computation, for instance, in the calculation of local variances [124]. Also, care must be taken to keep track of the radix point during the calculations. These issues can be rapidly and fully explored within development environments such as MATLAB with its Fixed-Point Toolbox or with Simulink, allowing a rapid determination of the required accuracy level.
It is recommended to find a representation that requires the least number of bits in order to conserve memory and to enable the use of packed data processing [16]. For example, using a 16-bit fixed-point representation instead of a 32-bit one, such as the popular Q0.15 format, a 16-bit pure fractional fixed-point representation, allows representing numbers in the range [−1, 1) [81, 124]. When implemented on a target platform, fixed-point computations are accomplished by first scaling data up by an appropriate power of 2 via left bit-shifting to gain sufficient accuracy, performing the computations, and then scaling back down by the same power of 2 via right bit-shifting. Scaling by powers other than 2 is not time efficient since such scaling would often require a computationally expensive division operation in order to scale the result back to the original level [113, 114].
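As a sketch of these ideas, the following routines implement multiplication in the Q0.15 format just mentioned: the 32-bit product of two Q0.15 operands lands in Q0.30, so a right shift by 15 rescales the result, mirroring the power-of-2 scaling described above. The conversion helpers are illustrative:

```c
#include <stdint.h>

/* Q0.15: 16-bit pure fractional format covering [-1, 1).
   The implied scale factor is 2^15 = 32768. */
#define Q15_ONE 32768

/* Convert between float and Q0.15 (a multiply/divide by the power-of-2
   scale factor; on integer data this would be a plain bit shift). */
int16_t float_to_q15(float x) { return (int16_t)(x * Q15_ONE); }
float   q15_to_float(int16_t q) { return (float)q / Q15_ONE; }

/* Q0.15 multiply: the 32-bit product is in Q0.30, so shift right by 15
   to scale the result back down to Q0.15. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q0.30 intermediate */
    return (int16_t)(p >> 15);             /* rescale to Q0.15 */
}
```

Note that the shift truncates toward negative infinity; adding 1 << 14 before shifting would round to nearest instead, one of the rounding choices discussed above.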
Some other important numerical issues involve performing division operations on fixed-point processors. In [81, 124], several strategies are suggested, including using the repeated subtraction method, using a lookup table if the input range is known a priori, or using the Newton–Raphson algorithm.
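Of the strategies listed, repeated subtraction is the simplest to illustrate. The sketch below computes an integer quotient without a divide instruction; it is only sensible when the quotient is known to be small, and a lookup table or a Newton–Raphson reciprocal iteration is preferable otherwise:

```c
#include <stdint.h>

/* Division by repeated subtraction, one of the strategies suggested
   for fixed-point processors lacking a hardware divider. The divisor
   must be nonzero. Cost grows with the quotient, so this is only
   appropriate when the quotient is known to be small. */
uint32_t div_repeated_sub(uint32_t num, uint32_t den)
{
    uint32_t q = 0;
    while (num >= den) {   /* subtract until the remainder drops below den */
        num -= den;
        q++;
    }
    return q;
}
```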
4.4.4 Optimized Software Libraries
One should not discount the use of any dedicated image and signal processing libraries supplied by the manufacturer of the processor being used. Such libraries have been carefully crafted to extract maximum performance from the processor architecture and thus their use should be considered as a means of improving performance to meet real-time requirements. Of course, one should consult the supplied documentation on the available functions before using them.
Some noteworthy libraries include the Intel Integrated Performance Primitives library for the Pentium line of processors and the Texas Instruments DSPlib signal processing and Imglib image processing libraries. When used properly, these libraries can help one to cut down on the development time by not having to optimize certain basic functions that are used in many applications. While such libraries can be used to obtain a performance gain in certain cases, in other cases it might be advantageous to specialize the implementation to the specific computation needed instead of utilizing a general-purpose function from the library.
4.4.5 Precompute Information
Sometimes certain computations are repeated over and over when in reality they are just constants used within other computations. In such cases, it is beneficial to precompute the constants and to store them in memory as a lookup table, trading computation time for the time to access the constants in memory. Again, frequently used constants should be placed in internal memory if space permits, or else in fast external memory. If the constants are always accessed from memory in a certain order, it is helpful to store them in memory in that particular order to ease memory accesses.
4.4.6 Subroutines Versus In-Line Code
To improve performance, it is recommended to use in-line code instead of subroutines [1]. While subroutines are usually considered good software engineering practice, they incur overhead when called, since arguments and registers have to be pushed onto the stack and popped back off upon return. To avoid such overhead, function calls can be replaced with in-line code. While this will increase the code size, the benefits gained in performance might outweigh the increase in the code size. This is another classic example of the space-time trade-off in software design.
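In C99 and later, the inline keyword can request this substitution without manually duplicating code at every call site. A minimal sketch for a small, frequently called pixel operation:

```c
/* Marking a small, hot function inline asks the compiler to substitute
   its body at each call site, avoiding call/return and stack overhead
   at the cost of code size -- the space-time trade-off noted above. */
static inline int clamp_pixel(int v)
{
    if (v < 0)   return 0;     /* saturate below */
    if (v > 255) return 255;   /* saturate above */
    return v;
}
```

The keyword is only a hint; most compilers also honor forced-inlining extensions when the substitution must be guaranteed.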
4.4.7 Branch Predication
It is well known that control-intensive codes in their original forms are not amenable to parallel processing. One strategy of getting around this limitation is to employ the technique of branch predication that essentially converts control-intensive codes to data-intensive codes [91, 111]. As a result, such codes can be parallelized leading to improved performance.
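At the source level, the same effect can be approximated by converting a data-dependent branch into arithmetic on the comparison result, which the compiler can then map to predicated instructions where the architecture supports them. A simple thresholding sketch:

```c
#include <stdint.h>

/* Branchless thresholding: the per-pixel if/else is replaced by
   arithmetic on a 0/1 comparison result, so the loop body contains no
   data-dependent branch and is amenable to predication and pipelining. */
void threshold_branchless(const uint8_t *src, uint8_t *dst,
                          int n, uint8_t t)
{
    for (int i = 0; i < n; i++) {
        int above = src[i] > t;           /* 1 if above threshold, else 0 */
        dst[i] = (uint8_t)(255 * above);  /* select 255 or 0 without branching */
    }
}
```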
4.4.8 Loop Transformations
Loops are usually the most critical portions of codes. As such, there are many strategies to help one to extract the most performance out of time critical loops both in the high-level and in the low-level (assembly) languages. It is often a good practice to first apply loop transformations in the high-level language before applying assembly optimizations.
The first step toward applying loop transformations is to carefully examine a loop to see if there are any unnecessary calculations within the loop itself. If such calculations are found, they should be removed or hoisted out of the loop. Common subexpressions, for example, should usually be evaluated only once instead of multiple times, and hence serve as candidates for removal.
Another important aspect of a loop is to avoid function calls within it; all function calls within a loop should be replaced with in-line code. As mentioned before, function calls within a loop usually disqualify the loop from being examined for optimization by the compiler optimizer. Similarly, any floating-point operations within the loop should be removed when using a fixed-point processor, since the compiler inserts software-emulation function calls to support floating-point computations.
It should be noted that this does not apply to intrinsic functions, which are C-callable functions supplied by the manufacturer that map directly to assembly-level instructions of the target processor.
One of the most effective and commonly used high-level loop transformations is loop unrolling, which can be used to increase the instruction-level parallelism (ILP) of a loop. Loop unrolling performs multiple iterations in one pass, grouping more computations together for simultaneous access to more data, thus reducing the loop overhead due to branching and helping the compiler to create a more efficient schedule for the main loop body. Of course, unrolling a loop increases the code size, but it can also lead to more efficient code. This is another classic example of the space-time trade-off in software design.
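A minimal sketch of unrolling by a factor of four is given below. The four partial sums are independent, giving the compiler more freedom to schedule them in parallel, and a cleanup loop handles lengths that are not multiples of four:

```c
#include <stdint.h>

/* Sum of pixel values with the loop unrolled by a factor of four:
   four iterations per pass reduce branch overhead, and the four
   independent accumulators expose ILP to the scheduler. */
uint32_t sum_pixels_unrolled(const uint8_t *p, int n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* main unrolled body */
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++)             /* cleanup for the remainder */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}
```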
Software pipelining is a code optimization technique found useful for reducing the execution time of critical loops in an image/video processing algorithm. Essentially, this technique utilizes the ILP that is buried within the loops. As previously stated, a simple high-level form of increasing ILP is to perform loop unrolling, relying on the compiler to automatically schedule a software pipeline. If after profiling, satisfactory results are not obtained, then the pipeline must be hand scheduled based on the assembly language of the target hardware. This is a truly tedious task that requires analyzing a data-dependency graph and properly allocating processor resources while making sure no conflicts or data dependencies exist in instructions scheduled to run in parallel. Despite the tediousness of the task, the payoff of hand-scheduled pipelined assembly is to obtain the highest performance. There are some automatic tools available that can help relieve the burden of hand-coded software pipelining. It is recommended to use such tools if the only way to obtain the necessary level of performance is through a hand-scheduled software pipeline. Another option to consider before undertaking hand-scheduled software pipelining is to utilize compiler intrinsics that can be used to direct the compiler optimizer to produce a more efficient scheduling of a loop.
4.4.9 Packed Data Processing
Most modern architectures incorporate some form of packed data processing functionality to enhance the performance of processing data that are smaller than the width of the data path. Recall that pixel data are usually represented with anywhere between 8 and 14 bits, depending on the accuracy required by the application. Since most modern data paths are 32 bits wide, it is possible to process four 8-bit or two 16-bit pixels simultaneously, processing more image data in one clock cycle. Such operations can be used to speed up the matrix-vector computations frequently encountered in image/video processing algorithms. In order to use this packed data processing feature, the code usually has to be reorganized, similar to code vectorization [91]. It is important to exploit this feature to the fullest in order to process the maximum amount of image data in one clock cycle, and it can be combined with software pipelining for even greater gains in performance. It is also important to make sure data are properly aligned so that memory loads and stores operate with maximum efficiency. In cases where it is necessary to access unaligned data, special instructions can be utilized, if available, to reduce the overhead involved in loading two adjacent data words and extracting the desired data via shifting and bit-masking operations [99].
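Even without dedicated packed instructions, the idea can be emulated in plain C ("SIMD within a register"). The sketch below averages four 8-bit pixels packed into each 32-bit word using a classic carry-free identity; on processors with native packed instructions, an intrinsic would be used instead:

```c
#include <stdint.h>

/* Packed average of four 8-bit pixels per 32-bit word:
   floor((a + b) / 2) per byte is computed as
   (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7F),
   where the mask discards bits shifted across byte boundaries, so no
   carries propagate between the packed pixels. One 32-bit operation
   thus processes four pixels at once. */
uint32_t avg4_packed(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}
```

This is useful, for example, in image blending and halving, where whole rows of pixel pairs are averaged.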
4.5 EXAMPLES OF SOFTWARE METHODS
In what follows, several examples are presented to illustrate the deployment of the above software optimization tools for achieving real-time throughputs.
4.5.1 Software Design
Proper software design is essential for developing and maintaining real-time image/video processing systems. Several representative examples covering the design of software architectures and the use of object-oriented design principles are mentioned next.
4.5.1.1 Software Architectures
For real-time image/video processing systems, software architectures are just as important as hardware architectures. Modular and layered approaches have been used extensively in designing software architectures for real-time image/video processing systems, as such approaches result in software that is easy to maintain and whose functionality is easy to extend. Here, three examples are mentioned to illustrate the use of modular and layered design and real-time operating system concepts.
One example incorporating both modular and layered approaches appears in [77], where a software architecture for an embedded multifunction digital camera system was discussed. This approach to software architecture design involved extracting application-specific features and device-dependent controls from the functional operation modules via a well-defined API and device driver interface (DDI). The architecture consisted of three layers, namely an application
layer, a functional layer, and a system layer, as well as two interfaces, the API and DDI. The application layer included a graphical user interface (GUI), while the functional layer implemented the supported camera functions. The system layer provided hardware abstraction, allowing reuse of all upper-layer code upon changes to hardware components. The system was managed using a real-time operating system that allowed the separation of upper-level functionality from timing schedule details and file-allocation processes. As a result, this modular and layered approach allowed reusability and extended the programmability and flexibility of the software platform.
Another example of a layered approach can be seen in [13], where the software architecture of a system used for the automatic exposure control of a charge-coupled device camera was discussed. A layered approach to software architecture with five hierarchical layers was designed to allow a better isolation of different layers for development and validation purposes. The heart of the design revolved around the real-time layer that was constructed using RT-Linux, the hard real-time extension to the popular general-purpose Linux operating system. The use of this real-time operating system in the software architecture provided the ability to prioritize the hard real-time task of updating the exposure.
Another important aspect of software architecture design for real-time image/video processing systems is in the proper scheduling of tasks with different levels of priorities by utilizing a real-time operating system. The concept of multithreaded implementation has been gaining popularity in this regard. For example, in [18], a multithreaded software architecture was discussed that allowed a medical image enhancement procedure to be implemented in real-time. In terms of thread organization, an initial thread was responsible for the GUI and administrative control of the entire application. A frame acquisition/communication thread was used to handle the frame-grabber hardware and was given a high priority. Two threads for diagnostic data analysis and real-time enhancement also ran separately. In addition, a thread for storing data to the hard disk ran with a lower priority. The administrator thread managed the interaction among the different threads and had the ability to adjust priorities or halt threads if necessary. The use of such a software architecture was beneficial to maintain a real-time response, and to be reactive to the inputs by the user.
4.5.1.2 Object-Oriented Design Patterns
As previously mentioned, there has been an increased interest in the application of object-oriented design patterns in real-time image/video processing systems. One research group has been primarily involved in promoting the use of object-oriented design patterns for such systems, as discussed in [90, 104, 127, 128]. In [90], an object-oriented design for a class of Kalman filters was presented that helped to decrease recoding efforts in adapting the filter implementation to different applications. By using the object-oriented design patterns of the so-called "Gang of Four," a software implementation of the Kalman filter was achieved that was amenable to future extensions without much recoding effort. These design patterns made it easy to abstract away the details of the filter implementation, allowing a truly flexible software implementation supporting different numbers of inputs, different noise models, different filter implementations, and pixel- or blockwise processing modes. Not only can object-oriented design patterns be used to limit recoding efforts in deploying differing versions of the same algorithm, they can also be used to limit the recoding efforts for changes in hardware. In [104], it was discussed how object-oriented design patterns could be applied to a real-time image processing problem, enabling and promoting software architecture
reuse in the face of both algorithm and hardware alterations. Since real-time image/video processing systems consist of many lines of code, it would be beneficial to not have to redevelop the software architecture from scratch every time for a new design. Object-oriented design patterns can be used to help cut down on the development time by allowing the reuse of already tested and developed software for similar image/video processing problems.
Originally, it was thought that using such design patterns might have a detrimental effect on the real-time performance of a system due to the several layers of abstraction involved in creating reusable software systems. In [127], it was shown that this was not the case. A study was performed which showed that applying object-oriented design patterns could produce well-designed and efficient image processing systems with easy extensibility, maintainability, and reuse without sacrificing real-time performance. Due to these findings, in [128], it was argued that real-time image processing applications should be written using an object-oriented approach to achieve efficient, maintainable, and understandable code from the start. Such a design approach was shown to be superior since it can eliminate inefficiencies in software that are common in other design approaches.
4.5.2 Memory Management
4.5.2.1 Increasing Spatial/Temporal Locality
There have been several examples in the literature on methods to increase the spatial or temporal locality of memory accesses for gaining speedups in processing. Here, representative examples are mentioned that illustrate how restructuring the processing can lead to the efficient use of the memory hierarchy necessary for achieving real-time performance. As mentioned previously, the measure of computational intensity can be used as an indication of memory bottlenecks in a given computation. An example of using this measure can be seen in [143], where the problem of using the discrete wavelet transform (DWT) for lossless compression of medical images was considered. The DWT approach chosen was the standard method based on a quadrature mirror filterbank. It was found that the implemented convolution source code had a low computational intensity, indicating that it was memory inefficient. To overcome the memory bottleneck, the computations were restructured in such a way as to reduce the number of external memory accesses per convolution operation. This had the effect of raising the computational intensity of the convolution source code and thus helped to achieve a more efficient use of the memory hierarchy.
In most cases, the limited size of fast internal memories forces the restructuring of the processing to make an efficient use of the memory hierarchy. For example, in [30], the problem of designing an efficient image processing library within the strict memory constraints of an embedded digital camera platform was considered. The concept of “band processing” was utilized that involved partitioning an image into several smaller-sized data bands, sequentially processing the bands using a pipeline of band-based operators, and then collecting the bands into a single-output band. Such an approach provided a means of efficiently using the limited memory resources of the embedded platform.
Another example in which the processing was restructured for an efficient use of memory resources is reported in [31], where a memory-efficient algorithm was devised for the DWT calculations based on the overlapped block-transferring scheme and reorganization of the calculations. It was shown that the chosen algorithm for computing the wavelet coefficients suffered from the cache-miss problem in the filtering operation. Despite the computational efficiency of the algorithm, the cache-miss problem caused a poor performance during vertical filtering. An overlapped block-transferring method was thus devised to overcome the cache-miss problem. The method involved partitioning the image residing in the external memory into blocks the size of the cache. The processing was then restructured to allow the elimination of the cache-misses during vertical filtering. To overcome the issue of needing adjacent block information during this filtering procedure, horizontally overlapped blocks were considered.
In contrast to the traditional block- and strip-partitioning schemes often used in the reorganization of computations, a new data organization called “super-line” processing was developed in [44, 58, 101] specifically addressing the issue of cache-misses in algorithms that use multiresolution representations of image data. This organization involved dividing an image into partitions called “super-lines,” where each partition contained all the necessary information from each level of the multiresolution representation to allow the application of the algorithm in one pass. In [44], the results showed that after application of the super-line approach, only 0.2% of the memory accesses were to the external memory, which helped to achieve a processing rate of 43.6 fps for 512 × 512 images on a desktop GPP. Also, in [101], the results showed that the super-line approach was able to reduce the miss rate from 99% to 0.8%, achieving a processing rate of 44 fps for 555 × 382 images on a desktop GPP.
4.5.2.2 Efficient Movement of Data
Due to the vast amount of data that needs to be processed in real-time image/video processing systems, efficient movement of data plays an important role in obtaining real-time performance. DMA has been used extensively to allow efficient movement of data, and several buffering schemes have been developed for this purpose. For example, in [99], a data flow management scheme using a programmable DMA controller was discussed. Since the size of the on-chip data cache was limited and not large enough to hold an entire image frame, the images were processed in smaller chunks at a time. An on-chip programmable DMA controller was used to manage the movement of data concurrently with the computations, preventing the processor stalls introduced by waiting for the data to be fetched from the external memory to the on-chip memory. A double buffering scheme was employed involving the use of four buffers, two for input blocks (ping in buffer and pong in buffer) and two for output blocks (ping out buffer and pong out buffer), which were allocated in the on-chip memory. While the core processor was accessing pong out buffer, the DMA controller moved the previously calculated output block in ping out buffer to the external memory and brought the next input block from the external memory into ping in buffer. When the computation and data movements were completed, the processor and DMA controller switched buffers, and the processor started to use the ping buffers, while the DMA used the pong buffers.
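The ping-pong idea can be sketched in plain C, with memcpy standing in for the DMA transfers and only the input side double-buffered for brevity (the cited system double-buffered the outputs as well). Buffer names and the block size are illustrative:

```c
#include <stdint.h>
#include <string.h>

#define BLK 64   /* illustrative on-chip block size, in pixels */

/* Simplified ping-pong buffering skeleton: while the processor works
   on one on-chip buffer, a DMA transfer (simulated here by memcpy)
   fills the other, and the two roles are swapped each iteration.
   The per-block computation is a simple pixel inversion. */
void process_frame(const uint8_t *ext_in, uint8_t *ext_out, int nblocks)
{
    uint8_t ping[BLK], pong[BLK];          /* on-chip buffers */
    uint8_t *work = ping, *xfer = pong;

    memcpy(work, ext_in, BLK);             /* prime the first block */
    for (int b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)               /* "DMA": prefetch next block */
            memcpy(xfer, ext_in + (b + 1) * BLK, BLK);
        for (int i = 0; i < BLK; i++)      /* compute on the current block */
            ext_out[b * BLK + i] = (uint8_t)(255 - work[i]);
        uint8_t *t = work; work = xfer; xfer = t;   /* swap ping and pong */
    }
}
```

In a real implementation the memcpy calls would be replaced by asynchronous DMA requests so that the transfer genuinely overlaps the computation.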
Several other examples on the use of DMA for real-time image/video processing applications are reported in [31, 46, 77, 98, 138].
4.5.3 Software Optimization
There are many examples of software optimization techniques encountered in the literature. Next, several representative examples on the use of fixed-point computations, software libraries, and loop transformations are given.
4.5.3.1 Fixed-Point Computations
As most embedded platforms used in real-time image/video processing applications include fixed-point processors, reformulating floating-point computations in fixed-point is an important task for achieving real-time performance. For instance, in [11], it was found that floating-point computations were the major bottleneck toward reaching a real-time performance. The optimization strategy employed was to convert only the most computationally intensive parts of the algorithm over to fixed-point to ease the coding effort while still achieving a significant speedup.
Many researchers have used the Qm.n fixed-point format in transitioning their algorithms from development to implementation. For instance, in [16], the Q9.6 fixed-point format was employed in the algorithm utilized for a real-time video analysis traffic surveillance application.
The range and precision required were first verified in MATLAB before the implementation on the embedded platform. In addition, all the arithmetic operations used were implemented using the chosen fixed-point format. Another example illustrating the use of such a format appeared in [99], where the mapping of two-dimensional convolution on a VLIW media processor was considered. It was pointed out that generalized convolution required the normalization of the outcome via an expensive division operation. To avoid the division, a scaling factor was introduced into the filter kernel coefficients. Each coefficient was thus represented in the Q0.15 fixed-point format.
To support varying levels of dynamic range throughout the course of a multistage computation, a fixed-point format with a varying integer part can be used. An example of such an approach is seen in [143], where in order to deal with the dynamic range changes in the DWT (an increase from scale j − 1 to j in the forward DWT and a decrease from scale j to j – 1 in the inverse DWT), a 32-bit word length was used with a variable integer part. In order to determine the required word length, the propagation of errors from each scale was analyzed to determine the lower and upper bound errors and thus the required word length.
4.5.3.2 Software Libraries
Software libraries are used extensively in real-time image/video processing systems. Such libraries can aid in porting algorithms from a research development environment onto a target hardware platform.
While general-purpose libraries can be used as a first step toward porting, libraries optimized for the target hardware platform can provide an easy means of extracting a high level of performance from the platform without requiring a deep understanding of the underlying hardware architecture. For example, in [132], an algorithm for adaptive image fusion was originally developed in MATLAB using its Image Processing Toolbox. Since MATLAB is an interpreted language, it provided rapid prototyping, but it was not suitable for real-time deployment. Thus, the MATLAB algorithm was ported over to the standard general-purpose C++ programming language. To simplify the porting effort and to help maximize the use of processor resources, the optimized Intel Image Processing Library was used to replace the MATLAB Image Processing Toolbox functions. Another example of the use of optimized libraries is covered in [146], where the optimized assembly code libraries supplied by the chip manufacturer were used to reduce the coding effort and to speed up the performance.
4.5.3.3 Loop Transformations
The most time-consuming portions of image/video processing algorithms often consist of low-level or intermediate-level operations that are implemented as nested loops, requiring multiple iterations to complete the processing for an entire image frame. The main techniques for speeding up these critical core loops include loop unrolling and software pipelining.
For example, in [99], it was noted that certain instructions took multiple clock cycles to complete, which led to latencies. Loop unrolling and software pipelining were used to cope with such latencies. Loop unrolling was able to fill the delay slots due to instruction latencies by taking advantage of single-cycle instructions via unrolling those instructions that accessed multiple data within the loop. Software pipelining was then used to overlap the operations from different iterations of the loop, allowing a parallel execution of loading and computation. Software pipelining required a prolog part to set up the loop kernel and an epilog part for finishing the computations. Loop unrolling and software pipelining were combined to increase the data processing throughput of the algorithm for obtaining a real-time performance.
Because writing a software pipeline code in assembly language is a cumbersome task, tools have been created to help assist in this process. One example of such a tool is mentioned in [49] for VLIW DSPs. In this example, the software pipeline optimization tool (SPOT) was used which combined a graphical schedule editor with an automatic conflict analyzer. The schedule editor provided a clear 2D visualization of the scheduled software pipeline with rows representing the multiple functional units of the VLIW DSP and the columns representing processor clock cycles. The conflict analyzer provided an automatic allocation of the processor registers in addition to an instant feedback on any data dependencies and other coding errors.
This tool was also capable of generating the assembly source code file of the conflict-free scheduled pipeline. The results showed that the tool was able to achieve a speedup by a factor of 20 over the optimized C code. Basically, it provided a fast, simplified, and cost-effective method for including optimized hand-scheduled pipeline code in the workflow of developing a real-time image processing system.
4.6 SUMMARY
This chapter has covered the major topics regarding software methods for achieving real-time performance, including software design, memory management, and software optimization. It should be stressed that before performing any software-level modifications, a thorough examination of the algorithm involved should be carried out to identify key implementation details such as the required dynamic range and input/output data accuracy, the memory requirements, and any potential parallelism inherent in the algorithm. It is also essential to have a working knowledge of the hardware in case assembly language programming needs to be used to extract the maximum performance out of the hardware. It is worth pointing out that many of the software methods mentioned in this chapter are standard methods that have been used extensively to optimize algorithms running on various hardware platforms toward obtaining real-time performance. Also, since the application of these methods is quite straightforward, their careful application will help one to extract the most performance out of a given hardware platform.