Research and Design of Infrared Image Preprocessing System Based on CNN


Abstract: This paper presents a system for digital preprocessing of infrared video images with an FPGA as the core processor. Built on the DE2 development board from Altera, most of the system's functional modules are integrated on a single FPGA, which greatly improves overall system performance. The solution uses Altera's low-cost, high-density Cyclone II family of FPGAs to increase design flexibility. A cellular neural network IP core is developed that makes full use of the advantages of cellular neural networks in image processing and improves the processing efficiency of the whole system. The resulting digital implementation of the cellular neural network is highly efficient, and a distributed algorithm is used to provide higher speed.
Keywords: edge detection; cellular neural network; FPGA

Introduction

The infrared video image digital preprocessing system is a necessary post-processing circuit for the infrared focal plane array (IRFPA) detector and has a great influence on imaging quality. As infrared dim-target detection technology is widely used in guidance, tracking, automatic control, artificial intelligence, and many other fields, these applications place ever higher requirements on infrared imaging quality, so the study of infrared video image digital preprocessing systems is of great significance [1-2]. This paper starts from research on the cellular neural network model and closely combines model and algorithm research with concrete image-processing applications, particularly image edge detection, linking theory with practice. The designed template is applied to image edge detection and, by exploiting the parallelism of the FPGA, a CNN-based infrared image preprocessing system is built for real-time image processing.

The system hardware includes two video A/D converters, a synchronous FIFO data buffer, the FPGA, data storage, a color space conversion module, and other functional modules. The system reads the IRFPA signal correctly and converts the analog video signal into a digital signal through the A/D converter; the data is buffered by the FIFO, enters the memory, undergoes the necessary processing (edge extraction) in the central digital signal processor, and is finally output as a standard VGA analog video signal to the display.

1 Image preprocessing implementation principle analysis

The purpose of infrared image preprocessing is to improve the image data, suppress unwanted distortions, or enhance image features that are important for subsequent processing, providing convenience for later target recognition and tracking. The preprocessing done here is edge extraction, i.e., an operation performed on the image at the lowest level of abstraction: both the input and the output are luminance images of the same type as the raw data captured by the sensor, typically represented by a matrix of image function values. The core processing of the entire system is implemented by the cellular neural network IP core.

The Cellular Neural Network (CNN) is a large-scale nonlinear analog circuit with real-time signal-processing capability, built from locally connected neural processing units. Structurally, a CNN resembles a cellular automaton: each cell is connected only to its neighboring cells and communicates directly with them, while non-adjacent cells, although not directly joined, can still influence one another indirectly through the dynamic propagation of signals over continuous time. In theory a cellular neural network of any dimension can be defined, but since the task here is image processing, only the two-dimensional case is considered. A two-dimensional cellular neural network structure is shown in Figure 1, with $C(i,j)$ denoting the cell in row $i$, column $j$. The equation of state of the cell is

$$\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{C(k,l)\in N_r(i,j)} A(i,j;k,l)\,y_{kl}(t) + \sum_{C(k,l)\in N_r(i,j)} B(i,j;k,l)\,u_{kl} + I,$$

with the output given by the piecewise-linear function

$$y_{ij}(t) = \tfrac{1}{2}\bigl(\,|x_{ij}(t)+1| - |x_{ij}(t)-1|\,\bigr).$$

It can be seen from the above equation that, when the CNN application uses spatially invariant coefficients, the behavior of the entire network is determined by the two template matrices and the cell bias value I. The matrices A and B are referred to as the feedback template and the control template, respectively. The cell structure is shown in Figure 2.
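For a digital realization the continuous dynamics must be stepped in discrete time. A common choice, and the one assumed in this sketch (the paper does not state its discretization), is the forward-Euler form with step $h$, which reduces each cell update to the template multiply-accumulates plus a few additions, i.e., nine MAC operations per 3×3 template:

$$x_{ij}[n+1] = x_{ij}[n] + h\Bigl(-x_{ij}[n] + \sum_{C(k,l)\in N_r(i,j)} A_{kl}\, y_{kl}[n] + \sum_{C(k,l)\in N_r(i,j)} B_{kl}\, u_{kl} + I\Bigr),\qquad y_{ij}[n] = f\bigl(x_{ij}[n]\bigr).$$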

Figure 1 3×3 cellular neural network structure

Figure 2 CNN structure diagram

In image processing, the value of each pixel is discretely quantized, whether the image is grayscale or binary, so using a CNN for image processing raises the question of how to quantize its input and output. In the CNN system the input satisfies $u_{ij} \in [-1, +1]$ and the output satisfies $y_{ij} \in [-1, +1]$. When processing a binary image, the original values must therefore be remapped, and it must be noted that the mapping is inverted: an original 0 maps to +1 (pure black) and an original 1 maps to -1 (pure white). When processing a grayscale image, taking an 8-bit image as an example, the input and output values are first uniformly quantized into 256 levels and then mapped in the same way: an original 0 maps to +1 (pure black), an original 255 maps to -1 (pure white), and the remaining gray values are mapped linearly and monotonically in between, i.e., $u = 1 - 2g/255$ for gray level $g$ [3-5].
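As a concrete illustration of this mapping, the following sketch converts an 8-bit gray level into a signed fixed-point CNN input. The Q1.7 format is an assumption of this sketch, not something the paper specifies:

```verilog
// Minimal sketch: map an 8-bit gray level g in [0,255] onto the CNN input
// range [-1, +1] in Q1.7 fixed point (value = u/128), with the inverted
// polarity used in the paper: 0 (black) -> +0.992, 255 (white) -> -1.0.
module gray2cnn (
    input  wire        [7:0] g,
    output wire signed [7:0] u
);
    // 127 - g spans +127 .. -128, i.e. +0.992 .. -1.0 in Q1.7.
    wire signed [8:0] diff = 9'sd127 - {1'b0, g};
    assign u = diff[7:0];
endmodule
```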

2 Overall scheme selection

The workflow of the system is shown in Figure 3. The digital video signal and its control signals from the CCD first pass through the image acquisition module, which filters out the valid data. The RAW2RGB module then uses an interpolation algorithm to obtain the R, G, and B data of each pixel. To simplify the core processing in the CNN module, the image data is converted from the RGB color space to the YCbCr color space before the edge extraction operation, and only the Y component is processed. The processed data is then converted back to RGB data by the YCbCr2RGB module and supplied to the VGA module for LCD display.

Figure 3 Infrared image preprocessing system workflow

The core of the whole design is the CNN module. The algorithms commonly used for image edge extraction are the classical differential operators, but differential algorithms are difficult to implement in hardware. CNN is applied to gray-image edge extraction here because a CNN is a neural-network parallel processor based on local neuron connections [3]: in hardware, the CNN parallel processor can be built from an array of identical circuit components, and this isomorphic array structure is well suited to VLSI implementation. A particle swarm optimization algorithm is therefore used to train the CNN template for edge extraction.

Although neighborhoods of any size are allowed in a cellular neural network, the difficulty of hardware implementation grows as the template size increases; limited by current VLSI technology, the interconnection between cells can only be local. In this paper a 3×3 neighborhood is used, i.e., templates A and B are both 3×3 matrices with real coefficients. Since most image processing currently targets grayscale images, the input range of the CNN cells is limited to [-1, +1], where -1 represents a white pixel, +1 represents a black pixel, and the remaining values represent the gray levels in between. Fixed-point numbers are used here because they give higher speed and lower cost in hardware implementations, especially when calling the multiply primitives in the FPGA.

The serial hardware implementation of a single cell requires at least 9 clock cycles to complete one cell-state update. To improve speed, a parallel structure can be adopted for the state-update calculation. As shown in Figure 4, with a pipeline structure only one clock cycle is needed per cell-state update. This paper uses such a parallel structure to implement the cellular neural network in the FPGA.

Figure 4 CNN parallel implementation block diagram
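To make the parallel structure concrete, the sketch below performs all nine multiplications of one 3×3 template in parallel and accumulates them in a two-stage pipeline, so one cell update completes per clock once the pipeline is filled. The template values (a classic edge template) and the Q1.7 bias are illustrative assumptions, not the trained template of this paper:

```verilog
// Minimal sketch of the pipelined 3x3 template MAC: nine parallel products
// (stage 1) and an adder tree with bias (stage 2); throughput is one
// cell-state update per clock.
module cnn_mac3x3 (
    input  wire               clk,
    input  wire signed [7:0]  u0, u1, u2,   // 3x3 input window, Q1.7
    input  wire signed [7:0]  u3, u4, u5,
    input  wire signed [7:0]  u6, u7, u8,
    output reg  signed [19:0] acc           // Q-scaled template sum
);
    // Assumed control template B = [-1 -1 -1; -1 8 -1; -1 -1 -1]:
    // coefficients -1 and 8 reduce to negation and a left shift.
    reg signed [11:0] p0, p1, p2, p3, p4, p5, p6, p7, p8;
    always @(posedge clk) begin
        p0 <= -u0; p1 <= -u1;      p2 <= -u2;
        p3 <= -u3; p4 <= u4 <<< 3; p5 <= -u5;
        p6 <= -u6; p7 <= -u7;      p8 <= -u8;
        // Bias I assumed to be -0.5 (Q1.7: -64) for illustration only.
        acc <= p0 + p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 - 20'sd64;
    end
endmodule
```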

3 Hardware design

The image data acquisition module captures the image data. Following the output timing of the MT9M011 image sensor, the module starts receiving data when the video-capture start key is pressed. Because blanking-period data arrives along with the valid pixel data, an output data-valid signal is generated so that the following RAW2RGB module can distinguish valid data from invalid data.

3.1 RAW2RGB module design

The MT9M011 uses a Bayer-pattern CFA (Color Filter Array). Since the resolution of the image sensor is 1280×1024, the interpolation algorithm used here combines every four pixels into one pixel; the change of the pixel values is shown in Figure 5. After passing through the RAW2RGB module, the resolution of the image is thus halved in each dimension, to 640×512.
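A sketch of the quad-merging arithmetic: each 2×2 Bayer quad collapses into one RGB pixel, with the two green samples averaged. The assumed pattern order (G/R on one row, B/G on the next) is an assumption of this sketch, not taken from the paper:

```verilog
// Minimal sketch: one 2x2 Bayer quad -> one RGB pixel (10-bit data).
module bayer_quad2rgb (
    input  wire       clk,
    input  wire [9:0] g1, r1,      // top row of the quad:    G1 R
    input  wire [9:0] b1, g2,      // bottom row of the quad: B  G2
    output reg  [9:0] r, g, b
);
    always @(posedge clk) begin
        r <= r1;                              // the single red sample
        b <= b1;                              // the single blue sample
        g <= ({1'b0, g1} + {1'b0, g2}) >> 1;  // average the two greens
    end
endmodule
```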

The hardware implementation block diagram of this module is shown in Figure 6. The control logic consists of two state machines, ram_wr_state and ram_rd_state. The ram_wr_state state machine generates the write enable and write addresses of the RAMs: when the input data is valid, the input pixels are stored alternately in two RAMs in sequence, forming a structure similar to a ping-pong operation. The ram_rd_state state machine generates the read enable and read addresses of the RAMs.

Figure 5 Schematic diagram of the color interpolation algorithm

Figure 6 Hardware block diagram of the RAW2RGB module
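One plausible arrangement of the ping-pong buffering, reduced to its essentials; the real module's two state machines are condensed into a row-parity toggle here, and the names and the 1280-pixel depth are assumptions based on the sensor line width:

```verilog
// Minimal sketch: two line RAMs written alternately; while one stores the
// incoming row, the other supplies the previous row, so a Bayer row pair
// is available together to the interpolation logic.
module pingpong_linebuf (
    input  wire        clk,
    input  wire        dval,        // input pixel valid
    input  wire        line_end,    // last pixel of the current row
    input  wire [9:0]  din,
    output reg  [9:0]  prev_line,   // pixel from the row above
    output reg  [9:0]  cur_line
);
    reg [9:0]  ram0 [0:1279];
    reg [9:0]  ram1 [0:1279];
    reg [10:0] addr = 0;
    reg        sel  = 0;            // selects the RAM being written

    always @(posedge clk) if (dval) begin
        if (sel) ram1[addr] <= din; else ram0[addr] <= din;
        prev_line <= sel ? ram0[addr] : ram1[addr];  // read the other RAM
        cur_line  <= din;
        if (line_end) begin
            addr <= 0;
            sel  <= ~sel;           // swap roles at each row boundary
        end else
            addr <= addr + 11'd1;
    end
endmodule
```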

Here, to simplify verification of the algorithm's correctness, the digital video stream is suitably reduced; since the design is parameterized, this does not affect the system design. Assume the raw data to be processed has only 12 pixels per line and take two rows of data as an example: after processing, each row of output data (each pixel containing the three color components R, G, and B) contains only 6 pixels, half the original, and the number of rows likewise becomes half the original. Thus, when the resolution of the image to be processed is 1280×1024, the image resolution after the module becomes 640×512. Figure 8 shows the actual output after color interpolation; comparing it with the expected output in Figure 7 shows that the design of the color interpolation module fully meets the expected requirements.

Figure 7 Expected output after color interpolation

Figure 8 Actual output after color interpolation

3.2 Hardware design of color space conversion module

The relationship between YCbCr coordinates and RGB coordinates (ITU-R BT.601) is:

$$\begin{aligned}
Y   &= 0.257R + 0.504G + 0.098B + 16\\
C_b &= -0.148R - 0.291G + 0.439B + 128\\
C_r &= 0.439R - 0.368G - 0.071B + 128
\end{aligned}\tag{1}$$

For the 10-bit data used in this design, the offsets 16 and 128 scale to 64 and 512.
There are three schemes for implementing this module. The first describes the behavior of the conversion formula directly in the Verilog language. The second uses the embedded RAM in the FPGA to build multiplier lookup tables that store all possible intermediate products of the formula; the design requires nine multiplier lookup tables, each 1K deep, with the operands R, G, and B used as addresses into the memory, so that the data read out is the result of the multiplication. The speed of a lookup-table multiplier is limited by the access speed of the memory used. The third scheme improves on the first by implementing the design with a pipeline structure, which greatly increases the operating speed. This design uses the third scheme.

Pipelining is a common technique in high-speed design: it makes full use of the hardware's internal parallelism and increases data throughput. A pipeline is a sequence of functional units, each performing one step of the operation, accepting its input from and delivering its output to buffer registers. Implementing the pipeline structure is simple: a register buffer is added after the output of each arithmetic component (the multipliers and the adder-subtractors) and at the inputs and outputs of the system. The block diagram of the color space conversion using pipeline technology is shown in Figure 9.

The maximum clock frequency of a digital system is limited by the largest combinational delay between registers. Without register buffers after each arithmetic unit, the largest register-to-register delay is the delay from the input RGB signals to the output YCbCr signals; since a large combinational logic circuit lies between them, this delay is long. With the pipeline structure, the combinational logic between registers becomes smaller, so the delay shrinks and the achievable system clock frequency rises.

Figure 9 Block diagram of color space conversion using pipeline technology
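A behavioral sketch of the pipelined converter for 10-bit data, with the BT.601 coefficients scaled by 256 and one register layer per arithmetic layer. It shows three pipeline stages for brevity, whereas the design simulated below has a latency of 5 cycles; the coefficient rounding is an assumption:

```verilog
// Minimal sketch: pipelined RGB -> YCbCr (10-bit, BT.601, coefficients
// scaled by 2^8). One register layer per arithmetic layer keeps the
// register-to-register combinational paths short.
module rgb2ycbcr (
    input  wire        clk,
    input  wire [9:0]  r, g, b,
    output reg  [9:0]  y, cb, cr
);
    // Zero-extend so all arithmetic below is signed.
    wire signed [10:0] rs = {1'b0, r};
    wire signed [10:0] gs = {1'b0, g};
    wire signed [10:0] bs = {1'b0, b};

    // Stage 1: nine constant multiplications (0.257, 0.504, ... times 256).
    reg signed [18:0] yr, yg, yb, cbr, cbg, cbb, crr, crg, crb;
    always @(posedge clk) begin
        yr  <= rs * 66;   yg  <= gs * 129;  yb  <= bs * 25;
        cbr <= rs * -38;  cbg <= gs * -74;  cbb <= bs * 112;
        crr <= rs * 112;  crg <= gs * -94;  crb <= bs * -18;
    end

    // Stage 2: sum the partial products per component.
    reg signed [19:0] ysum, cbsum, crsum;
    always @(posedge clk) begin
        ysum  <= yr  + yg  + yb;
        cbsum <= cbr + cbg + cbb;
        crsum <= crr + crg + crb;
    end

    // Stage 3: drop the 8 fractional bits and add the 10-bit offsets.
    always @(posedge clk) begin
        y  <= (ysum  >>> 8) + 64;    // offset 16 scaled to 10 bits
        cb <= (cbsum >>> 8) + 512;   // offset 128 scaled to 10 bits
        cr <= (crsum >>> 8) + 512;
    end
endmodule
```

For (R, G, B) = (1023, 1023, 1023) this yields (Y, Cb, Cr) ≈ (943, 512, 512), consistent up to coefficient rounding with the (944, 514, 514) reported in the simulation below.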

The waveform simulation is shown in Figure 10. As can be seen from the waveform, the output is delayed by 5 clock cycles relative to the input, which is the result of the pipeline structure. For example, for input (R, G, B) = (1023, 1023, 1023), the output (Y, Cb, Cr) = (944, 514, 514) appears after 5 clock cycles. Although the output latency is 5 clock cycles, a new pixel color conversion still completes every clock cycle.

Figure 10 RGB2YCbCr module simulation output

In the same way, scheme 2 can be applied: the embedded RAM in the FPGA is used to construct the multiplier lookup tables and realize the YCbCr-to-RGB color space conversion. The waveform simulation is shown in Figure 11. As can be seen from the waveform, the output is delayed by 3 clock cycles relative to the input, which is the result of the register buffering. For example, for input (Y, Cb, Cr) = (944, 514, 514), the output (R, G, B) = (1023, 1021, 1023) appears after 3 clock cycles. Although the output latency is 3 clock cycles, a new pixel color conversion still completes every clock cycle.

Figure 11 YCbCr2RGB module simulation output
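A sketch of one such lookup-table multiplier built from embedded RAM: the table is filled with precomputed products, so each multiplication costs a single registered memory read. The coefficient (1.164 × 256 ≈ 298, the BT.601 luma scaling of the inverse conversion) is an illustrative assumption, not the paper's exact table contents:

```verilog
// Minimal sketch of a 1K-deep multiplier lookup table (scheme 2): the
// operand addresses the RAM and the stored word is coeff * operand.
module lut_mult (
    input  wire        clk,
    input  wire [9:0]  operand,     // Y, Cb or Cr sample used as address
    output reg  [18:0] product
);
    reg [18:0] rom [0:1023];
    integer i;
    initial
        for (i = 0; i < 1024; i = i + 1)
            rom[i] = i * 298;       // precomputed product table
    always @(posedge clk)
        product <= rom[operand];    // one multiplication per memory read
endmodule
```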

3.3 IP core design of cellular neural network

According to CNN theory, the weights in the template correspond to the eight pixels surrounding the pixel being processed, so before a pixel can be processed the eight pixels around it must first be read in; that is, the result for a pixel depends not only on the pixel itself but also on the gray values of its neighborhood. Because the CMOS image sensor used here delivers 640 pixels per line, the key to constructing the 3×3 template is building a line delayer. The pixels of the video image arrive as a non-uniformity-corrected serial data stream, so the FPGA can implement the template as a parallel pipeline [6]. The hardware structure that forms the 3×3 template is shown in Figure 12:

Figure 12 Hardware structure of the 3×3 template

As shown in the figure, the input video stream passes through the RAM-based 3×3 template and then enters the convolution module, which outputs the result. Since a pipelined mode of operation is adopted, the whole image frame need not be stored during processing; only the neighborhood pixels involved in the template operation are stored.
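A sketch of that line-delay structure: two line-deep RAM buffers plus three register taps per row expose the full 3×3 neighborhood as the video streams through, so only two lines plus three pixels are ever stored. The 640-pixel line width follows the text; the 8-bit data width is an assumption:

```verilog
// Minimal sketch of the 3x3 window generator built from two line delays.
module window_3x3 #(
    parameter LINE = 640
)(
    input  wire       clk,
    input  wire       en,             // pixel-valid strobe
    input  wire [7:0] pix_in,
    output reg  [7:0] w11, w12, w13,  // row i-2 (oldest)
    output reg  [7:0] w21, w22, w23,  // row i-1
    output reg  [7:0] w31, w32, w33   // row i   (newest)
);
    reg [7:0] line1 [0:LINE-1];       // one-line delay
    reg [7:0] line2 [0:LINE-1];       // two-line delay
    reg [9:0] addr = 0;
    wire [7:0] tap1 = line1[addr];    // pixel one line ago
    wire [7:0] tap2 = line2[addr];    // pixel two lines ago

    always @(posedge clk) if (en) begin
        line2[addr] <= tap1;          // shift the line delays along
        line1[addr] <= pix_in;
        addr <= (addr == LINE-1) ? 10'd0 : addr + 10'd1;
        // Three horizontal taps per row form the 3x3 window.
        w33 <= pix_in; w32 <= w33; w31 <= w32;
        w23 <= tap1;   w22 <= w23; w21 <= w22;
        w13 <= tap2;   w12 <= w13; w11 <= w12;
    end
endmodule
```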

Convolution implementations generally fall into two classes: MAC (multiply-accumulate) and DA (distributed arithmetic). The MAC method computes the multiplications and additions directly; many current FPGAs provide internal hardware multiplier resources for this. The distributed algorithm transforms the complex multi-bit multiplication into simple AND operations, and the transitions between bit weights into shift operations, which effectively increases the operation speed and reduces the structural complexity. The convolution calculations here are implemented with the distributed algorithm. Adopting this algorithm in the cellular neural network has the following advantages: it reduces the size of the storage unit, allows the storage-unit contents to be shared, and reduces the data-bus bit width. To save FPGA on-chip resources, a serial distributed algorithm is used [7].
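The identity behind the scheme, written here for $K$ inputs in $B$-bit two's-complement form $x_k = -x_k^{(B-1)}2^{B-1} + \sum_{b=0}^{B-2} x_k^{(b)}2^{b}$, is

$$y=\sum_{k=1}^{K} c_k x_k \;=\; \sum_{b=0}^{B-2} 2^{b}\, f\bigl(x_1^{(b)},\dots,x_K^{(b)}\bigr) \;-\; 2^{B-1} f\bigl(x_1^{(B-1)},\dots,x_K^{(B-1)}\bigr), \qquad f(b_1,\dots,b_K)=\sum_{k=1}^{K} c_k b_k,$$

so the $2^K$ possible values of $f$ are precomputed into the DA lookup table and the multipliers disappear entirely.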

The serial method works from the least significant bit: the lowest bit of all input quantities addresses the DA lookup table, yielding a partial product. This partial product is shifted one bit to the right (multiplied by $2^{-1}$) into the register while the next-lowest bits of the inputs are already addressing the DA lookup table to produce the next partial product, which is added to the previous, right-shifted one; this step repeats until all bits have been used for addressing. It is important to note that, for two's-complement inputs, the value obtained when the highest (sign) bit addresses the table is subtracted from, rather than added to, the shifted partial sum. The value that results is the required product, giving the fully serial DA mode. As can be seen from the above, one clock cycle is needed per input bit, so a complete operation takes as many clock cycles as the word length of the input data.

Figure 13 Schematic diagram of the serial distributed algorithm
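A bit-serial DA sketch for a 4-tap dot product with 8-bit two's-complement inputs. The coefficients are illustrative assumptions, and the $2^9$-entry table a full 3×3 template would need is reduced to $2^4$ combinations here for readability:

```verilog
// Minimal sketch of LSB-first serial distributed arithmetic:
// y = 3*x0 - 5*x1 + 7*x2 + 2*x3. One input bit per clock; the partial
// sum is shifted right each cycle, and the table value addressed by the
// sign bits is subtracted instead of added.
module serial_da (
    input  wire               clk,
    input  wire               start,   // load new operands
    input  wire signed [7:0]  x0, x1, x2, x3,
    output reg  signed [15:0] y,       // valid when done is high
    output reg                done
);
    reg [7:0] s0, s1, s2, s3;          // operand shift registers, LSB first
    reg [3:0] cnt;

    // DA table: dot product of the current bit slice with the coefficients.
    wire signed [7:0] lut = (s0[0] ?  8'sd3 : 8'sd0)
                          + (s1[0] ? -8'sd5 : 8'sd0)
                          + (s2[0] ?  8'sd7 : 8'sd0)
                          + (s3[0] ?  8'sd2 : 8'sd0);

    always @(posedge clk) begin
        done <= 1'b0;
        if (start) begin
            s0 <= x0; s1 <= x1; s2 <= x2; s3 <= x3;
            y   <= 16'sd0;
            cnt <= 4'd0;
        end else if (cnt < 4'd8) begin
            // Shift the accumulator right and add the rescaled table value;
            // the final (sign) bit is subtracted, per two's complement.
            y <= (cnt == 4'd7) ? (y >>> 1) - (lut <<< 7)
                               : (y >>> 1) + (lut <<< 7);
            s0 <= s0 >> 1; s1 <= s1 >> 1; s2 <= s2 >> 1; s3 <= s3 >> 1;
            cnt  <= cnt + 4'd1;
            done <= (cnt == 4'd7);     // result is in y on the next cycle
        end
    end
endmodule
```

Eight input bits take eight clock cycles per result, which is the price the serial form pays for its very small lookup table and single adder.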

3.4 VGA module

The function of this module is to display the processed signal on the monitor. The process is the inverse of that on the acquisition side: the digital signal is assembled according to the timing of the display signal, which requires the various synchronization signals for control. To verify the correctness of the VGA timing, the timing is suitably simplified; since the design is parameterized, this does not affect the system design. The simulation is shown in Figure 14 and meets the expected timing requirements.

Figure 14 VGA timing simulation diagram
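For reference, a sync generator for the standard 640×480@60 Hz mode (25 MHz pixel clock); the paper's simplified, parameterized timing would substitute its own constants:

```verilog
// Minimal sketch of a VGA sync generator (640x480@60: 800x525 total,
// active-low sync pulses at x = 656..751 and y = 490..491).
module vga_sync (
    input  wire       clk25,        // ~25 MHz pixel clock
    output reg        hsync, vsync,
    output wire       video_on,     // high inside the visible window
    output reg [9:0]  x,            // horizontal position, 0..799
    output reg [9:0]  y             // vertical position,   0..524
);
    initial begin x = 0; y = 0; end

    always @(posedge clk25) begin
        if (x == 10'd799) begin
            x <= 10'd0;
            y <= (y == 10'd524) ? 10'd0 : y + 10'd1;
        end else
            x <= x + 10'd1;
        hsync <= ~(x >= 10'd656 && x < 10'd752);  // 96-pixel sync pulse
        vsync <= ~(y >= 10'd490 && y < 10'd492);  // 2-line sync pulse
    end
    assign video_on = (x < 10'd640) && (y < 10'd480);
endmodule
```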

4 Image processing results

Figure 15 Image processing result display

Conclusion

In this paper, the infrared image preprocessing algorithm is implemented in an FPGA, which separates the structurally simple but computation-heavy algorithm from the DSP and ensures that the DSP has enough time to complete other tasks such as target recognition and tracking. FPGAs give users great freedom to implement the dedicated digital circuits they design: the peripheral circuitry is simple, the user can program the device flexibly in the field, and a high-capacity digital single-chip system can be defined on site. Because the device can be reprogrammed and modified repeatedly, new functions can be realized without changing the circuit, simply by updating the FPGA's internal program. This paper takes image edge extraction as the basic application goal and the cellular neural network as the main research object, and systematically studies the theory, application, and hardware implementation of CNN. The contributions of this paper are:

1. An algorithm applying cellular neural networks to binary images is proposed, and on this basis a CNN-based image edge extraction algorithm is developed. The simulation results show that the edges extracted by the CNN are neat and orderly, with better edge continuity.

2. An efficient digital implementation scheme for cellular neural networks is proposed, using a bit-serial distributed algorithm to perform the cell-state updates. A cellular neural network implemented with this architecture reduces both the hardware resource usage and the bus bit width. Compared with analog implementations of cellular neural networks, this digital implementation occupies a smaller area thanks to the distributed algorithm while providing a higher running speed.

3. The CNN is described in Verilog and implemented on the FPGA experimental platform. The experimental results show that the model can perform real-time edge extraction on images. The Verilog description of CNN is a step forward in the hardware implementation of CNN; for digital images, a CNN built on an FPGA can greatly improve real-time processing speed.