CNN learning paradigm
DNNs are characterised by four key factors: accuracy, network topology, data type, and the size of the data layer. These factors directly affect inference, which is what the hardware accelerator actually executes. The accelerator is in turn characterised by four parameters: performance, power consumption, memory bandwidth, and cost.
In current state-of-the-art neural network design, there are two main paths forward: increase accuracy by increasing the compute performance, measured in multiply-accumulate (MAC) operations, or reduce the required performance at the same accuracy level. As a rule of thumb, raising accuracy by 5 percent currently requires roughly a tenfold increase in performance.
Another major research area aims to limit this growth in required performance by reducing the data type to integer or even single-bit representations. How far the data type can be reduced depends strongly on the problem being solved. Nevertheless, the current state of the art shows that 16-bit fixed point incurs an accuracy loss of about 1% compared with 32-bit floating point.
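As a rough illustration of what a 16-bit fixed-point data type means in practice, the following Python sketch quantises values to a signed 16-bit format and accumulates their products in a wider integer. The 12 fractional bits, the helper names, and the sample values are assumptions chosen for illustration and are not taken from any particular accelerator. Note that the numerical error it prints is not the same quantity as the network accuracy loss quoted above.

```python
FRAC_BITS = 12  # assumed split: 1 sign bit, 3 integer bits, 12 fractional bits


def to_fixed16(x, frac_bits=FRAC_BITS):
    """Quantise a float to a signed 16-bit fixed-point integer, saturating on overflow."""
    q = round(x * (1 << frac_bits))
    return max(-32768, min(32767, q))


def fixed_dot(weights, inputs, frac_bits=FRAC_BITS):
    """Dot product built from integer MACs; the accumulator is wider than 16 bits."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += to_fixed16(w, frac_bits) * to_fixed16(x, frac_bits)  # one integer MAC
    # Each product carries 2 * frac_bits fractional bits.
    return acc / (1 << (2 * frac_bits))


# Hypothetical sample data, for illustration only.
weights = [0.5, -0.25, 0.125, 0.3]
inputs = [1.0, 2.0, -1.5, 0.7]

reference = sum(w * x for w, x in zip(weights, inputs))  # floating-point result
approx = fixed_dot(weights, inputs)                      # fixed-point result
print(f"float: {reference:.6f}  fixed: {approx:.6f}  error: {abs(reference - approx):.2e}")
```

Saturating instead of wrapping on overflow is a common design choice for fixed-point arithmetic, because a clipped value is usually less harmful to the result than a sign flip.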
Looking at energy consumption in the inference layer, two factors are particularly costly: memory accesses and floating-point computations.
- A 32-bit read access to DRAM consumes 640 pJ (picojoules), while an SRAM access needs 5 pJ.
- A 32-bit floating-point multiplication consumes 3.7 pJ, while an 8-bit integer multiplication requires only 0.2 pJ.
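The figures in the list above can be put side by side with a small Python sketch. The dictionary keys, function names, and the example workload (50 operand reads and 25 multiplications) are illustrative assumptions; only the per-operation energies come from the text.

```python
# Energy per operation in picojoules, as quoted in the text.
ENERGY_PJ = {
    "dram_read_32b": 640.0,
    "sram_read_32b": 5.0,
    "fp32_mul": 3.7,
    "int8_mul": 0.2,
}


def energy_ratio(expensive, cheap):
    """How many times more energy one operation needs than another."""
    return ENERGY_PJ[expensive] / ENERGY_PJ[cheap]


def total_energy_pj(op_counts):
    """Sum the energy of a workload given as {operation name: count}."""
    return sum(ENERGY_PJ[op] * n for op, n in op_counts.items())


memory_ratio = energy_ratio("dram_read_32b", "sram_read_32b")
multiply_ratio = energy_ratio("fp32_mul", "int8_mul")

# Hypothetical workload: 50 operand reads and 25 multiplications,
# once with DRAM reads and float multiplies, once with SRAM reads and int8 multiplies.
from_dram_fp32 = total_energy_pj({"dram_read_32b": 50, "fp32_mul": 25})
from_sram_int8 = total_energy_pj({"sram_read_32b": 50, "int8_mul": 25})

print(f"DRAM vs SRAM read: {memory_ratio:.1f}x")
print(f"fp32 vs int8 multiply: {multiply_ratio:.1f}x")
print(f"workload: {from_dram_fp32:.1f} pJ vs {from_sram_int8:.1f} pJ")
```

With these numbers, memory placement matters far more than the arithmetic data type, which is why minimising off-chip accesses is the first target.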
To achieve the lowest possible power consumption for embedded systems, inference engines will specialise in integer computation (16-bit, or possibly 8-bit at the cost of a higher accuracy loss) and a memory-free architecture (minimising accesses to DDR and to local SRAM).
CNN inference paradigm
Traditional computing architectures, such as CPUs and GPUs, are currently the mainstream for both learning and inference of CNNs, thanks to their high performance and high flexibility. However, they are not efficient, especially from a power consumption point of view.
For a 5x5 convolution filter, a total of 50 reads (data and filter coefficients), 25 MACs, and one write-back are necessary. This means that about three instructions are required per MAC, so the instruction efficiency is only around 30 percent. However, this is usually hidden by architectural improvements such as VLIW or superscalar execution.
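The instruction count for the 5x5 case can be reproduced with a short Python sketch. It assumes, as the text implies, two reads per MAC (one data value and one coefficient) and a single write-back per output value; the function name and the returned field names are hypothetical.

```python
def conv_instruction_budget(kernel_h, kernel_w):
    """Count instructions for one output value of a kernel_h x kernel_w convolution."""
    macs = kernel_h * kernel_w
    reads = 2 * macs  # assumed: one data value and one coefficient per MAC
    writes = 1        # one write-back of the accumulated result
    total = reads + macs + writes
    return {
        "reads": reads,
        "macs": macs,
        "writes": writes,
        "total": total,
        "instructions_per_mac": total / macs,
        "efficiency": macs / total,  # share of instructions doing useful arithmetic
    }


budget = conv_instruction_budget(5, 5)
print(budget["total"], "instructions,",
      f"{budget['instructions_per_mac']:.2f} per MAC,",
      f"{budget['efficiency']:.0%} efficiency")
```

For the 5x5 filter this gives 76 instructions in total, which is where the figures of roughly three instructions per MAC and about 30 percent efficiency come from.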
From the energy point of view, this leads to around 425 pJ for floating-point computation, of which 60 percent is due to the actual floating-point MAC operations, assuming the data is in local caches. Moving to 16-bit fixed point, the energy consumption drops to 276 pJ, and only 10 percent of this is due to the actual MAC operations. Because most of the remaining energy goes to overhead rather than arithmetic, an optimised CNN architecture can provide an improvement of around a factor of 20 compared to traditional CPU/GPU architectures.
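Taking the quoted totals and MAC shares at face value, a minimal Python sketch shows how much of the energy is spent on arithmetic in each case. The function and variable names are hypothetical, and the sketch does not attempt to derive the factor of 20, which depends on architectural details not given here.

```python
def mac_energy_pj(total_pj, mac_share):
    """Energy spent on the MAC operations themselves, given their share of the total."""
    return total_pj * mac_share


# Totals and MAC shares as quoted in the text.
fp32_total, fp32_share = 425.0, 0.60
fx16_total, fx16_share = 276.0, 0.10

fp32_mac = mac_energy_pj(fp32_total, fp32_share)
fx16_mac = mac_energy_pj(fx16_total, fx16_share)
fx16_non_mac = fx16_total - fx16_mac   # energy not spent on arithmetic
total_saving = fp32_total / fx16_total  # gain from the data type change alone

print(f"MAC energy: {fp32_mac:.1f} pJ (fp32) vs {fx16_mac:.1f} pJ (16-bit fixed point)")
print(f"non-MAC energy at 16-bit: {fx16_non_mac:.1f} pJ")
print(f"total saving from data type alone: {total_saving:.2f}x")
```

Changing the data type on a conventional processor reduces the total by only about 1.5 times, since the arithmetic is then a small part of the budget; the larger gains have to come from removing the non-MAC overhead.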