Neural network accelerators are playing catch-up with the AI applications already leaving them behind

As artificial intelligence spreads into an increasing number of fields, its use is moving from non-real-time workloads to latency-critical real-time applications, led by autonomous vehicles. Because AI has grown so quickly, however, benchmarks have fallen behind. Teams considering hardware acceleration for self-driving cars, robotics and other real-time applications are therefore using the wrong tools for platform selection, which results in significant costs, wasted power, and systems that are simply not designed for tasks like real-time inference.

The team at AImotive has a background in developing benchmarks for high-performance graphics, so we understand hardware platforms extremely well. While developing aiDrive, our full-stack self-driving software suite, we saw a substantial gap in suitable hardware, and that is why we created aiWare. In designing our IP we took into account the demands of real-time applications such as aiDrive, not only the benchmarks everyone else uses. As a result, we achieve far higher performance and lower latency than others, at a fraction of the power consumption.

The traditional approach to benchmarking neural network accelerators for computer vision is centered on a relatively simple task handled by relatively simple neural networks: image classification. An image with a resolution of 224 by 224 pixels is fed to the network, and the model must correctly identify what is in the picture. Runtime is measured, but latency is not truly considered. This is where the problems begin, because large inputs run through complex neural networks expose the inherent flaws of current embedded accelerators.
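For context, a minimal sketch of what such a benchmark typically measures is shown below. The run_inference function is a hypothetical stand-in for a vendor SDK call and the timing is simulated; the point is that the reported figure is aggregate throughput, not per-frame latency.

```python
import time
import numpy as np

# Hypothetical stand-in for a vendor's inference call; any real accelerator
# SDK has its own API. Here it just simulates a fixed amount of work.
def run_inference(batch: np.ndarray) -> np.ndarray:
    time.sleep(0.001 * len(batch))          # pretend 1 ms per image
    return np.zeros((len(batch), 1000))     # fake 1000-class logits

# The classic benchmark: 224x224 RGB images, large batch, throughput only.
BATCH = 64
images = np.random.randint(0, 256, size=(BATCH, 224, 224, 3), dtype=np.uint8)

start = time.perf_counter()
run_inference(images)
elapsed = time.perf_counter() - start

print(f"Throughput: {BATCH / elapsed:.1f} images/s")
# Note what is NOT reported: how long a single frame takes from capture to
# result, which is the number a real-time system actually cares about.
```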

Benchmarking processes must adapt to accommodate demanding new use-cases such as self-driving technology. Current accelerators are ill-equipped to handle streams of high-resolution images at a minimum of 30 frames per second. Working in a latency-critical environment means that recognition must be not only accurate but also extremely fast: only milliseconds should pass from the moment the camera senses a photon to the moment the vehicle carries out actuator commands based on the new information. Because current benchmarking processes do not take such latency-critical use-cases into consideration, hardware solutions are already lagging behind emerging applications.

To address these difficulties, new benchmarking processes should, first, use larger inputs, reflecting the growing resolution of images in real-life scenarios. Second, they should treat single-batch runtime as a key metric, to guarantee safety in latency-critical situations.
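A minimal sketch of what a measurement built around these two requirements could look like follows. Again, run_inference is a hypothetical placeholder for the accelerator under test, and the frame size is an assumed megapixel-class resolution; the key differences from the classification benchmark above are a batch size of one and per-frame, worst-case latency reporting.

```python
import time
import numpy as np

def run_inference(frame: np.ndarray) -> np.ndarray:
    """Hypothetical single-frame inference call on the accelerator under test."""
    time.sleep(0.004)                        # placeholder for real work
    return np.zeros(1000)

# Latency-oriented benchmark: one megapixel-class frame at a time (batch = 1),
# reporting per-frame latency rather than aggregate throughput.
FRAMES = 300
frame = np.random.randint(0, 256, size=(1280, 960, 3), dtype=np.uint8)  # ~1.2 MP

latencies_ms = []
for _ in range(FRAMES):
    start = time.perf_counter()
    run_inference(frame)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

latencies_ms.sort()
print(f"median latency : {latencies_ms[len(latencies_ms) // 2]:.1f} ms")
print(f"worst case     : {latencies_ms[-1]:.1f} ms")  # the safety-relevant number
```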

To fully understand these requirements, consider the following use-case. A car is traveling on a highway at 75 mph, capturing images at 30 FPS. If these images are 224 by 224 pixels, the car can see clearly only a few feet ahead; images of at least one megapixel are needed for the system to be fully aware of its surroundings. Even if the data is stored as 8-bit integers, it is impossible to keep the whole data stream in the accelerator's on-chip cache, so the hardware is constantly reading from and writing to external memory. The self-driving system is, in essence, playing a game of ping-pong with itself, using its own data.
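The back-of-the-envelope arithmetic behind this claim is sketched below. The frame size, activation blow-up factor and on-chip memory capacity are illustrative assumptions, not figures for any particular chip.

```python
# Rough figures behind the paragraph above. All parameters are illustrative
# assumptions, not measurements of a specific camera or accelerator.

FPS = 30
WIDTH, HEIGHT, CHANNELS = 1280, 800, 3      # ~1 megapixel RGB frame
BYTES_PER_VALUE = 1                         # 8-bit integer data

frame_bytes = WIDTH * HEIGHT * CHANNELS * BYTES_PER_VALUE
stream_bytes_per_s = frame_bytes * FPS

print(f"one frame         : {frame_bytes / 1e6:.1f} MB")
print(f"raw camera stream : {stream_bytes_per_s / 1e6:.1f} MB/s")

# Intermediate feature maps of a convolutional network are typically several
# times larger than the input frame; assume a modest 10x blow-up here.
ACTIVATION_BLOWUP = 10
working_set = frame_bytes * ACTIVATION_BLOWUP
ON_CHIP_SRAM = 8 * 1024 * 1024              # assume 8 MB of on-chip memory

print(f"working set       : {working_set / 1e6:.1f} MB")
print(f"fits on chip?     : {working_set <= ON_CHIP_SRAM}")
# When it does not fit, every layer boundary becomes a round trip to DRAM,
# the ping-pong game described above.
```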

However, accessing external memory is power-hungry, and future accelerators must be more efficient than current systems, which can consume over 300 W. New hardware solutions will make the greatest gains by optimizing how, and how often, external memory is accessed. One approach is to ensure that partial results are never written out and read back, and that all mathematical operations are completed in as few passes as possible. The other is to redefine the way external memory itself works, by creating an external memory solution with optimized data bandwidth and fixed data-access patterns.
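To illustrate the first approach, the sketch below compares external-memory traffic when every layer writes its partial results to DRAM against an execution in which intermediate results stay in on-chip buffers. The layer sizes are made-up round numbers, purely to show the scale of the potential saving.

```python
# Illustrative comparison of external-memory traffic with and without keeping
# partial results on chip. Layer sizes are invented, not from a real network.

layer_output_mb = [3.0, 12.0, 12.0, 6.0, 6.0, 3.0]   # feature map size per layer

# Unfused execution: every intermediate result is written to DRAM by one layer
# and read back by the next, so it crosses the memory bus twice; only the
# final output is written once.
unfused_traffic = sum(layer_output_mb[:-1]) * 2 + layer_output_mb[-1]

# Fused execution: intermediate results stay in on-chip buffers; only the
# final output leaves the chip.
fused_traffic = layer_output_mb[-1]

print(f"unfused DRAM traffic per frame: {unfused_traffic:.0f} MB")
print(f"fused DRAM traffic per frame  : {fused_traffic:.0f} MB")
print(f"reduction                     : {unfused_traffic / fused_traffic:.0f}x")
```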

A car traveling at 75 mph (roughly 120 kph) covers over 33 meters (110 feet) per second. If the vehicle takes just one second to react to an unexpected situation on the road, it has lost over 30 meters of safe space. This is one of the most latency-critical situations in which AI is used.

Several industry players believe that batch processing is a viable answer to the difficulties of high-resolution computer vision. In batch processing, several images are collected and processed together to save resources. But gathering images at 30 FPS, a batch of 64 images would take more than two seconds just to collect, let alone process. That is already more than 66 meters of road.
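The arithmetic behind these two paragraphs is spelled out below; the 64-image batch is the figure used above, and the speed conversion matches the 75 mph example.

```python
# Vehicle speed, batch collection time, and the distance covered while a
# batch is still being gathered (before any processing starts).

MPH_TO_MPS = 1609.344 / 3600.0               # miles per hour -> meters per second

speed_mps = 75 * MPH_TO_MPS                  # ~33.5 m/s
fps = 30
batch_size = 64

collection_time_s = batch_size / fps         # ~2.13 s just to fill the batch
distance_m = speed_mps * collection_time_s   # ~71 m of road

print(f"speed            : {speed_mps:.1f} m/s")
print(f"batch collection : {collection_time_s:.2f} s")
print(f"distance covered : {distance_m:.0f} m")
```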

This simple example is enough to show that batch processing is not an option in latency-critical use-cases. To increase the safety of those on the road, autonomous vehicles must not only predict what will happen around them but also react more quickly than human drivers, a task that is simply impossible with this level of delay.

Only new benchmarking methods can ensure that the development of neural network accelerators keeps pace with the needs of future technology. Such benchmarks must anticipate the challenges systems will face in the real world. Current standards pose safety risks in self-driving technologies and are not sustainable methods for measuring the performance of neural network accelerators for embedded computer vision.

The complexity of new use-cases, such as autonomous driving, means that new, more sophisticated benchmarking methods are the only way to guarantee hardware solutions will cope with the evolving applications of artificial intelligence.