TI Edge AI - AM6xA processors with Deep Learning Accelerators and their efficiency

TI processors with a Deep Learning Accelerator

TI’s AM6xA Edge AI processors (such as AM68Ax and AM69Ax) employ a heterogeneous architecture with a special-purpose accelerator for deep learning computations, the Matrix Multiply Accelerator (MMA). Paired with TI’s own C7x Digital Signal Processor, the MMA provides efficient tensor, vector and scalar processing. The accelerator is self-contained for deep learning workloads, with no dependency on the host Arm CPU. Because model computation involves an enormous amount of data movement, the accelerator has its own DMA engine and memory subsystem connected to the same DDR as the rest of the SoC. Together with TI’s proprietary super-tiling technique, this yields up to 90% utilization of the accelerator engine and DDR bandwidth, driving power as low as possible for energy-efficient computation.
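The utilization claim can be sanity-checked with simple arithmetic. The Python sketch below estimates accelerator utilization and DDR traffic from a model's MAC count, the measured frame rate, and the bytes moved per frame; all figures are hypothetical placeholders, not measured AM6xA numbers.

```python
# Back-of-the-envelope check of accelerator utilization and DDR traffic.
# All figures below are hypothetical placeholders, not measured AM6xA numbers.
peak_tops       = 8.0    # accelerator peak throughput, tera-ops/s (1 MAC = 2 ops)
model_gmacs     = 4.0    # multiply-accumulates per inference, in giga-MACs
measured_fps    = 850.0  # frames per second observed on the target
bytes_per_frame = 12e6   # weights + activations streamed from DDR per frame (assumed)

achieved_tops = measured_fps * model_gmacs * 2 / 1e3  # giga-ops/s -> tera-ops/s
utilization   = achieved_tops / peak_tops
ddr_traffic   = measured_fps * bytes_per_frame / 1e9  # GB/s

print(f"compute utilization ~ {utilization:.0%}")     # ~85% with these numbers
print(f"DDR traffic        ~ {ddr_traffic:.1f} GB/s")
```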


MMA architecture (source: TI)

With the MMA serving as the accelerator for AI functions, the overall SoC block diagram is shown in the figure below. The architecture is similar across the Edge AI devices in the portfolio, such as the AM62A and AM68A.


AM6xA processor block diagram (source: TI)

Based on a heterogeneous architecture, the System-on-Chip (SoC) is optimized for easy programming on the multi-core Cortex-A MicroProcessing Units (MPUs), while compute-intensive tasks such as deep learning, imaging, vision, video and graphics processing are offloaded to dedicated hardware accelerators and programmable cores. Holistic system-level integration of these cores through high-bandwidth interconnects and smart memory architectures enables high throughput and energy efficiency, and pre-integration of system components keeps the system BOM optimized. Note that a cost- and power-optimized SoC like the AM62A does not include all hardware features, such as the GPU and DMPAC, or may include reduced-performance accelerator variants to lower power consumption.
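As an illustration of this offloading, the sketch below shows how a model might be dispatched to the C7x/MMA through ONNX Runtime built with TI's TIDL execution provider (as in TI's edgeai-tidl-tools flow). The provider name "TIDLExecutionProvider", the "artifacts_folder" option, and the file paths are assumptions drawn from TI's public examples; the code only runs on TI's build of ONNX Runtime with pre-compiled TIDL artifacts, and is a sketch rather than a definitive implementation.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical paths; the TIDL "artifacts" are produced by TI's offline
# model compilation step before deployment on the target.
MODEL_PATH    = "model.onnx"
ARTIFACTS_DIR = "./tidl_artifacts"

# Assumption: on TI's ONNX Runtime build, the TIDL execution provider
# dispatches supported subgraphs to the C7x/MMA, while unsupported layers
# fall back to the Arm CPU provider.
session = ort.InferenceSession(
    MODEL_PATH,
    providers=["TIDLExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"artifacts_folder": ARTIFACTS_DIR}, {}],
)

# Dummy input sized to the model's first input tensor (dynamic dims -> 1).
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})
print([o.shape for o in outputs])
```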

Deep Learning Efficiency

TOPS (tera operations per second) is the metric typically used to compare deep learning performance. However, TOPS alone cannot capture all aspects of deep learning performance, because real-world performance also depends on memory (DDR) capacity and the neural network architecture.

Actual inference time depends on how efficiently the system architecture keeps data flowing through the system. A better performance benchmark is therefore the inference time of a given model at a given input image resolution. Faster inference allows more images to be processed, resulting in higher frames per second (FPS). FPS divided by the rated TOPS (FPS/TOPS) thus indicates architectural efficiency, and FPS per watt (FPS/W) is a good benchmark of an embedded processor's energy efficiency.
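As a worked example of these metrics, the short sketch below computes FPS, FPS/TOPS and FPS/W from an inference time, a TOPS rating and a power measurement; the numbers are illustrative placeholders, not published AM6xA benchmark results.

```python
# Worked example of the two efficiency metrics; all values are
# illustrative placeholders, not published AM6xA benchmark results.
inference_time_ms = 5.0   # measured single-frame inference latency
rated_tops        = 8.0   # accelerator's advertised peak TOPS
soc_power_w       = 3.5   # measured SoC power while running the model

fps          = 1000.0 / inference_time_ms   # 200 frames per second
fps_per_tops = fps / rated_tops             # architectural efficiency
fps_per_watt = fps / soc_power_w            # energy efficiency

print(f"FPS      = {fps:.0f}")
print(f"FPS/TOPS = {fps_per_tops:.1f}")
print(f"FPS/W    = {fps_per_watt:.1f}")
```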