TI Edge AI - AM6xA processors with Deep Learning Accelerators and their efficiency

TI processors with a Deep Learning Accelerator

TI’s AM6xA Edge AI processors (such as AM68Ax and AM69Ax) employ a heterogeneous architecture with a special-purpose accelerator for deep learning computations, the Matrix Multiply Accelerator (MMA). Paired with TI’s own C7x Digital Signal Processor, the MMA provides efficient tensor, vector and scalar processing. The accelerator is self-contained for deep learning workloads, with no dependency on the host Arm CPU. Because model computation involves an enormous amount of data movement, the accelerator has its own DMA engine and memory subsystem connected to the same DDR as the rest of the SoC. Together with TI’s proprietary super-tiling technique, this yields up to 90% utilization of the accelerator engine and DDR bandwidth, driving power as low as possible for energy-efficient computation.
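The utilization claim can be sanity-checked with simple arithmetic. The Python sketch below estimates accelerator utilization and DDR traffic from a model's MAC count, the measured frame rate, and the bytes moved per frame; all figures are hypothetical placeholders, not measured AM6xA numbers.

```python
# Back-of-the-envelope check of accelerator utilization and DDR traffic.
# All figures below are hypothetical placeholders, not measured AM6xA numbers.
peak_tops       = 8.0    # accelerator peak throughput, tera-ops/s (1 MAC = 2 ops)
model_gmacs     = 4.0    # multiply-accumulates per inference, in giga-MACs
measured_fps    = 850.0  # frames per second observed on the target
bytes_per_frame = 12e6   # weights + activations streamed from DDR per frame (assumed)

achieved_tops = measured_fps * model_gmacs * 2 / 1e3  # giga-ops/s -> tera-ops/s
utilization   = achieved_tops / peak_tops
ddr_traffic   = measured_fps * bytes_per_frame / 1e9  # GB/s

print(f"compute utilization ~ {utilization:.0%}")     # ~85% with these numbers
print(f"DDR traffic        ~ {ddr_traffic:.1f} GB/s")
```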


MMA architecture (source: TI)

With the MMA serving as the accelerator for AI functions, the overall SoC block diagram is shown in the figure below. The architecture is similar across the Edge AI devices in the portfolio, such as the AM62A and AM68A.


AM6xA processor block diagram (source: TI)

Based on a heterogeneous architecture, the System-on-Chip (SoC) is optimized for easy programming on the multi-core Cortex-A MicroProcessing Units (MPUs), while compute-intensive tasks such as deep learning, imaging, vision, video and graphics processing are offloaded to dedicated hardware accelerators and programmable cores. Holistic system-level integration of these cores through high-bandwidth interconnects and smart memory architectures enables high throughput and energy efficiency, and pre-integration of system components keeps the system BOM optimized. Note that a cost- and power-optimized SoC like the AM62A does not include all hardware features, such as the GPU and DMPAC, or may include reduced-performance accelerator variants to lower power consumption.
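As an illustration of this offloading, the sketch below shows how a model might be dispatched to the C7x/MMA through ONNX Runtime built with TI's TIDL execution provider (as in TI's edgeai-tidl-tools flow). The provider name "TIDLExecutionProvider", the "artifacts_folder" option, and the file paths are assumptions drawn from TI's public examples; the code only runs on TI's build of ONNX Runtime with pre-compiled TIDL artifacts, and is a sketch rather than a definitive implementation.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical paths; the TIDL "artifacts" are produced by TI's offline
# model compilation step before deployment on the target.
MODEL_PATH    = "model.onnx"
ARTIFACTS_DIR = "./tidl_artifacts"

# Assumption: on TI's ONNX Runtime build, the TIDL execution provider
# dispatches supported subgraphs to the C7x/MMA, while unsupported layers
# fall back to the Arm CPU provider.
session = ort.InferenceSession(
    MODEL_PATH,
    providers=["TIDLExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"artifacts_folder": ARTIFACTS_DIR}, {}],
)

# Dummy input sized to the model's first input tensor (dynamic dims -> 1).
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})
print([o.shape for o in outputs])
```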

Deep Learning Efficiency

TOPS (tera operations per second) is the metric typically used to compare deep learning performance. However, TOPS alone cannot capture all aspects of deep learning performance, because real-world performance also depends on memory (DDR) capacity and the neural network architecture.

Actual inference time depends on how efficiently the system architecture keeps data flowing through the system. A better performance benchmark is therefore the inference time of a given model at a given input image resolution. Faster inference allows more images to be processed, resulting in higher frames per second (FPS). FPS divided by the rated TOPS (FPS/TOPS) thus indicates architectural efficiency, and FPS per watt (FPS/W) is a good benchmark of an embedded processor's energy efficiency.
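As a worked example of these metrics, the short sketch below computes FPS, FPS/TOPS and FPS/W from an inference time, a TOPS rating and a power measurement; the numbers are illustrative placeholders, not published AM6xA benchmark results.

```python
# Worked example of the two efficiency metrics; all values are
# illustrative placeholders, not published AM6xA benchmark results.
inference_time_ms = 5.0   # measured single-frame inference latency
rated_tops        = 8.0   # accelerator's advertised peak TOPS
soc_power_w       = 3.5   # measured SoC power while running the model

fps          = 1000.0 / inference_time_ms   # 200 frames per second
fps_per_tops = fps / rated_tops             # architectural efficiency
fps_per_watt = fps / soc_power_w            # energy efficiency

print(f"FPS      = {fps:.0f}")
print(f"FPS/TOPS = {fps_per_tops:.1f}")
print(f"FPS/W    = {fps_per_watt:.1f}")
```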