Custom hardware designs optimized for a specific class of applications: flexibility is traded for higher performance and energy efficiency within a given domain.
General-purpose CPUs have hit performance limits as Moore's Law slows, so further gains must come from hardware dedicated to a particular domain. However, designing a new silicon chip costs significant time and money.
Examples:
- GPUs
- Neural / Tensor Processing Units (NPUs, TPUs)
- Crypto co-processors
- Image / Vision accelerators
## Design Guidelines

- Use on-chip memories to reduce data movement.
- Invest in arithmetic units or on-chip memory rather than control logic.
- Choose parallelism that best fits the domain (SIMD, systolic array, etc.).
- Use the most suitable data precision for the domain (see the quantization sketch after this list).
- Support a domain-specific programming language or API (TensorFlow, Halide).
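As a toy illustration of the precision guideline, here is a minimal NumPy sketch (function names are illustrative, not from any DSA toolchain) of symmetric linear quantization from float32 to int8, the precision the TPU uses:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of float32 values to int8."""
    scale = np.max(np.abs(x)) / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(256).astype(np.float32)
q, scale = quantize_int8(weights)
print("max abs error:", np.max(np.abs(weights - dequantize(q, scale))))
```

Narrow integers quarter the memory traffic relative to float32 and make multipliers far smaller, which is exactly why DSAs favor them.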
## Deep Neural Network Architectures

### Multi-Layer Perceptron (MLP)

Each neuron computes a weighted sum of its inputs followed by a nonlinear activation. Operations per layer = 2 × input_dim × output_dim, since every weight contributes one multiply and one add.
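A minimal NumPy sketch of one such layer and its operation count (the names are illustrative, not from any framework):

```python
import numpy as np

def dense_layer(x, W, b):
    """One MLP layer: weighted sum followed by a ReLU activation."""
    return np.maximum(W @ x + b, 0.0)

input_dim, output_dim = 784, 256
x = np.random.randn(input_dim)
W = np.random.randn(output_dim, input_dim)
b = np.zeros(output_dim)
y = dense_layer(x, W, b)

# Each of the output_dim neurons performs input_dim multiplies
# and input_dim adds:
ops = 2 * input_dim * output_dim
print(ops)  # 401408 = 2 * 784 * 256
```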
### Convolutional Neural Network (CNN)

- Layers extract hierarchical features (edges → corners → shapes → objects).
- High computation and high weight reuse: each filter weight is applied at every output position (see the sketch after this list).
- Core operations: matrix–matrix multiply, convolution, ReLU, sigmoid, tanh.
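A naive NumPy sketch of that reuse (single channel; a real CNN layer would add channels and batching):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution. Every kernel weight is reused
    at every output position, which is why CNNs get far more
    operations per weight than MLPs or RNNs."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.randn(32, 32)
kernel = np.random.randn(3, 3)
out = conv2d(image, kernel)
print(out.shape, out.size)  # (30, 30): each of the 9 weights is used 900 times
```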
### Recurrent Neural Network (RNN)

- Used for sequence data; recurrent connections carry state across time steps.
- Many matrix–vector multiplies per cell; operations/weight ≈ 2, so weight reuse is low (see the sketch below).
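A minimal NumPy sketch of one simple RNN step, showing where the ratio of 2 comes from: a matrix–vector multiply does one multiply and one add per weight, and each weight is touched only once per time step:

```python
import numpy as np

hidden, inp = 512, 256
W_h = np.random.randn(hidden, hidden)  # recurrent weights
W_x = np.random.randn(hidden, inp)     # input weights
h = np.zeros(hidden)
x = np.random.randn(inp)

# One simple RNN step: h' = tanh(W_h h + W_x x)
h = np.tanh(W_h @ h + W_x @ x)

weights = W_h.size + W_x.size
ops = 2 * weights               # one multiply + one add per weight
print(ops / weights)            # 2.0 operations per weight
```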
## Tensor Processing Unit (TPU)

- Google’s custom ASIC for neural-network inference.
- 256×256 8-bit matrix-multiply unit (a systolic array of 65,536 MACs).
- Uses large on-chip memory and acts as a PCIe co-processor.
- Instruction set includes:
  - Read/write host memory
  - Read weights
  - Matrix multiply / convolution
  - Activation functions
### Design Approach

- Dedicated on-chip memories (24 MiB unified buffer).
- Many arithmetic units (roughly 250× as many as a contemporary CPU).
- Exploits 2D SIMD parallelism.
- 8-bit integer arithmetic (a numeric sketch follows this list).
- Programmed via TensorFlow.
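A functional sketch of the matrix unit's arithmetic in NumPy: 8-bit operands accumulated into 32-bit results. This models only the numerics; the real unit is a systolic array, not a library call:

```python
import numpy as np

def int8_matmul(A, B):
    """8-bit integer matrix multiply with 32-bit accumulation.
    Accumulators must be wider than the operands, or the partial
    sums would overflow almost immediately."""
    assert A.dtype == np.int8 and B.dtype == np.int8
    return A.astype(np.int32) @ B.astype(np.int32)

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
B = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
C = int8_matmul(A, B)  # one 256x256 tile, the matrix unit's shape
print(C.dtype)         # int32
```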
## Microsoft Catapult

- FPGA-based accelerator for CNNs and search ranking.
- FPGAs are connected in a 6×8 torus network.
- Programmed in Verilog RTL; 3,926 ALUs and 5 MiB of on-chip memory.
- Parallelism types:
  - 2D SIMD for CNNs
  - MISD for search ranking (see the sketch after this list)
- Uses both 8-bit integers and 64-bit floating point.
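An illustrative Python sketch of MISD-style parallelism (the feature extractors here are invented, not Catapult's actual ranking stages): many different instruction streams process the same data item:

```python
# MISD in spirit: different "instruction streams" (extractors) all
# consume the same data (one document), as in search ranking.
def term_frequency(doc):
    return doc.lower().split().count("ranking")

def doc_length(doc):
    return len(doc.split())

def title_match(doc):
    return int(doc.startswith("Ranking"))

extractors = [term_frequency, doc_length, title_match]

doc = "Ranking documents requires many ranking features"
features = [f(doc) for f in extractors]  # same data, different programs
print(features)  # [2, 6, 1]
```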
## Intel Crest

- DNN training accelerator.
- 16-bit fixed-point arithmetic.
- Operates on 32×32 matrix blocks (see the blocked-multiply sketch below).
- Uses on-chip SRAM plus high-bandwidth HBM2 memory.
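A toy NumPy sketch of computing a matrix product one 32×32 block at a time; float32 stands in here for Crest's 16-bit format:

```python
import numpy as np

BLOCK = 32  # the tile size the datapath operates on

def blocked_matmul(X, Y):
    """Matrix multiply performed block by block, accumulating
    32x32 partial products, as a tiled datapath would."""
    n = X.shape[0]
    Z = np.zeros((n, n), dtype=X.dtype)
    for i in range(0, n, BLOCK):
        for j in range(0, n, BLOCK):
            for k in range(0, n, BLOCK):
                Z[i:i+BLOCK, j:j+BLOCK] += (
                    X[i:i+BLOCK, k:k+BLOCK] @ Y[k:k+BLOCK, j:j+BLOCK]
                )
    return Z

X = np.random.randn(128, 128).astype(np.float32)
Y = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(blocked_matmul(X, Y), X @ Y, atol=1e-3)
```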
## Google Pixel Visual Core

- Image-processing and computer-vision accelerator.
- Performs stencil operations for imaging (see the sketch after this list).
- Programmed in Halide (a DSL), compiled to a virtual ISA and then translated to the physical ISA.
- Optimized for energy efficiency and 2D access patterns.
- Each core: 128 KiB + 64 KiB of on-chip memory, a 16×16 array of processing elements, and 2D SIMD plus VLIW execution.
- Uses 8- and 16-bit integers.
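A minimal stencil sketch in NumPy (in production this would be written in Halide): a 3×3 box blur, the kind of 2D neighborhood access pattern the PE arrays are built for:

```python
import numpy as np

def box_blur(img):
    """3x3 box-blur stencil: each output pixel is the mean of its
    3x3 neighborhood. Shifted-slice sums mimic how a stencil
    processor reuses neighboring pixels."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    for di in range(3):
        for dj in range(3):
            out += img[di:di + H - 2, dj:dj + W - 2]
    return out / 9.0

img = np.arange(64, dtype=np.float32).reshape(8, 8)
print(box_blur(img).shape)  # (6, 6)
```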
## Common DSA Principles

| Principle | Example |
|---|---|
| Dedicated memory | TPU (unified buffer), Visual Core SRAM |
| Domain-matched parallelism | 2D SIMD, systolic arrays |
| Reduced data precision | 8- or 16-bit fixed-point |
| Domain language | TensorFlow, Halide |
| Arithmetic-focused design | Many ALUs vs small control |
## Fallacies and Pitfalls

- Fallacy: designing a custom chip always costs more than $100 million; it can be done for far less.
- Pitfall: adding performance counters as an afterthought, too late to guide the design.
- Fallacy: assuming architects are tackling the right DNN tasks; picking the wrong task wastes the design effort.
- Fallacy: a single metric such as inferences per second (IPS) summarizes DNN hardware performance.
- Pitfall: being ignorant of architecture history; past mistakes get repeated.