Sahithyan's S3 — Computer Architecture

Domain Specific Architectures

Custom hardware designs optimized for a specific class of applications. Flexibility is traded for higher performance and energy efficiency in a given domain.

General-purpose CPUs have reached performance limits despite Moore’s Law. Further performance gains can come from hardware dedicated to a given domain; however, designing a new silicon chip costs significant time and money.

Examples:

  • GPUs
  • Neural / Tensor Processing Units (NPUs, TPUs)
  • Crypto co-processors
  • Image / Vision accelerators

Design guidelines:

  • Use on-chip memories to reduce data movement.
  • Invest in arithmetic units or on-chip memory rather than control logic.
  • Choose parallelism that best fits the domain (SIMD, systolic array, etc.).
  • Use the most suitable data precision.
  • Support a domain-specific programming language or API (TensorFlow, Halide).

Deep Neural Networks

Multi-layer perceptron (MLP):

  • Each neuron computes a weighted sum of its inputs followed by an activation function.
  • Operations per layer = 2 × input_dim × output_dim (one multiply and one add per weight).

Convolutional neural network (CNN):

  • Layers extract hierarchical features (edges → corners → shapes → objects).
  • High computation with heavy weight reuse.

Recurrent neural network (RNN / LSTM):

  • Used for sequence data; recurrent connections carry state across time steps.
  • Many matrix–vector multiplies per cell; operations/weight ≈ 2.

Core operations across these networks: matrix–matrix multiply, convolution, and the activation functions ReLU, sigmoid, and tanh.
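The per-layer arithmetic above can be sketched in plain Python. This is a toy illustration; function and variable names are my own:

```python
def mlp_layer(x, W, b):
    """One fully connected layer: weighted sum per neuron, then ReLU.

    x: list of input_dim floats
    W: output_dim x input_dim weight matrix (list of rows)
    b: list of output_dim biases
    """
    out = []
    for weights, bias in zip(W, b):
        # Weighted sum: one multiply and one add per weight.
        s = sum(w * xi for w, xi in zip(weights, x)) + bias
        out.append(max(0.0, s))  # ReLU activation
    return out


def ops_per_layer(input_dim, output_dim):
    # One multiply + one add per weight => 2 * input_dim * output_dim.
    return 2 * input_dim * output_dim
```

For a layer with 3 inputs and 2 outputs, `ops_per_layer(3, 2)` gives 12 operations, matching the formula above.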

Google TPU:

  • Google’s ASIC for neural networks.
  • 256×256 matrix-multiply unit operating on 8-bit integers (organized as a systolic array).
  • Uses large on-chip memory and acts as a PCIe co-processor.
  • Instruction set includes:
    • Read/Write host memory
    • Read weights
    • Matrix multiply / convolution
    • Activation functions
  • Dedicated memories (24 MiB buffer).
  • Many arithmetic units (250× vs CPU).
  • Exploits 2D SIMD parallelism.
  • 8-bit integers.
  • Programmed via TensorFlow.
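The reduced-precision idea behind the 8-bit datapath can be sketched in plain Python: quantize floats to signed 8-bit integers, multiply, and accumulate in wider integers. This is a toy illustration, not the TPU's actual pipeline; the scale value and names are made up:

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers (round, then clamp to [-128, 127])."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_matvec(W_q, x_q):
    """Matrix-vector multiply on int8 operands, accumulating in wide integers
    (Python ints are arbitrary precision; hardware would use int32 accumulators)."""
    return [sum(w * x for w, x in zip(row, x_q)) for row in W_q]


scale = 0.05
x_q = quantize([0.1, -0.2, 0.3], scale)          # -> [2, -4, 6]
W_q = [quantize([0.05, 0.05, 0.05], scale),      # -> [1, 1, 1]
       quantize([-0.1, 0.0, 0.1], scale)]        # -> [-2, 0, 2]
acc = int8_matvec(W_q, x_q)                      # integer accumulators
result = [a * scale * scale for a in acc]        # rescale back to floats
```

The multiplies happen entirely on small integers, which is what makes narrow hardware multipliers (and many of them) affordable.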

Microsoft Catapult:

  • FPGA-based accelerator for CNNs and search ranking.
  • FPGAs connected in a 6×8 torus network.
  • Programmed in Verilog RTL; 3926 ALUs, 5 MiB memory.
  • Parallelism types:
    • 2D SIMD for CNNs
    • MISD for search ranking
  • Uses both 8-bit integers and 64-bit floating point.

Intel Crest:

  • DNN training accelerator.
  • 16-bit fixed-point arithmetic.
  • Works on 32×32 matrix blocks.
  • Uses SRAM + high-bandwidth HBM2 memory.
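Working on fixed-size matrix blocks lets each tile stay resident in fast on-chip memory while it is reused. A rough Python sketch of blocked matrix multiply (the block size is configurable; names are illustrative):

```python
def blocked_matmul(A, B, block=32):
    """Multiply n x n matrices by iterating over block x block tiles,
    so each pair of tiles can stay resident in fast on-chip memory."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Multiply one pair of tiles; accumulate into the C tile.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + block, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is identical to a plain triple loop; only the traversal order changes, trading no extra arithmetic for far fewer trips to slow off-chip memory.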

Google Pixel Visual Core:

  • Image processing and vision accelerator.
  • Performs stencil operations for imaging.
  • Uses Halide (DSL) → compiled to virtual and physical ISAs.
  • Optimized for energy and 2D access patterns.
  • Each core: 128+64 MiB memory, 16×16 2D array processing elements, 2D SIMD & VLIW.
  • Uses 8- and 16-bit integers.
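A stencil operation updates each pixel from a small fixed neighborhood. A minimal 3×3 box-blur sketch in plain Python (illustrative only, not Halide):

```python
def box_blur_3x3(img):
    """Average each interior pixel with its 3x3 neighborhood.

    img: 2D list of numbers; border pixels are left unchanged.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # copy so the input stays untouched
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            s = sum(img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = s / 9.0
    return out
```

Because every output pixel touches the same small, regular neighborhood, the 2D access pattern maps naturally onto a 16×16 array of processing elements.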

Principle                     Example
Dedicated memory              TPU unified buffer, Visual Core SRAM
Domain-matched parallelism    2D SIMD, systolic arrays
Reduced data precision        8- or 16-bit fixed point
Domain-specific language      TensorFlow, Halide
Arithmetic-focused design     Many ALUs vs. small control logic

Fallacies and pitfalls:

  • Designing custom chips costs >$100M.
  • Performance counters added too late.
  • Wrong DNN task selection can waste effort.
  • Metrics like IPS may not capture full performance.
  • Ignoring past architectural lessons leads to repeat mistakes.