Sahithyan's S3 — Computer Architecture

Optimizations Beyond Pipelining

Super-pipelining: an enhancement of basic pipelining. Pipeline stages are broken into smaller sub-stages. Allows executing more operations concurrently. Allows higher clock speeds. High throughput.

Increases complexity. Pipeline hazards are more likely.

Super-scalar architecture: a CPU design that includes multiple pipelines. Some pipeline stages can be shared. Allows multiple instructions to be processed simultaneously in each clock cycle, when the instructions are independent of one another. Helps maximize resource utilization.

Increases performance. Reduces idle time. Improves throughput.

Increases hardware complexity. Not always effective. If instructions depend on each other, parallel execution is limited.

Pipeline hazard: a situation where the pipeline cannot continue simultaneous execution due to certain dependencies or conflicts.

Out-of-order execution: running instructions in a different order than the programmed order. Only possible when an instruction’s dependencies allow it. Done to increase performance. Common in super-scalar architectures.

Parallel processing: performing multiple operations simultaneously. Involves breaking down a task into smaller, independent sub-tasks that can be executed concurrently.

Improves processing speed and efficiency.

Data-level parallelism. Aka. DLP. Executes the same instruction across multiple data points. Used at application level.

Used in image processing, matrix operations.

Task-level parallelism. Aka. TLP. Executes independent tasks simultaneously. Used at application level.

Used in web servers.

Instruction-level parallelism (ILP): the processor’s ability to execute multiple instructions from a single program simultaneously. Achieved through pipelining and similar techniques. Used at architecture level.

Exploited through pipelining and speculative execution.

Hardware multithreading: executes multiple threads (of the same program) in parallel. Exploits ILP & TLP. Used at architecture level.

Used in multi-core or multi-threaded processors.

Process-level parallelism: parallel execution of tasks that are independent, typically of different programs. Used at architecture level.

Flynn’s taxonomy categorizes systems based on:

  • how many instructions can be concurrently run (single or multiple)
  • how many data streams an instruction operates on (single or multiple)

Single Instruction, Single Data. A single processing unit fetches and executes one instruction at a time on one data stream. Traditional sequential von Neumann model.

Used in single-core CPU.

Single Instruction, Multiple Data. One instruction is applied simultaneously to multiple data elements.

Mainly used in vector processors, GPUs, and data-parallel workloads like image processing, audio, and array operations.

Multiple Instruction, Single Data. Multiple processing units execute different instructions on the same data stream. Rare in practice and mostly of theoretical interest; systolic arrays are sometimes cited as an approximate example.

Used in fault-tolerant or redundant systems, where the same input is checked by diverse algorithms.

Multiple Instruction, Multiple Data. Independent processors execute different instructions on different data. Most general and widely used model.

Used in multicore CPUs, clusters, and distributed systems, supporting task-level and thread-level parallelism.

A technique to run multiple threads concurrently, using one or more cores.

  • Implementation: time-sharing or multiple processors.
  • Simultaneous Multi-threading (Hyperthreading): presents multiple logical cores per physical core.

If only 1 core is available, the OS emulates multithreading by rapidly switching between threads (time-sharing). With multiple cores, threads can run truly in parallel.

Improves performance and efficiency.

  • Synchronization
  • Access control for shared resources
  • Context Switching Overhead
    The OS must frequently switch between threads, which consumes CPU time and reduces overall efficiency.
  • Thread Safety
    Code must be designed in thread-safe manner. Otherwise may lead to data corruption or unexpected behavior.
  • Resource Contention
    Threads compete for shared resources (e.g., CPU, memory), leading to delays or performance bottlenecks.
  • Debugging
    Difficult because the execution order of threads is not predictable.

Hyper-Threading: Intel's proprietary implementation of simultaneous multithreading. Allows a single physical core to appear as 2 logical cores to the OS.

Multi-processor: two or more separate physical processors in a single system. Each processor has its own cache and memory controller.

Multi-core processor: multiple cores within a single processor, typically 2 to 8. Each thread can run on a separate core for improved performance, with fewer context switches. Designed to improve the performance of general-purpose tasks, using general-purpose cores.

Cache coherency is a challenge. Occurs when multiple cores have their own caches. If the caches have the same data, changes in one cache must be reflected in others to maintain consistency. Without proper cache coherence protocols, different cores may work with outdated data, leading to errors.

Used in consumer devices like laptops.

Serial programs run without issues in a multithreaded environment; the OS simply schedules them as a single thread.

Programs designed to be run by multiple threads must account for concurrency between threads. When multiple threads access the same shared memory, atomicity, mutual exclusion, and synchronization must be ensured.

RISC-V’s atomic extension includes instructions for multithreading.

Many-core processor: an extreme variant of the multi-core processor. Contains a large number of cores, typically 8 to 100. Designed for workloads that require massive parallel processing and benefit from a high degree of parallelism. Consists of simpler, specialized cores that are optimized for parallel processing.

Better performance than a multi-core processor on highly parallel workloads. Consumes a lot of power.

Used in specialized applications like scientific computing, machine learning, and high-performance computing (HPC).

System on a Chip. Aka. SoC. Integrates multiple components of a computer or electronic system into a single chip, including CPU, memory, I/O ports, and secondary storage.

SoCs are similar to microcontrollers but are more powerful and complex. Designed for specific applications like smartphones, tablets, and embedded systems. SoCs can run full operating systems.

Specialized processors, such as Neural Processing Units (NPUs), Tensor Processing Units (TPUs), and Data Processing Units (DPUs), handle specific tasks like machine learning and data sorting.

Short for Graphics Processing Unit. The calculations required for graphics are largely independent of one another, so performance improves roughly linearly with the number of cores.

A function that runs on a GPU is called a kernel.

Here is a vector addition implementation using GPU programming (CUDA).

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vectorAdd(int *A, int *B, int *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x; // Global index of the thread
    if (i < N) { // Ensure thread index is within bounds
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1000;

    // Allocate and initialize A, B, C on the host (CPU)
    int *A = (int *)malloc(N * sizeof(int));
    int *B = (int *)malloc(N * sizeof(int));
    int *C = (int *)malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    // Allocate memory on the GPU (device)
    int *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, N * sizeof(int));
    cudaMalloc(&d_B, N * sizeof(int));
    cudaMalloc(&d_C, N * sizeof(int));

    // Copy input data from host to device
    cudaMemcpy(d_A, A, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * sizeof(int), cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all N elements
    int blockSize = 256; // Number of threads per block
    int numBlocks = (N + blockSize - 1) / blockSize; // Number of blocks, rounded up
    vectorAdd<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N); // Kernel call

    // Copy the result back from device to host
    cudaMemcpy(C, d_C, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(A);
    free(B);
    free(C);
    return 0;
}

Short for Neural Processing Unit. A specialized accelerator designed for neural network workloads. Optimized for massively parallel matrix and convolution operations, low‑precision arithmetic, and energy‑efficient inference on edge and mobile devices.

Systolic array: a hardware architecture consisting of a network of processing elements that rhythmically compute and pass data through the system. Ideal for matrix operations and deep learning workloads.

Short for Tensor Processing Unit. An accelerator (originally by Google) tuned for tensor computations in deep learning. Uses large systolic arrays for very high‑throughput matrix operations, commonly used in datacenter training and inference with support for efficient low‑precision formats.

Short for Data Processing Unit. A specialized processor designed to handle data-centric tasks like sorting, filtering, and transforming large datasets. Optimized for high-throughput data processing workloads.

Used in cloud and distributed computing systems that scale to handle large amounts of data and parallel processing tasks.