Ch1: Introduction
📅 09/12/2025
Modern computer systems are built around the CPU, which is central to modern life. It takes in a program and executes its instructions sequentially. CPUs stopped getting dramatically faster due to power and heat issues, but Moore's law continues: more transistors on the chip now means more cores.
Enter parallelism with the GPU: a chip made of many small compute units, where each unit is responsible for a single computation.
CPU Memory Layout
- Large caches — CPUs have big L1, L2, and L3 caches to hide memory latency
- Optimized for low latency — designed to get one piece of data as fast as possible
- Cache-friendly access — works well even with somewhat random access patterns because the cache “saves” recently used data
GPU Memory Layout
- Small caches per core — each core has much less cache (thousands of cores sharing limited cache)
- Optimized for high throughput — designed to deliver lots of data, not necessarily fast for any single request
- Global memory is slow — the main GPU memory (VRAM) has high latency (hundreds of cycles)
- Coalesced access is critical — when adjacent threads access adjacent memory locations, the GPU can combine these into one efficient transaction. Random access patterns kill performance.
- Shared memory — a small, fast, programmer-controlled memory shared within a thread block (like a manual cache); both this and coalescing are sketched in the kernel below
CPUs hide memory latency with big caches. GPUs tolerate memory latency by having thousands of threads — while some threads wait for data, others execute.
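To make this concrete, here is a minimal CUDA sketch (kernel names and the tile size are my own, not from any particular codebase). The first kernel shows a coalesced access pattern; the second uses shared memory as a manual cache within a block.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // threads per block (an assumption for this sketch)

// Coalesced access: thread i touches element i, so adjacent threads
// hit adjacent addresses and the hardware merges them into few transactions.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Shared memory as a manual cache: each block stages one tile with a
// coalesced load, then reads it back in reversed order from fast on-chip
// memory. Assumes n is a multiple of TILE to keep the sketch short.
__global__ void reverse_blocks(const float* in, float* out) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];               // coalesced global load
    __syncthreads();                         // whole tile must be staged first
    out[i] = tile[TILE - 1 - threadIdx.x];   // out-of-order read, but from shared memory
}
```

Launched with TILE threads per block and one thread per element, both the global load and the global store stay coalesced; only the reversed read happens out of order, and it hits fast shared memory instead of VRAM.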
Many problems can benefit from parallelization, but there are several bottlenecks to be aware of before writing efficient GPU code:
1. Data transfer: moving data back and forth between the CPU and GPU is time consuming
2. Amdahl's Law: not all parts of a program can be parallelized, and the sequential portion dominates the execution
3. Race conditions: threads accessing the same data unpredictably, leaving it in an invalid state (see the sketch below)
4. Synchronization overhead: coordinating threads takes time, since some threads are faster and some are slower
5. Load balancing: all processors should have roughly equal work
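A minimal sketch of the race condition point (kernel names are my own): thousands of threads increment a single counter. The plain increment is a read-modify-write, so threads overwrite each other's updates; `atomicAdd` makes the update indivisible and yields the correct count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Racy: (*counter)++ is a read-modify-write, so two threads can read the
// same old value and both write back old+1, losing one increment.
__global__ void count_racy(int* counter) {
    (*counter)++;
}

// Safe: atomicAdd makes the read-modify-write indivisible.
__global__ void count_atomic(int* counter) {
    atomicAdd(counter, 1);
}

int main() {
    int* counter;
    cudaMallocManaged(&counter, sizeof(int));

    *counter = 0;
    count_racy<<<256, 256>>>(counter);   // 65536 threads, one shared counter
    cudaDeviceSynchronize();
    printf("racy:   %d (expected 65536)\n", *counter);

    *counter = 0;
    count_atomic<<<256, 256>>>(counter);
    cudaDeviceSynchronize();
    printf("atomic: %d (expected 65536)\n", *counter);

    cudaFree(counter);
    return 0;
}
```

On real hardware the racy kernel typically prints far less than 65536. The fix serializes updates to that one address, which is exactly the synchronization overhead listed above.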
Amdahl’s Law
Amdahl’s Law tells us the maximum speedup achievable by parallelizing a program.
Speedup = 1 / ((1 - P) + P / N)
Where:
- P = fraction of the program that can be parallelized
- N = number of processors
Imagine a program that takes 100 seconds on one processor, and 80% can be parallelized (P = 0.8).
| Processors (N) | Calculation | Speedup | New Time |
|---|---|---|---|
| 1 | 1 / (0.2 + 0.8/1) | 1× | 100s |
| 4 | 1 / (0.2 + 0.8/4) | 2.5× | 40s |
| 16 | 1 / (0.2 + 0.8/16) | 4× | 25s |
| 100 | 1 / (0.2 + 0.8/100) | ~4.8× | ~21s |
| ∞ | 1 / (0.2 + 0) | 5× | 20s |
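The table values fall straight out of the formula; a tiny host-side helper (my own, for illustration) reproduces them:

```cuda
#include <cstdio>
#include <initializer_list>

// Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
double speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double P = 0.8;  // parallel fraction from the example above
    for (double n : {1.0, 4.0, 16.0, 100.0}) {
        printf("N = %6.0f -> speedup %.2fx, new time %.1fs\n",
               n, speedup(P, n), 100.0 / speedup(P, n));
    }
    // As N -> infinity, P/N -> 0, so the limit is 1 / (1 - P) = 5x.
    return 0;
}
```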
Even with infinite processors, the maximum speedup is only 5× because the sequential 20% (20 seconds) can never be parallelized.
This is why minimizing sequential code is critical in GPU programming!