Parallelism boils down to two core ideas. Both might make your code faster, but let’s be real—most of the time, we’re just praying it doesn’t crash harder.
- Data Parallelism:
  You’ve got a pile of potatoes and two people (P1 and P2). So you say,
  “P1, take the first half. P2, take the second.”
  Both are doing the same task—just on different chunks. That’s data parallelism.
  Same code runs everywhere (SPMD: Single Program, Multiple Data), only the data changes.
- Functional Parallelism:
  Same potatoes, same two people. But now you say,
  “P1, wash them. Then hand them to P2 to chop.”
  Different tasks for different people. That’s functional parallelism.
  Work is split by function, not by data.
Data Parallelism: Do the same thing to everything
Data parallelism is about doing one thing, perfectly, across many pieces of data. You take an operation—say, addition—and apply it across an entire array, not by looping one element at a time, but by letting the hardware chew through many elements at once. This kind of parallelism shines when the operations on individual elements are independent of each other: P1 chopping potato 1 doesn’t affect P2 chopping potato 2.
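A minimal sketch of the idea in plain C++ (the function and variable names are just illustrative): the same addition is applied at every index, and no iteration depends on any other.

```cpp
#include <cstddef>

// Minimal sketch: the same operation (addition) applied to every element.
// Each index is independent, so the work can be split across threads or vector lanes.
void add(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];  // element i never depends on element j
    }
}
```

Because the iterations never talk to each other, a compiler can vectorize this loop, or a GPU can hand each index to its own thread.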
Taking this idea a notch further, we can look at how data parallelism plays out inside the hardware itself.
Under the hood, the CPU doesn’t think in individual numbers anymore. It packs data into wide vector registers: 128 bits, 256 bits, even 512 bits wide. Instead of adding two numbers, it adds eight (a 256-bit register holds eight 32-bit floats), or sixteen, in a single breath. This is what SIMD (Single Instruction, Multiple Data) means. It's minimal. It's elegant. You don't multiply effort; you multiply throughput.
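As a rough sketch, here is the same element-wise addition written with AVX intrinsics (assuming an x86 CPU with AVX, a length that is a multiple of 8, and a made-up function name):

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative sketch: one instruction adds eight 32-bit floats at a time.
// Assumes n is a multiple of 8 and the CPU/compiler flags support AVX.
void add_simd(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);  // load 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);   // 8 additions in a single instruction
        _mm256_storeu_ps(c + i, vc);         // store 8 results
    }
}
```

Same loop shape as before, but each trip through it now handles eight elements instead of one.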
Task Parallelism: Different tasks run at the same time
Each task is a distinct piece of work that runs simultaneously, each act independent yet part of the same grand spectacle. They might touch shared memory. They might have side effects. They might even compete with each other. They might blow up. At the system level, the CPU spins up threads or processes. It schedules them across cores. It juggles context switches, locks, and semaphores.
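A minimal sketch in C++ with two OS threads, each running a different task (the tasks here are placeholders):

```cpp
#include <iostream>
#include <thread>

// Placeholder tasks: in a real program these would be genuinely different jobs.
void wash() { std::cout << "washing potatoes\n"; }
void chop() { std::cout << "chopping potatoes\n"; }

int main() {
    std::thread t1(wash);  // task 1 gets its own thread
    std::thread t2(chop);  // task 2 runs concurrently on another thread
    t1.join();             // wait for both before exiting
    t2.join();
    return 0;
}
```

Run it a few times and the two lines can come out in either order; the tasks really do run at the same time.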
Core Ingredients (if you are new to CUDA)
In CUDA's mental model, the CPU (host) and the GPU (device) are separate environments. A CUDA kernel is a function that runs on the GPU. When you call a kernel, thousands of lightweight threads are launched to execute it in parallel; each thread usually works on a small piece of data. It’s called __global__ because the function becomes globally visible across two different "worlds":
- It is defined on the GPU.
- It can be called from the CPU.
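A minimal, self-contained sketch of that (the vector-add kernel, the 256-thread block size, and managed memory are illustrative choices, not anything CUDA forces on you):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: each thread adds one pair of elements.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {                                    // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory keeps the sketch short
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // enough 256-thread blocks to cover n
    cudaDeviceSynchronize();                    // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);                // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```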