Load the cargo into the bays
Before a single core can fire, the data has to be ON the ship. Your cargo (the input array) sits on the dock: host memory, the CPU's RAM. The cores can only reach the ship's bays: device memory, the GPU's own RAM. Move the cargo across.
This is the part the rest of the course hides from you, and it's where most real GPU performance is won or lost. The dock-to-ship crossing runs over PCIe at ~16–32 GB/s, one to two orders of magnitude slower than the GPU's own ~1–3 TB/s memory. It's routine to leave the MAJORITY of a GPU's real-world speed on the floor (often cited as ~80%) before a single multiply happens, purely from how data is moved. Master this and you're already ahead of people who can write a correct kernel but not a fast one.
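To get a feel for that gap, here is a rough back-of-envelope sketch in Python. It uses only the approximate bandwidth figures quoted above; the 4 GB cargo size and the helper name `seconds_to_move` are made up for illustration, and the model ignores latency and protocol overhead, so treat the numbers as ballpark, not a benchmark.

```python
# Back-of-envelope: crossing the dock vs. reading cargo already on the ship.
# Bandwidths are the rough figures from the text; the cargo size is hypothetical.

GB = 1e9   # bytes
TB = 1e12  # bytes


def seconds_to_move(num_bytes, bytes_per_second):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bytes_per_second


cargo_bytes = 4 * GB  # hypothetical input array size

pcie_slow, pcie_fast = 16 * GB, 32 * GB  # dock-to-ship crossing (PCIe)
bay_slow, bay_fast = 1 * TB, 3 * TB      # GPU reading its own memory

# (best case, worst case) in milliseconds
crossing_ms = (
    seconds_to_move(cargo_bytes, pcie_fast) * 1e3,
    seconds_to_move(cargo_bytes, pcie_slow) * 1e3,
)
onboard_ms = (
    seconds_to_move(cargo_bytes, bay_fast) * 1e3,
    seconds_to_move(cargo_bytes, bay_slow) * 1e3,
)

# How many times slower the crossing is than on-board memory
smallest_gap = bay_slow / pcie_fast  # slowest GPU memory vs. fastest PCIe
largest_gap = bay_fast / pcie_slow   # fastest GPU memory vs. slowest PCIe

print(f"Crossing the dock: {crossing_ms[0]:.0f} to {crossing_ms[1]:.0f} ms")
print(f"Reading on board:  {onboard_ms[0]:.1f} to {onboard_ms[1]:.1f} ms")
print(f"Crossing is roughly {smallest_gap:.0f}x to {largest_gap:.0f}x slower")
```

Under these assumptions the crossing costs on the order of a hundred milliseconds or more, while the same bytes can be read on board in a few milliseconds. That is why every unnecessary trip across the dock matters.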
Recall: host = the dock (CPU RAM), device = the ship's bays (GPU RAM). Cores read only the bays.
1. The device bay is already allocated for you as `bay`.
2. Copy the host cargo into the bay with `ctx.enqueue_copy(dst, src)`.