Load the cargo into the bays

Before a single core can fire, the data has to be ON the ship. Your cargo (the input array) sits on the dock — host memory, the CPU's RAM. The cores can only reach the ship's bays — device memory, the GPU's own RAM. Move the cargo across.

This is the part the rest of the course hides from you, and it's where most real GPU performance is won or lost. The dock→ship crossing runs over PCIe at ~16–32 GB/s, one to two orders of magnitude slower than the GPU's own ~1–3 TB/s memory. It's routine to leave the MAJORITY of a GPU's real-world speed on the floor (often cited as ~80%) before a single multiply happens, purely from how data is moved. Master this and you're already ahead of people who can write a correct kernel but not a fast one.
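
To put numbers on that gap, here is a back-of-envelope sketch in Python. It uses only the rough bandwidth figures quoted above; the 1 GB cargo size and the `transfer_ms` helper are hypothetical, picked for illustration, and the model ignores latency and any per-transfer overhead.

```python
# Back-of-envelope transfer times from the rough bandwidth figures in the text.
# The 1 GB payload and the helper name are hypothetical, chosen for illustration.

PCIE_GBPS = (16.0, 32.0)        # dock -> ship crossing over PCIe, GB/s (low, high)
DEVICE_GBPS = (1000.0, 3000.0)  # GPU's own memory, GB/s (i.e. ~1-3 TB/s)


def transfer_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Milliseconds to move size_gb gigabytes at bandwidth_gbps GB/s (latency ignored)."""
    return size_gb / bandwidth_gbps * 1000.0


cargo_gb = 1.0  # assumed cargo size
pcie_ms = [transfer_ms(cargo_gb, bw) for bw in PCIE_GBPS]
device_ms = [transfer_ms(cargo_gb, bw) for bw in DEVICE_GBPS]

# Slowest device memory vs fastest PCIe gives the smallest gap, and vice versa.
min_gap = DEVICE_GBPS[0] / PCIE_GBPS[1]
max_gap = DEVICE_GBPS[1] / PCIE_GBPS[0]

print(f"PCIe crossing:  {pcie_ms[1]:.1f} to {pcie_ms[0]:.1f} ms")
print(f"On-device read: {device_ms[1]:.2f} to {device_ms[0]:.2f} ms")
print(f"PCIe is {min_gap:.0f}x to {max_gap:.0f}x slower")
```

Under these assumptions the crossing costs tens of milliseconds while the same cargo can be read on-device in about a millisecond or less, a gap of roughly 30x to 190x. Every avoidable trip across the dock is expensive.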

↳ Recall: host = the dock (CPU RAM), device = the ship's bays (GPU RAM). Cores read only the bays.

YOUR TASK

1. The device bay is already allocated for you as `bay`.
2. Copy the host cargo into the bay with `ctx.enqueue_copy(dst, src)`.

💡 Destination first, source second: `ctx.enqueue_copy(bay, host_cargo)`

preflight/load_holds.mojo

PUZZLE 0a · LOAD THE HOLDS
