CUDA.2 Prog

Posted Dec 22, 2025 Updated Feb 7, 2026

By Tony Kwon

8 min read

CUDA.2 Prog

Kernels

Specifying kernels

Code for kernel is specified using __global__ indicates to the compiler that this funciton will be compiled for the GPU Kernels are functions with a void return type.

Launching Kernels

Execution Configuration: setting number of thread that will execute the kernel

두 가지 방법 존재

Tripple Chevron notation
cudaLaunchKernelEx (얘는 추후에 다루겠삼)

Triple Chevron Notation ```cpp global void vecAdd(float* A, float* B, float* C) {

}

int main() { … // Kernel invocation vecAdd«<1, 256»>(A, B, C); // first: Grid Dimension, second: block dimension // result: launches single thread block containing 256 threads. …

// 2차원 이상의 dimension을 위해서는 CUDA type dim3를 사용해
dim3 grid(16,16);
dim3 block(8,8);
MatAdd<<<grid, block>>>(A, B, C); } ```

Kernel launching은 비동기(async)야, 따라서 sync를 맞추는 방법이 필요 두가지 방법 존재

Synchronizing CPU and GPU (이 페이지 뒷부분에서 소개)
Asynchronous Execution (추후 고급 기법으로 소개)

Thread and Grid Index Intrinsics

openMP할때나 thread programming 할떄 get_thread_id()이런걸로 idx만들었던거랑 같은 개념 단지 이제 CUDA라서, thread, block, grid로 확장을 시켜야돼

threadIdx
blockDim
blockIdx
gridDim: gives the dimensions of the grid, which was specified in the execution configuration when the kernel was launched.

Each of these intrinsics는 3 component vector야, .x, .y, .z 따로 Dimension을 명시하지 않으면 기본 값은 1, Idx는 0부터 시작

  
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
     // calculate which element this thread is responsible for computing
     int workIndex = threadIdx.x + blockDim.x * blockIdx.x  // 이런식으로 사용한다. 

     if(workIndex < vectorLength)  // bound를 체크해서 array size가 multiple of the thread block size가 아니여도 작동하도록
     {
         // Perform computation
         C[workIndex] = A[workIndex] + B[workIndex];
     }
}

thread block안의 thread 몇개가 일하지 않고 막아놓는건 performance에 큰 영향을 끼치진 않아. 다만 block이 다 쉬고 그러면 문제 있는거지 vector

적당한 block size를 ceiling function을 사용해서 만들수 있, 따로 함수를 cuda에서 제공도 해줘 (<cuda/cmath> import 하기)

  
int threads = 256;
int blocks = (vectorLength + threads-1)/threads;
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

  
// vectorLength is an integer storing number of elements in the vector
int threads = 256;
int blocks = cuda::ceil_div(vectorLength, threads);
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

Memory in GPU Computing

GPU와 Host memory간 데이터 이동을 어떻게 시킬까

Unified Memory: NVIDIA Driver에게 모든 걸 맡겨, GPU나 CPU가 access를 시도할때마다 driver가 그떄 그떄 떙겨와 -> 성능 그닥 ```cpp cudaMallocManaged() cudaFree()

void unifiedMemExample(int vectorLength) { // Pointers to memory vectors float* A = nullptr; float* B = nullptr; float* C = nullptr; float* comparisonResult = (float)malloc(vectorLengthsizeof(float));

// Use unified memory to allocate buffers
cudaMallocManaged(&A, vectorLength*sizeof(float));
cudaMallocManaged(&B, vectorLength*sizeof(float));
cudaMallocManaged(&C, vectorLength*sizeof(float));

// Initialize vectors on the host
initArray(A, vectorLength);
initArray(B, vectorLength);

// Launch the kernel. Unified memory will make sure A, B, and C are
// accessible to the GPU
int threads = 256;
int blocks = cuda::ceil_div(vectorLength, threads);
vecAdd<<<blocks, threads>>>(A, B, C, vectorLength);
// Wait for the kernel to complete execution
cudaDeviceSynchronize();

// Perform computation serially on CPU for comparison
serialVecAdd(A, B, comparisonResult, vectorLength);

// Confirm that CPU and GPU got the same answer
if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
{
    printf("Unified Memory: CPU and GPU answers match\n");
}
else
{
    printf("Unified Memory: Error - CPU and GPU answers do not match\n");
}

// Clean Up
cudaFree(A);
cudaFree(B);
cudaFree(C);
free(comparisonResult);

}

2. Explicit Memory Management: CPU, GPU간 메모리를 직접 할당해줘
```cpp
void explicitMemExample(int vectorLength)
{
    // Pointers for host memory
    float* A = nullptr;
    float* B = nullptr;
    float* C = nullptr;
    float* comparisonResult = (float*)malloc(vectorLength*sizeof(float));
    
    // Pointers for device memory
    float* devA = nullptr;
    float* devB = nullptr;
    float* devC = nullptr;

    //Allocate Host Memory using cudaMallocHost API. This is best practice
    // when buffers will be used for copies between CPU and GPU memory
    // PAGE-LOCKED MEMORY를 할당해
    cudaMallocHost(&A, vectorLength*sizeof(float));
    cudaMallocHost(&B, vectorLength*sizeof(float));
    cudaMallocHost(&C, vectorLength*sizeof(float));

    // Initialize vectors on the host
    initArray(A, vectorLength);
    initArray(B, vectorLength);

    // start-allocate-and-copy
    // Allocate memory on the GPU
    cudaMalloc(&devA, vectorLength*sizeof(float));
    cudaMalloc(&devB, vectorLength*sizeof(float));
    cudaMalloc(&devC, vectorLength*sizeof(float));

    // Copy data to the GPU
    cudaMemcpy(devA, A, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(devB, B, vectorLength*sizeof(float), cudaMemcpyDefault);
    cudaMemset(devC, 0, vectorLength*sizeof(float));
    // end-allocate-and-copy

    // Launch the kernel
    int threads = 256;
    int blocks = cuda::ceil_div(vectorLength, threads);
    vecAdd<<<blocks, threads>>>(devA, devB, devC);
    // wait for kernel execution to complete
    cudaDeviceSynchronize();

    // Copy results back to host, synchronous한 함수야
    cudaMemcpy(C, devC, vectorLength*sizeof(float), cudaMemcpyDefault);

    // Perform computation serially on CPU for comparison
    serialVecAdd(A, B, comparisonResult, vectorLength);

    // Confirm that CPU and GPU got the same answer
    if(vectorApproximatelyEqual(C, comparisonResult, vectorLength))
    {
        printf("Explicit Memory: CPU and GPU answers match\n");
    }
    else
    {
        printf("Explicit Memory: Error - CPU and GPU answers to not match\n");
    }

    // clean up
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devC);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    free(comparisonResult);
} 

Page-Locked Memory가 더 빠른 이유

Pageable Memory의 경우, 우리가 원하는 데이터의 물리적인 주소가 바뀔수도 있고, swap 될수도 있어서, 복사될때 cpu거치고 DMA를 통해 복사돼
Page Locked의 경우 메모리가 고정 되여 있어서 DMA가 바로 복사해줄수 있어 => 비동기 실행 가능

Synchronization basic

__syncthreads(): 같은 thread block안에 있는 모든 thread에 대한 barrier 역할을 한다. Shared Memory를 쓸 때 필수적

오직 하나의 thread block 내부에서만 작동
다른 block이랑 할꺼면 Cooperative Groups나 Atomic Memory Functions 알아보자
성능을 위해서는 되도록 block 내부에서 끝내자.

2.1.6 Runtime Initialization, 2.1.7 Error Checking in Cuda skip for now

CUDA 컨텍스트는 보통 첫 함수 호출 시(또는 cudaInitDevice 시)에 자동으로 생성. 생성 과정이 무겁기 때문에, 타이밍을 잘 이해하고 있어야 정확한 성능 분석이 가능

2.1.9 Variable Specifiers

변수가 저장되는 물리적 메모리 위치 지정하기

지정자 (Specifier)	저장 위치 (Memory Location)	설명
`__device__`	Global Memory	GPU의 가장 큰 메모리(VRAM). 모든 스레드가 접근 가능하지만 느림.
`__constant__`	Constant Memory	읽기 전용. 모든 스레드가 같은 값을 읽을 때 캐시 효과로 매우 빠름.
`__managed__`	Unified Memory	CPU와 GPU가 데이터를 자동으로 공유. 드라이버가 알아서 필요한 곳으로 데이터를 옮겨줌(Migration).
`__shared__`	Shared Memory	블록 내 스레드끼리 공유하는 온칩 메모리. L1 캐시급으로 매우 빠름.

HPE, CUDA

This post is licensed under CC BY 4.0 by the author.