

AI Programming Model

Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026



📟 CUDA (Compute Unified Device Architecture)

CUDA is NVIDIA's parallel computing platform and programming model for running general-purpose code on GPUs.

  • Thousands of parallel threads
  • Native C/C++ language extensions, with Python access through libraries such as Numba and CuPy
  • General-purpose GPU computing

CUDA parallel model:

  • Break problem into small identical tasks
  • Launch thousands of threads (workers) to do them simultaneously
  • Collect results when everyone finishes
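
The split, launch, and collect pattern above can be imitated on the CPU. The following is a minimal sketch in plain Python using a thread pool; it is not CUDA code, and `double_element`, `run_parallel`, and `NUM_WORKERS` are hypothetical names chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # hypothetical value; a GPU would run thousands of threads


def double_element(value):
    """The small, identical task applied to every element."""
    return value * 2


def run_parallel(data):
    """Launch the same task over every element and collect the results."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        # pool.map returns results in input order once the workers finish.
        return list(pool.map(double_element, data))


if __name__ == "__main__":
    print(run_parallel([1, 2, 3, 4]))  # prints [2, 4, 6, 8]
```

A GPU applies the same idea at a much larger scale, with hardware scheduling instead of an operating system thread pool.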

Before CUDA, programmers had to use graphics APIs (OpenGL, DirectX) to leverage GPU power for non-graphics tasks, which was complex and inefficient. CUDA provided a more direct and flexible way to program GPUs for general-purpose computing.

graph TD
    A[CPU] -->| Graphics API ✨ | B[GPU 🧮]
    B -->|Limited Access| C[Data Processing]
    C -->|Results| A

With CUDA, developers can write code that runs on the GPU, allowing for significant performance improvements in tasks that can be parallelized, such as matrix operations, simulations, and deep learning workloads.

graph TD
    A[CPU] -->|CUDA API 📟| B[GPU 🧮]
    B -->|Parallel Threads 🪡| C[Data Processing]
    C -->|Results| A

🪡 Kernels and Threads

A kernel is the unit of CUDA code that programmers typically write and compose, akin to a procedure or function in languages targeting CPUs.

  • A kernel is a function that runs on the GPU, executed by many threads in parallel.
  • Each thread executes the kernel with a unique thread ID, allowing for data parallelism.
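// Doubles every element of data; n is the number of elements.
__global__ void myKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Calculate global thread ID
    if (idx < n) {                                   // Skip threads past the end of the array
        data[idx] = data[idx] * 2;                   // Example operation
    }
}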
  • CUDA provides a hierarchical execution model where threads are organized into blocks, and blocks are organized into a grid. This allows for scalable parallelism across a wide range of GPU architectures.
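
To make the grid, block, and thread hierarchy concrete, here is a small CPU-side simulation in Python. It is a sketch under stated assumptions, not how the GPU actually schedules work: `launch_kernel`, `double_kernel`, and `double_all` are hypothetical helpers, the blocks run one after another instead of in parallel, and only the x dimension is modeled. It also includes a bounds check so that a partially filled last block does not write past the end of the array.

```python
def launch_kernel(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, sequentially on the CPU."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def double_kernel(block_idx, block_dim, thread_idx, data, n):
    """Per-thread work: compute a global index and double one element."""
    idx = block_idx * block_dim + thread_idx  # global thread ID
    if idx < n:  # threads past the end of the array do nothing
        data[idx] *= 2


def double_all(values, block_dim=4):
    """Pick enough blocks to cover the data, then 'launch' the kernel."""
    data = list(values)
    n = len(data)
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
    launch_kernel(double_kernel, grid_dim, block_dim, data, n)
    return data


if __name__ == "__main__":
    # 10 elements with 4 threads per block need 3 blocks (12 threads);
    # the last 2 threads fail the bounds check and stay idle.
    print(double_all(range(10)))
```

The rounding-up of the block count is the usual way to cover an array whose length is not a multiple of the block size.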

1. 🧮 Graphics Processing Clusters (GPCs)

A GPC is the largest building block of the GPU, loosely comparable to a workshop that takes on a share of the overall workload and hands it to its workers (the SMs).

  • GPUs are organized into Graphics Processing Clusters (GPCs), which contain multiple Streaming Multiprocessors (SMs) and other components.
  • GPCs manage the distribution of work across SMs and handle tasks such as scheduling and memory management, ensuring efficient execution of parallel workloads on the GPU.

Example: the NVIDIA GeForce GTX 1080 has 4 GPCs, each containing 5 SMs with 128 CUDA cores per SM, for a total of 20 SMs and 2560 CUDA cores.

Each GPC contains several SMs.

2. 🥓 Streaming Multiprocessors (SMs)

Highly parallel processing units within the GPU that execute threads in warps.

  • SMs manage the execution of threads, including scheduling, synchronization, and memory access, allowing for efficient parallel processing of tasks on the GPU.

Each SM can keep multiple warps resident at the same time and, on RTX-class GPUs, also contains one Ray Tracing Unit (RTU, which NVIDIA calls an RT Core) for ray tracing workloads.

2.1 🔆 Ray Tracing Units (RTUs)

Specialized hardware for accelerating ray tracing workloads, which are common in graphics rendering and increasingly used in AI applications for tasks like 3D modeling and simulation.

  • RTUs can perform ray tracing calculations much faster than traditional CUDA cores, enabling real-time ray tracing in graphics applications and accelerating ray tracing-based algorithms in AI workloads.
  • RTUs are designed to handle the core calculations of ray tracing, such as bounding volume hierarchy (BVH) traversal and ray-triangle intersection tests, which are computationally intensive when performed on general-purpose CUDA cores.
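
To give a feel for the kind of intersection test mentioned above, here is a software version of a ray-triangle test in plain Python. It uses the Möller-Trumbore algorithm, a common software technique; this is an illustrative assumption and not a description of how the hardware implements the test. All function names are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance t along the ray to the hit point, or None on a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


if __name__ == "__main__":
    tri = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
    print(ray_hits_triangle((0.25, 0.25, 1), (0, 0, -1), *tri))  # prints 1.0
```

A rendered frame needs a very large number of such tests, which is why dedicated hardware for them pays off.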

2.2 🧶 Warps

A warp is a group of 32 threads that execute the same instruction simultaneously on the SM's CUDA Cores and Tensor Cores.

  • Threads in a warp execute the same instruction simultaneously. NVIDIA calls this Single Instruction, Multiple Threads (SIMT), a close relative of Single Instruction, Multiple Data (SIMD).
    • Divergence (different instructions) causes serialization and performance loss
    • Design algorithms to minimize divergence for optimal performance
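
The cost of divergence can be illustrated with a deliberately simplified model in Python: assume each warp needs one pass per distinct branch path taken by its 32 threads. Real hardware cost also depends on how long each path is, so treat this as a sketch only; `warp_passes` is a hypothetical helper.

```python
WARP_SIZE = 32


def warp_passes(branch_taken):
    """Count serialized passes per warp, given each thread's branch choice."""
    passes = []
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        passes.append(len(set(warp)))  # one pass per distinct path in the warp
    return passes


if __name__ == "__main__":
    # Branching on the parity of the thread ID splits every warp in two.
    divergent = [i % 2 == 0 for i in range(64)]
    print(warp_passes(divergent))  # prints [2, 2]

    # Branching on the warp number keeps each warp on a single path.
    uniform = [(i // WARP_SIZE) % 2 == 0 for i in range(64)]
    print(warp_passes(uniform))  # prints [1, 1]
```

Both conditions send half of the 64 threads down each path, but only the second arrangement avoids serialization inside a warp.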

2.2.1 🪡 CUDA Cores

General-purpose cores for executing a wide range of parallel tasks, including graphics rendering and general compute workloads.

Perform standard floating-point and integer operations, making them suitable for a variety of applications beyond graphics, such as scientific computing, machine learning, and data processing.

  • Addition and multiplication
  • Bitwise operations

Each core works on scalar values, for example a fused multiply-add:

$$a \times b + c = d$$

2.2.2 🧵 Tensor Cores

Specialized cores designed for accelerating matrix operations, particularly in deep learning workloads.

  • Perform mixed-precision matrix multiply and accumulate operations, which are common in deep learning algorithms.

$$D = A \times B + C$$

Where A, B, C, and D are matrices. Tensor Cores can perform this operation much faster than traditional CUDA cores, especially when using lower precision formats like FP16 or INT8, which are often sufficient for deep learning tasks.
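
The mixed-precision idea can be sketched in plain Python: round the inputs A and B to FP16, but keep the running sum in a wider format. This is an illustration only. The helper names `to_fp16` and `mma` are hypothetical, and the accumulator here is a Python float (FP64), whereas Tensor Cores typically accumulate in FP16 or FP32.

```python
import struct


def to_fp16(x):
    """Round a number to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", float(x)))[0]


def mma(a, b, c):
    """Return D = A x B + C with FP16-rounded inputs and a wider accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = c[i][j]  # the accumulator starts from C
            for k in range(inner):
                acc += to_fp16(a[i][k]) * to_fp16(b[k][j])
            d[i][j] = acc
    return d


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(mma(a, b, c))  # prints [[20.0, 23.0], [44.0, 51.0]]
    print(to_fp16(0.1))  # close to 0.1, but not exactly 0.1
```

Small integers survive the FP16 rounding unchanged, while a value like 0.1 picks up a visible rounding error, which is the precision that lower-precision formats trade for speed.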

 GPCs 🧮
  |--> 🥓 SMs
     |--> 🔆 Ray Tracing Units (RTUs)
     |--> 🧶 Warps (32 threads each), executed on:
         |--> 🪡 CUDA Cores
         |--> 🧵 Tensor Cores

cuDNN (CUDA Deep Neural Network library)

GPU-accelerated library for deep learning primitives.

Provides highly tuned implementations for standard routines such as:

  • forward and backward convolution
  • attention
  • matmul
  • pooling
  • normalization
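
To show what two of these primitives compute, here is a tiny reference version in plain Python. It is not the cuDNN API and makes no attempt at performance; `conv1d_forward` and `max_pool1d` are hypothetical names, and the example is restricted to one dimension to stay short.

```python
def conv1d_forward(signal, kernel):
    """Valid 1D cross-correlation, the 'convolution' used in deep learning."""
    out_len = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + k] * kernel[k] for k in range(len(kernel)))
        for i in range(out_len)
    ]


def max_pool1d(values, window):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


if __name__ == "__main__":
    print(conv1d_forward([1, 2, 3, 4, 5], [1, 0, -1]))  # prints [-2, -2, -2]
    print(max_pool1d([1, 3, 2, 5, 4, 0], 2))  # prints [3, 5, 4]
```

A library like cuDNN provides GPU-tuned versions of routines of this kind, so frameworks do not have to write them by hand.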

