DefTruth / CUDA-Learn-Notes
🎉 CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated as time allows: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc. (a minimal warp-reduce sketch follows this list)
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
LLM training in simple, raw C/CUDA
Original reference implementation of the CUDA rasterizer from the paper "StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering"
CUDA Kernel Benchmarking Library
Tile primitives for speedy kernels
cuGraph - RAPIDS Graph Analytics Library
CUDA accelerated rasterization of gaussian splatting
FlashInfer: Kernel Library for LLM Serving
CUDA Library Samples
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
RAPIDS Accelerator JNI for Apache Spark
How to optimize common algorithms in CUDA.
Causal depthwise conv1d in CUDA, with a PyTorch interface
A massively parallel, optimal functional runtime in Rust
NCCL Tests
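The first entry above (CUDA-Learn-Notes) names classic hand-written kernels such as warp reduce and softmax. As an illustration only, and not code taken from that repository, here is a minimal sketch of the warp-reduce-sum pattern built on `__shfl_down_sync`; the names `warp_reduce_sum` and `reduce_demo` are made up for this example.

```cuda
// Minimal warp-reduce sketch (illustrative; not from CUDA-Learn-Notes).
// Each of the 32 lanes in a warp contributes one float; lane 0 ends up
// holding the sum of all 32 values.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float warp_reduce_sum(float val) {
    // Shift-based reduction across the full 32-lane warp.
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // result is valid on lane 0 only
}

__global__ void reduce_demo(const float* in, float* out) {
    float sum = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = sum;
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    reduce_demo<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Compiled with `nvcc reduce_demo.cu -o reduce_demo`, the demo launches a single 32-thread warp over 32 ones and should print 32; block-level reductions in the repositories above typically layer a shared-memory step on top of this same warp primitive.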