Custom C++ and CUDA operators for matrix-matrix operations in PyTorch. Here I implement shared-memory cache-blocking and block-tiling for both the forward and backward kernels.
If you want to know how to write your own custom kernel, this PyTorch official tutorial is all you need :)
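For reference, below is a minimal sketch of the shared-memory cache-blocking idea applied to the forward matmul. The kernel name, tile size, and argument layout are illustrative assumptions, not the exact kernel in this repo, and it leaves out the block-tiling refinement in which each thread accumulates a small sub-tile of the output instead of a single element.

```cuda
// Sketch of a shared-memory cache-blocked matmul kernel: C = A * B,
// with row-major float matrices and square TILE x TILE thread blocks.
// Illustrative only; the names and tile size are assumptions.
#define TILE 16

__global__ void matmul_tiled(const float* __restrict__ A,
                             const float* __restrict__ B,
                             float* __restrict__ C,
                             int M, int N, int K) {
    // Shared-memory tiles of A and B, reused by every thread in the block.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread computes
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread computes
    float acc = 0.0f;

    // Slide the tile window along the shared K dimension.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Stage one tile of A and one tile of B into shared memory (guard the edges).
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        // Accumulate a partial dot product from the cached tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Staging TILE x TILE tiles of A and B in shared memory lets each element loaded from global memory be reused TILE times by the block, which is what the cache-blocking buys you over a naive one-element-per-load kernel.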
Requirements: CUDA Toolkit 12.4, PyTorch 2.4+

Implemented operators:
- Mat-Mat Mul
- Mat-Mat L1
To build and install the extension:

pip install .
To test, either open the interactive notebook test/test.ipynb or run:

python test/test_extension.py