When I left the other functions on the normal atomic add, it seemed to give a small speedup: 4.79 it/s vs 5.23 it/s.

| File |
|---|
| autograd_4bit.py |
| gradient_checkpointing.py |
| quant_cuda.cpp |
| quant_cuda_kernel.cu |