GPTQv2 support.
1. Adds a dependency on `triton`
2. Refactors `autograd_4bit` to include both GPTQv1 and GPTQv2 implementations
3. Introduces a new environment variable, `GPTQ_VERSION`, to select the `autograd_4bit` version
4. Fixes triton kernels
5. Performs matrix multiplications in fp16
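
As a rough illustration of the `GPTQ_VERSION` switch from item 3, the sketch below reads the variable and validates it before any implementation is chosen. The helper name `select_gptq_version`, the accepted values `1` and `2`, and the fallback default are assumptions made for this example; the PR itself does not state them, and the actual dispatch in `autograd_4bit` may differ.

```python
import os

# Assumed set of accepted values; not confirmed by the PR.
SUPPORTED_VERSIONS = (1, 2)


def select_gptq_version(env=None, default=1):
    """Return the GPTQ version requested through GPTQ_VERSION.

    `env` is any mapping of environment variables (defaults to os.environ).
    `default` is a hypothetical fallback used when the variable is unset.
    """
    env = os.environ if env is None else env
    raw = env.get("GPTQ_VERSION", str(default))
    try:
        version = int(raw)
    except ValueError:
        raise ValueError(f"GPTQ_VERSION must be an integer, got {raw!r}")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"GPTQ_VERSION must be one of {SUPPORTED_VERSIONS}, got {version}"
        )
    return version
```

Under these assumptions, running with `GPTQ_VERSION=2` set in the shell would make `select_gptq_version()` return `2`, and the caller could then import the matching implementation.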