Update README.md
parent: e946f830d4
commit: 17e6a1585f
README.md: 16 additions
@@ -34,6 +34,8 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
* Added Flash attention support. (Use --flash-attention)
* Added Triton backend to support models using groupsize and act-order. (Use --backend=triton)
* Added g_idx support in the CUDA backend (requires recompiling the CUDA kernel)
* Added xformers support
* Removed triton and flash-attn from requirements.txt for compatibility

# Requirements

gptq-for-llama <br>

@@ -102,3 +104,17 @@ python server.py
A monkey patch can be applied to the LLaMA model. To use it, download the file from [MonkeyPatch](https://github.com/lm-sys/FastChat/blob/daa9c11080ceced2bd52c3e0027e4f64b1512683/fastchat/train/llama_flash_attn_monkey_patch.py). Flash-attention is also required, and it currently does not support PyTorch 2.0.
Just add --flash-attention to use it for finetuning.
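As a rough sketch (the local file name, checkpoint path, and loader below are assumptions for illustration; the patch function name comes from the linked FastChat file), applying the patch before the model is built looks roughly like this:

```
# Sketch only: assumes the FastChat file was saved locally as
# llama_flash_attn_monkey_patch.py next to the training script.
from llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
from transformers import LlamaForCausalLM

# Swap the LLaMA attention implementation for the flash-attention version
# before the model is instantiated.
replace_llama_attn_with_flash_attn()

model = LlamaForCausalLM.from_pretrained("path/to/llama-checkpoint")  # placeholder path
```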
# Xformers
* Install

```
pip install xformers
```

* Usage

```
from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention

hijack_llama_attention()
```
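A usage note (this ordering is an assumption about the intended call pattern, not something stated in this README): call hijack_llama_attention() once at startup, before the model begins serving requests, so that subsequent LLaMA attention calls are routed through xformers.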