Update README.md
parent de3c91834e
commit 33a76b00ca

README.md | 15
@@ -42,6 +42,7 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
 * Removed triton, flash-atten from requirements.txt for compatibility
 * Removed bitsandbytes from requirements
 * Added pip installable branch based on winglian's PR
+* Added cuda backend quant attention and fused mlp from GPTQ_For_Llama.
 
 # Requirements
 gptq-for-llama <br>
@@ -133,3 +134,17 @@ pip install xformers
 from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention
 hijack_llama_attention()
 ```
+
+# Quant Attention and MLP Patch
+
+Note: Currently does not support peft lora, but can use inject_lora_layers to load simple lora with only q_proj and v_proj.<br>
+
+Usage:
+```
+from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers
+make_quant_attn(model)
+make_fused_mlp(model)
+
+# Lora
+inject_lora_layers(model, lora_path)
+```
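The new README section shows the patch calls on their own, with `model` assumed to already exist. The sketch below places them in order around a model load, purely for context: `load_quantized_llama` and the lora path are hypothetical stand-ins for whatever loader and weights the rest of the README uses, and the commit itself only specifies the three `model_attn_mlp_patch` calls.

```
# Illustrative sketch only -- load_quantized_llama and the lora path are
# hypothetical placeholders; this commit adds only the three patch calls below.
from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers

# Load a GPTQ-quantized llama model first (placeholder helper).
model = load_quantized_llama("llama-7b-4bit.safetensors")

# Swap in the cuda backend quant attention and the fused mlp,
# both taken from GPTQ_For_Llama per the new changelog bullet.
make_quant_attn(model)
make_fused_mlp(model)

# peft lora is not supported; a simple lora that only touches q_proj and
# v_proj can be injected directly instead.
inject_lora_layers(model, "loras/my-lora")  # placeholder path
```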