Update README.md

John Smith 2023-04-22 15:58:06 +08:00 committed by GitHub
parent de3c91834e
commit 33a76b00ca
1 changed file with 15 additions and 0 deletions

@@ -42,6 +42,7 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
* Removed triton and flash-attn from requirements.txt for compatibility
* Removed bitsandbytes from requirements
* Added pip installable branch based on winglian's PR
* Added CUDA-backend quant attention and fused MLP from GPTQ_For_Llama.
# Requirements
gptq-for-llama <br>
@@ -133,3 +134,17 @@ pip install xformers
from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention
hijack_llama_attention()
```
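For reference, a minimal sketch of using the hijack with a standard transformers LLaMA checkpoint is shown below; the model path and generation settings are illustrative placeholders, not part of this repo.
```
# Illustrative sketch: apply the xformers attention hijack, then load and run a
# LLaMA model with transformers. "path/to/llama" is a placeholder path.
from transformers import AutoModelForCausalLM, AutoTokenizer
from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention

hijack_llama_attention()  # patch LLaMA attention to use xformers

tokenizer = AutoTokenizer.from_pretrained("path/to/llama")
model = AutoModelForCausalLM.from_pretrained("path/to/llama").half().cuda()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```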
# Quant Attention and MLP Patch
Note: PEFT LoRA is not currently supported, but inject_lora_layers can load a simple LoRA that only targets q_proj and v_proj.<br>
Usage:
```
from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers

# Patch attention with the CUDA-backend quant kernels, then fuse the MLP
make_quant_attn(model)
make_fused_mlp(model)

# Optional: load a simple LoRA (q_proj and v_proj only)
inject_lora_layers(model, lora_path)
```
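An end-to-end sketch of where the patches fit is shown below. How the GPTQ-quantized model is loaded depends on your gptq-for-llama setup, so that step and the LoRA path are placeholders rather than documented API.
```
# Sketch only: the model-loading step and the LoRA path are placeholders.
from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers

model = ...  # a GPTQ-quantized LLaMA model, loaded via your gptq-for-llama setup

make_quant_attn(model)   # apply the CUDA-backend quant attention patch
make_fused_mlp(model)    # apply the fused MLP patch

# Optional: a simple LoRA containing only q_proj and v_proj weights
inject_lora_layers(model, "loras/example-lora")
```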