From 33a76b00ca26de8b506f2a89740e216a0fcd3d1a Mon Sep 17 00:00:00 2001
From: John Smith
Date: Sat, 22 Apr 2023 15:58:06 +0800
Subject: [PATCH 1/2] Update README.md

---
 README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/README.md b/README.md
index e6ee248..e830cd0 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,7 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
 * Removed triton, flash-atten from requirements.txt for compatibility
 * Removed bitsandbytes from requirements
 * Added pip installable branch based on winglian's PR
+* Added CUDA-backend quantized attention and fused MLP from GPTQ_For_Llama.
 
 # Requirements
 gptq-for-llama
@@ -133,3 +134,17 @@ pip install xformers
 from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention
 hijack_llama_attention()
 ```
+
+# Quant Attention and MLP Patch
+
+Note: This patch does not currently support PEFT LoRA, but inject_lora_layers can be used to load a simple LoRA that contains only q_proj and v_proj weights.
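+
+The patch is applied to an already-loaded 4-bit GPTQ llama model. A minimal sketch of that setup, assuming this repo's autograd_4bit loader (the paths and exact call signature below are illustrative, not normative):
+
+```
+from autograd_4bit import load_llama_model_4bit_low_ram
+
+# Illustrative placeholders -- point these at your own config and checkpoint.
+config_path = './llama-13b-4bit/'
+model_path = './llama-13b-4bit.pt'
+model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
+```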
+
+Usage:
+```
+from model_attn_mlp_patch import make_quant_attn, make_fused_mlp, inject_lora_layers
+
+# Swap in the quantized fused attention from GPTQ_For_Llama
+make_quant_attn(model)
+# Swap in the fused MLP implementation
+make_fused_mlp(model)
+
+# LoRA: inject a simple LoRA (q_proj/v_proj only) into the patched model
+inject_lora_layers(model, lora_path)
+```

From 51bf10326912b670a7ab486df9e69e56cfc796f7 Mon Sep 17 00:00:00 2001
From: John Smith
Date: Sat, 22 Apr 2023 16:09:38 +0800
Subject: [PATCH 2/2] Update README.md

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index e830cd0..dbbb6ea 100644
--- a/README.md
+++ b/README.md
@@ -43,6 +43,12 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
 * Removed bitsandbytes from requirements
 * Added pip installable branch based on winglian's PR
 * Added CUDA-backend quantized attention and fused MLP from GPTQ_For_Llama.
+* Added LoRA patch for the GPTQ_For_Llama triton backend:
+
+```
+from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
+# device and dtype should match the loaded model, e.g. "cuda" and torch.float16
+inject_lora_layers(model, lora_path, device, dtype)
+```
 
 # Requirements
 gptq-for-llama