```
pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@winglian-setup_pip
```
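
If the install succeeded, the package should be importable. A minimal smoke test; the module name alpaca_lora_4bit is an assumption based on the repository name, not confirmed by this README:

```
# Hypothetical check: the module name "alpaca_lora_4bit" is assumed
# from the repo name; adjust the name if the branch installs differently.
import importlib.util

spec = importlib.util.find_spec("alpaca_lora_4bit")
print("installed" if spec is not None else "not found")
```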

# Model Server

Better inference performance with text_generation_webui: about <b>40% faster</b>.

<b>Steps:</b>
1. Run the model server process.
2. Run the webui process with the monkey patch applied.

<b>Example</b>

run_server.sh
```
#!/bin/bash

export PYTHONPATH=$PYTHONPATH:./

# Set these paths for your environment before running the server.
CONFIG_PATH=
MODEL_PATH=
LORA_PATH=

VENV_PATH=
source $VENV_PATH/bin/activate
python ./scripts/run_server.py --config_path $CONFIG_PATH --model_path $MODEL_PATH --lora_path $LORA_PATH --groupsize=128 --quant_attn --port 5555 --pub_port 5556
```
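
The --port/--pub_port pair suggests a request socket plus a separate publish socket for streamed output. Purely as a hypothetical sketch of talking to such a server, assuming ZeroMQ transport and a JSON request shape (neither is confirmed by this README, and the field names are invented):

```
# Hypothetical client sketch. Assumptions not confirmed by this README:
# ZeroMQ REQ/REP on --port, PUB/SUB for streaming on --pub_port, and an
# invented JSON request shape.
import zmq

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")
req.send_json({"prompt": "Hello", "max_new_tokens": 32})
print(req.recv_json())
```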

run_webui.sh
```
#!/bin/bash

# Rebuild server2.py: prepend the monkey patch import so it runs before
# the stock server code.
if [ -f "server2.py" ]; then
    rm server2.py
fi
echo "import custom_model_server_monkey_patch" > server2.py
cat server.py >> server2.py

export PYTHONPATH=$PYTHONPATH:../

VENV_PATH=
source $VENV_PATH/bin/activate
python server2.py --chat --listen
```
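
The prepend trick works because Python executes a module's top-level code at import time, so the patch can swap attributes before the rest of server.py uses them. A minimal illustration of the pattern only; the real contents of custom_model_server_monkey_patch are not shown here, and the patched module and function below are assumptions:

```
# Illustrative shape of a prepend-import monkey patch. The target
# module and attribute are assumptions, not the actual patch contents.
import importlib

target = importlib.import_module("modules.text_generation")  # assumed target

def patched_generate(*args, **kwargs):
    # A real patch would forward the request to the model server
    # instead of running the model in-process.
    raise NotImplementedError

# Replace the attribute before server.py's own code calls it.
target.generate_reply = patched_generate
```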

<b>Note:</b>
* quant_attn only supports torch 2.0+
* lora support is limited to simple loras that use only q_proj and v_proj
* this patch breaks the model selection, lora selection, and training features in the webui

# Docker

## Quick start for running the chat UI

It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.

* Removed bitsandbytes from requirements
* Added pip-installable branch based on winglian's PR
* Added cuda backend quant attention and fused mlp from GPTQ_For_Llama.
* Added lora patch for the GPTQ_For_Llama repo triton backend.<br>
Usage (see the fuller sketch after this list):
```
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
inject_lora_layers(model, lora_path, device, dtype)
```
* Added a model server for better inference performance with the webui (about 40% faster than the original webui, which runs the model and gradio in the same process)
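
As referenced above, a slightly fuller sketch of where inject_lora_layers could sit in a loading script. The loader load_llama_model_4bit_low_ram and its signature are assumptions based on this repo's scripts, and the placeholder paths must be filled in:

```
# Hypothetical surrounding code; only inject_lora_layers comes from the
# usage snippet above. The loader name and signature are assumptions.
import torch
from autograd_4bit import load_llama_model_4bit_low_ram
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers

config_path = ""  # fill in: model config path
model_path = ""   # fill in: 4-bit checkpoint path
lora_path = ""    # fill in: lora checkpoint path

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128)
inject_lora_layers(model, lora_path, device="cuda", dtype=torch.float16)
```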

# Requirements
gptq-for-llama <br>