Update readme

John Smith 2023-04-26 17:53:26 +08:00
parent 97804534b9
commit 73f51188bf
1 changed file with 52 additions and 2 deletions


@@ -5,6 +5,55 @@ Made some adjustments to the code in peft and gptq for llama, and made it possible f
pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@winglian-setup_pip
```
# Model Server
Better inference performance with text_generation_webui, about <b>40% faster</b>.
<b>Steps:</b>
1. Run the model server process
2. Run the webui process with the monkey patch
<b>Example:</b>
run_server.sh
```
#!/bin/bash
export PYTHONPATH=$PYTHONPATH:./
# Fill in these paths for your setup before running:
CONFIG_PATH=
MODEL_PATH=
LORA_PATH=
VENV_PATH=
source $VENV_PATH/bin/activate
python ./scripts/run_server.py --config_path $CONFIG_PATH --model_path $MODEL_PATH --lora_path $LORA_PATH --groupsize=128 --quant_attn --port 5555 --pub_port 5556
```
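The `--port`/`--pub_port` pair suggests a request/reply channel plus a publish channel. Purely as a hypothetical sketch of talking to the server (the ZeroMQ transport and the message shape below are assumptions, not this repo's documented API; `scripts/run_server.py` defines the actual protocol):
```
# Hypothetical client sketch. The ZeroMQ transport and JSON message
# shape are assumptions; check scripts/run_server.py for the real protocol.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://localhost:5555")  # matches --port in run_server.sh

sock.send_json({"prompt": "Hello", "max_new_tokens": 32})  # assumed format
print(sock.recv_json())
```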
run_webui.sh
```
#!/bin/bash
# Build a patched copy of webui's server.py with the monkey patch imported first
if [ -f "server2.py" ]; then
    rm server2.py
fi
echo "import custom_model_server_monkey_patch" > server2.py
cat server.py >> server2.py
export PYTHONPATH=$PYTHONPATH:../
# Fill in the path to your virtualenv before running:
VENV_PATH=
source $VENV_PATH/bin/activate
python server2.py --chat --listen
```
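Prepending the import rather than editing server.py in place means `custom_model_server_monkey_patch` executes before any of webui's own code, so it can patch the model-loading path (presumably redirecting it to the model server) while leaving the original server.py untouched.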
<b>Note:</b>
* quant_attn only supports torch 2.0+ (a version-guard sketch follows this list)
* lora support is limited to simple LoRA adapters that target only q_proj and v_proj
* this patch breaks the model selection, lora selection, and training features in the webui
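For the first note, a minimal version-guard sketch (the helper name is ours, not part of the repo):
```
# Minimal sketch: check for torch 2.0+ before enabling quant_attn.
import torch
from packaging import version

def quant_attn_supported() -> bool:
    return version.parse(torch.__version__) >= version.parse("2.0")
```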
# Docker
## Quick start for running the chat UI
```
@@ -43,12 +92,13 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
* Removed bitsandbytes from requirements
* Added pip installable branch based on winglian's PR
* Added cuda backend quant attention and fused mlp from GPTQ_For_Llama.
* Added lora patch for the GPTQ_For_Llama repo's triton backend (a fuller sketch follows this list).<br>
Usage:
```
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
inject_lora_layers(model, lora_path, device, dtype)
```
* Added model server for better inference performance with the webui (about 40% faster than the original webui, which runs the model and gradio in the same process)
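A slightly fuller, hedged sketch of the LoRA injection flow; the `load_llama_model_4bit_low_ram` import and its signature are assumptions based on this repo's autograd_4bit module, and the paths are placeholders:
```
import torch
# Assumed loader from this repo's autograd_4bit module; verify the name
# and signature in your checkout.
from autograd_4bit import load_llama_model_4bit_low_ram
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers

config_path = "./llama-13b-4bit/"            # placeholder
model_path = "./llama-13b-4bit.safetensors"  # placeholder
lora_path = "./alpaca-lora-4bit/"            # placeholder

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128)
inject_lora_layers(model, lora_path, device="cuda", dtype=torch.float16)
```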
# Requirements
gptq-for-llama <br>