diff --git a/README.md b/README.md
index dbbb6ea..898939a 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,55 @@ Made some adjust for the code in peft and gptq for llama, and make it possible f
 pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@winglian-setup_pip
 ```
 
+# Model Server
+
+Better inference performance with text_generation_webui, about 40% faster.
+
+Steps:
+1. Run the model server process.
+2. Run the webui process with the monkey patch applied.
+
+Example:
+
+run_server.sh
+```
+#!/bin/bash
+
+export PYTHONPATH=$PYTHONPATH:./
+
+CONFIG_PATH=
+MODEL_PATH=
+LORA_PATH=
+
+VENV_PATH=
+source $VENV_PATH/bin/activate
+python ./scripts/run_server.py --config_path $CONFIG_PATH --model_path $MODEL_PATH --lora_path $LORA_PATH --groupsize=128 --quant_attn --port 5555 --pub_port 5556
+```
+
+run_webui.sh
+```
+#!/bin/bash
+
+if [ -f "server2.py" ]; then
+    rm server2.py
+fi
+echo "import custom_model_server_monkey_patch" > server2.py
+cat server.py >> server2.py
+
+export PYTHONPATH=$PYTHONPATH:../
+
+VENV_PATH=
+source $VENV_PATH/bin/activate
+python server2.py --chat --listen
+```
+
+Note:
+* quant_attn only supports torch 2.0+
+* lora support is limited to simple loras that only use q_proj and v_proj
+* this patch breaks the model selection, lora selection and training features in the webui
+
+# Docker
+
 ## Quick start for running the chat UI
 
 ```
@@ -43,12 +92,13 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
 * Removed bitsandbytes from requirements
 * Added pip installable branch based on winglian's PR
 * Added cuda backend quant attention and fused mlp from GPTQ_For_Llama.
-* Added lora patch for GPTQ_For_Llama triton backend.
-
+* Added lora patch for the GPTQ_For_Llama repo's triton backend.
+Usage:
 ```
 from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
 inject_lora_layers(model, lora_path, device, dtype)
 ```
+* Added Model server for better inference performance with the webui (about 40% faster than the original webui, which runs the model and gradio in the same process)
 
 # Requirements
 gptq-for-llama
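As a rough illustration of the lora patch usage shown in the diff, here is a minimal sketch of loading a 4-bit checkpoint and injecting the lora layers. The `load_llama_model_4bit_low_ram` loader from `autograd_4bit` and all paths below are assumptions/placeholders, not part of this change; adjust them to your own setup.

```
# Minimal sketch (not from this diff): load a 4-bit llama checkpoint, then
# apply the GPTQ_For_Llama triton-backend lora patch in place.
# load_llama_model_4bit_low_ram and the paths are assumed/placeholder values.
import torch
from autograd_4bit import load_llama_model_4bit_low_ram
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers

config_path = './llama-7b-4bit/'                  # placeholder: config/tokenizer dir
model_path = './llama-7b-4bit-128g.safetensors'   # placeholder: quantized weights
lora_path = './alpaca-lora/'                      # placeholder: lora with only q_proj/v_proj

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path, groupsize=128)
inject_lora_layers(model, lora_path, 'cuda', torch.float16)
```

After injection the model is used for generation as usual; per the note above, only simple loras touching q_proj and v_proj are supported by this patch.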