diff --git a/README.md b/README.md
index dbbb6ea..898939a 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,55 @@ Made some adjust for the code in peft and gptq for llama, and make it possible f
pip install git+https://github.com/johnsmith0031/alpaca_lora_4bit@winglian-setup_pip
```
+# Model Server
+
+Better inference performance with text_generation_webui: about 40% faster than the stock webui, which runs the model and gradio in the same process.
+
+Steps:
+1. Run the model server process.
+2. Run the webui process with the monkey patch applied.
+
+Example:
+
+run_server.sh
+```
+#!/bin/bash
+
+export PYTHONPATH=$PYTHONPATH:./
+
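+# Fill in the config, model, and (optional) lora paths before running.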
+CONFIG_PATH=
+MODEL_PATH=
+LORA_PATH=
+
+VENV_PATH=
+source $VENV_PATH/bin/activate
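+# Note: --quant_attn requires torch 2.0+ (see Notes below).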
+python ./scripts/run_server.py --config_path $CONFIG_PATH --model_path $MODEL_PATH --lora_path $LORA_PATH --groupsize=128 --quant_attn --port 5555 --pub_port 5556
+```
+
+run_webui.sh
+```
+#!/bin/bash
+
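+# Rebuild server2.py: server.py with the monkey-patch import prepended,
+# so the patch is applied before any of server.py's own imports run.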
+if [ -f "server2.py" ]; then
+ rm server2.py
+fi
+echo "import custom_model_server_monkey_patch" > server2.py
+cat server.py >> server2.py
+
+export PYTHONPATH=$PYTHONPATH:../
+
+VENV_PATH=
+source $VENV_PATH/bin/activate
+python server2.py --chat --listen
+```
+
+Notes:
+* quant_attn is only supported on torch 2.0+
+* lora support covers only simple loras with just q_proj and v_proj
+* this patch breaks the model selection, lora selection, and training features in the webui
+
+# Docker
+
## Quick start for running the chat UI
```
@@ -43,12 +92,13 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.
* Removed bitsandbytes from requirements
* Added pip installable branch based on winglian's PR
* Added cuda backend quant attention and fused mlp from GPTQ_For_Llama.
-* Added lora patch for GPTQ_For_Llama triton backend.
-
+* Added lora patch for the triton backend of the GPTQ_For_Llama repo.
+Usage:
```
from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
inject_lora_layers(model, lora_path, device, dtype)
```
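+
+For example, a minimal call (the lora path, device, and dtype below are placeholder values):
+```
+import torch
+from monkeypatch.gptq_for_llala_lora_monkey_patch import inject_lora_layers
+
+# Placeholder path/device/dtype; pass your already-loaded quantized model.
+inject_lora_layers(model, './loras/alpaca-lora-7b', 'cuda:0', torch.float16)
+```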
+* Added model server for better inference performance with the webui (about 40% faster than the original webui, which runs the model and gradio in the same process).
# Requirements
gptq-for-llama