Merge readmes.

parent e854f5d111
commit 2e5aaf6dd6

README.md

# Run LLM chat in realtime on an 8GB NVIDIA GPU

# Alpaca Lora 4bit

Made some adjustments to the code in peft and GPTQ-for-LLaMa to make LoRA finetuning possible with a 4-bit base model. The same adjustments can be made for 2, 3, and 8 bits.

## Dockerfile for alpaca_lora_4bit

This repo is a Dockerfile wrapper for https://github.com/johnsmith0031/alpaca_lora_4bit

## Use

Run real-time LLM chat using Alpaca on an 8GB NVIDIA/CUDA GPU (e.g. a 3070 Ti mobile).

## Requirements

- Linux
- Docker
- NVIDIA GPU with a driver version that supports CUDA 11.7+ (e.g. 525); a quick sanity check follows this list

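If you want to verify the GPU and Docker setup before building, a minimal sanity check might look like this (the CUDA image tag is only an illustrative example; any CUDA 11.7+ base image works):

```
nvidia-smi                  # confirm the driver version (e.g. 525 or newer)
docker run --rm --gpus=all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi
```
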
## Installation

## Quick start for running the chat UI

```
git clone https://github.com/andybarry/alpaca_lora_4bit_docker.git
docker build -t alpaca_lora_4bit .
DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit .
docker run --gpus=all -p 7860:7860 alpaca_lora_4bit
```

Point your browser to http://localhost:7860

It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.

The model isn't all that good; sometimes it goes crazy. But hey, as I always say, "when 4-bits _you reach_ look as good, you will not."

# Development

* Install manual by s4rduk4r: https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md (**NOTE:** don't use the install script; use requirements.txt instead.)
* Also remember to create a venv if you do not want your existing packages to be overwritten; a minimal example follows this list.

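For instance, a minimal venv setup before installing anything (assuming a recent Python 3):

```
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
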
# Update Logs

* Resolved a numerically unstable issue.
* Reconstructing the fp16 matrix from the 4-bit data and calling torch.matmul greatly increased inference speed.
* Added an install script for Windows and Linux.
* Added gradient checkpointing. A 30B model can now be finetuned in 4-bit on a single GPU with 24GB VRAM with gradient checkpointing enabled (finetune.py updated). It reduces training speed, so this option is not needed if you have enough VRAM.
* Added install manual by s4rduk4r.
* Added pip install support by sterlind, preparing to merge changes upstream.
* Added V2 model support (with groupsize, both inference and finetune).
* Added finetune options: use eos_token instead of padding by default, and resume_checkpoint to continue training.
* Added offload support; the load_llama_model_4bit_low_ram_and_offload_to_cpu function can be used.
* Added a monkey patch for the text generation webui to fix the initial eos token issue.
* Added Flash Attention support (use --flash-attention).
* Added a Triton backend to support models using groupsize and act-order (use --backend=triton).

## References

- https://github.com/johnsmith0031/alpaca_lora_4bit
- https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md
- https://github.com/tloen/alpaca-lora

# Requirements

gptq-for-llama<br>
peft<br>
The specific versions are inside requirements.txt<br>

# Install

~copy the files from GPTQ-for-LLaMa into the GPTQ-for-LLaMa path and re-compile the cuda extension~<br>
~copy peft/tuners/lora.py into the peft path, replacing the original~<br>

**NOTE:** Install scripts are no longer needed! requirements.txt now pulls from forks with the necessary patches.

```
pip install -r requirements.txt
```

# Finetune

~The same finetune script from https://github.com/tloen/alpaca-lora can be used.~<br>

After installation, this script can be used.

GPTQv1:

```
python finetune.py
```
or
```
GPTQ_VERSION=1 python finetune.py
```

GPTQv2:

```
GPTQ_VERSION=2 python finetune.py
```

# Inference

After installation, this script can be used:

```
python inference.py
```

# Text Generation Webui Monkey Patch

Clone the latest version of text-generation-webui and copy all the files into ./text-generation-webui/:

```
git clone https://github.com/oobabooga/text-generation-webui.git
```

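The copy step isn't spelled out here; a hedged sketch, assuming the patch files sit at the top level of this repo, would be:

```
# copy the monkey-patch and support files into the cloned webui (adjust to this repo's actual layout)
cp ./*.py ./text-generation-webui/
```
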
Open server.py and insert a line at the beginning:

```
import custom_monkey_patch # apply monkey patch
import gc
import io
...
```

Use this command to run it:

```
python server.py
```

# Flash Attention

It seems that we can apply a monkey patch to the llama model. To use it, simply download the file from [MonkeyPatch](https://github.com/lm-sys/FastChat/blob/daa9c11080ceced2bd52c3e0027e4f64b1512683/fastchat/train/llama_flash_attn_monkey_patch.py). flash-attention is also required, and it currently does not support PyTorch 2.0.

Just add --flash-attention to use it for finetuning.

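Putting that together, a possible sequence (the raw URL is derived from the blob link above; the flash-attn package name and the finetune.py flag are taken as stated in this README, but treat the whole thing as a sketch):

```
pip install flash-attn
wget https://raw.githubusercontent.com/lm-sys/FastChat/daa9c11080ceced2bd52c3e0027e4f64b1512683/fastchat/train/llama_flash_attn_monkey_patch.py
python finetune.py --flash-attention
```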