# Alpaca Lora 4bit

Made some adjustments to the code in peft and GPTQ-for-LLaMa to make LoRA fine-tuning possible with a 4-bit base model. The same adjustments can be made for 2-, 3-, and 8-bit models.

## Quick start for running the chat UI

```sh
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit . # build step can take 12 min
docker run --gpus=all -p 7860:7860 alpaca_lora_4bit
```

Point your browser to http://localhost:7860
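If your model files live on the host, a bind mount keeps them out of the image. The container path below is an assumption based on the paths this repo's Dockerfile uses; adjust it to wherever your image expects models:

```sh
# optional: mount a host models directory into the container
# (the container path is an assumption; adjust to your image layout)
docker run --gpus=all -p 7860:7860 \
  -v "$(pwd)/models:/alpaca_lora_4bit/text-generation-webui/models" \
  alpaca_lora_4bit
```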

## Results

Inference is fast on a mobile RTX 3070 Ti and uses 5-6 GB of GPU RAM.

## Development

### Update Logs

- Resolved a numerical instability issue.
- Reconstructing the fp16 matrix from the 4-bit data and calling torch.matmul greatly increased inference speed.
- Added install scripts for Windows and Linux.
- Added gradient checkpointing: a 30B model can now be fine-tuned in 4-bit on a single GPU with 24 GB of VRAM (finetune.py updated). It reduces training speed, so skip this option if you have enough VRAM. A sketch of the underlying technique follows this list.
- Added an install manual by s4rduk4r.
- Added pip install support by sterlind, preparing to merge changes upstream.
- Added V2 model support (with groupsize, both inference and finetune).
- Added finetune options: use eos_token instead of padding by default, and resume_checkpoint to continue training.
- Added offload support; the load_llama_model_4bit_low_ram_and_offload_to_cpu function can be used.
- Added a monkey patch for the text generation webui to fix the initial EOS token issue.
- Added Flash Attention support (use --flash-attention).
- Added a Triton backend for models using groupsize and act-order (use --backend=triton).
- Added g_idx support in the CUDA backend (requires recompiling the CUDA kernel).
- Added xformers support.
- Removed triton and flash-attn from requirements.txt for compatibility.
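As referenced in the gradient-checkpointing item above, the underlying technique is recomputing activations during the backward pass instead of storing them. A minimal, generic sketch with torch.utils.checkpoint (for orientation only, not the repo's gradient_checkpointing.py):

```python
# generic gradient checkpointing: activations inside `block` are not stored
# during forward; they are recomputed when backward reaches the checkpoint
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)

x = torch.randn(8, 512, requires_grad=True)  # an input must require grad
y = checkpoint(block, x)   # forward pass, intermediates discarded
y.sum().backward()         # block re-runs here to rebuild intermediates
```

This trades extra compute for memory, which is why the item above notes a training-speed cost.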

## Requirements

- GPTQ-for-LLaMa
- peft

The specific versions are pinned in requirements.txt.

## Install

~~Copy files from GPTQ-for-LLaMa into the GPTQ-for-LLaMa path and re-compile the cuda extension.~~

~~Copy peft/tuners/lora.py into the peft path, replacing the original file.~~

NOTE: Install scripts are no longer needed! requirements.txt now pulls from forks with the necessary patches.

```sh
pip install -r requirements.txt
```

## Finetune

~~The same finetune script from https://github.com/tloen/alpaca-lora can be used.~~

After installation, this script can be used.

GPTQv1:

```sh
python finetune.py
```

or

```sh
GPTQ_VERSION=1 python finetune.py
```

GPTQv2:

```sh
GPTQ_VERSION=2 python finetune.py
```
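The GPTQ_VERSION switch shown above is just an environment variable; a sketch of how the script presumably reads it (only the variable name comes from the commands above, the parsing logic is an assumption):

```python
# sketch: dispatching on the GPTQ_VERSION environment variable
# (the variable name is from the README; the parsing logic is assumed)
import os

use_v2 = os.environ.get("GPTQ_VERSION", "1") == "2"
print("GPTQ v2 (groupsize-aware) path" if use_v2 else "GPTQ v1 path")
```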

## Inference

After installation, this script can be used:

```sh
python inference.py
```
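A rough sketch of what inference.py does, for orientation: the loader name load_llama_model_4bit_low_ram and its (config_path, model_path) signature are assumptions inferred from the offload variant mentioned in the update log, and the file paths are placeholders:

```python
# sketch of 4-bit inference (loader name/signature and paths are assumptions)
import torch
from autograd_4bit import load_llama_model_4bit_low_ram

config_path = './llama-13b-4bit/'    # HF config + tokenizer directory
model_path = './llama-13b-4bit.pt'   # 4-bit quantized checkpoint

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)

batch = tokenizer("Tell me a fact about penguins.", return_tensors='pt')
batch = {k: v.to(model.device) for k, v in batch.items()}
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```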

## Text Generation Webui Monkey Patch

Clone the latest version of text-generation-webui and copy all the files into ./text-generation-webui/:

```sh
git clone https://github.com/oobabooga/text-generation-webui.git
```

Open server.py and insert this line at the beginning:

```python
import custom_monkey_patch # apply monkey patch
import gc
import io
...
```
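What the patch does, in brief: custom_monkey_patch.py swaps peft's LoRA model class for the repo's GPTQ-aware one before the webui loads anything. The import path below is the repo's own; the comment about ordering is an assumption:

```python
# the core of custom_monkey_patch.py: replace peft's LoRA model with the
# GPTQ-aware variant (presumably before any peft model is constructed)
from monkeypatch.peft_tuners_lora_monkey_patch import (
    replace_peft_model_with_gptq_lora_model,
)

replace_peft_model_with_gptq_lora_model()
```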

Then run:

```sh
python server.py
```

## Flash Attention

It seems that a monkey patch can be applied to the llama model. To use it, simply download the file from MonkeyPatch. flash-attention is also required, and it currently does not support PyTorch 2.0. Add --flash-attention to use it for fine-tuning.
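For context, a monkey patch of this kind typically replaces the llama attention forward on the class itself, so every layer picks it up. A pattern sketch only, with a pass-through body standing in for the real flash-attention computation:

```python
# the attention-hijack pattern (sketch; the pass-through body stands in for
# a real flash-attention forward)
import transformers.models.llama.modeling_llama as llama_mod

_orig_forward = llama_mod.LlamaAttention.forward

def patched_forward(self, *args, **kwargs):
    # a real patch would compute attention with flash-attention kernels here
    return _orig_forward(self, *args, **kwargs)

llama_mod.LlamaAttention.forward = patched_forward
```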

## Xformers

Install:

```sh
pip install xformers
```

Usage:

```python
from monkeypatch.llama_attn_hijack_xformers import hijack_llama_attention
hijack_llama_attention()
```
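Internally the hijack presumably routes attention through xformers' fused kernel; the primitive itself looks like this (shapes are (batch, seq, heads, head_dim); CUDA and fp16 assumed):

```python
# the xformers primitive the hijack presumably builds on
import torch
import xformers.ops as xops

q = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device='cuda', dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # fused, memory-efficient attention
print(out.shape)  # torch.Size([1, 128, 8, 64])
```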