From 2e5aaf6dd6b293bf24d7e6da0080e19473398e5d Mon Sep 17 00:00:00 2001
From: Andy Barry
Date: Sat, 8 Apr 2023 01:14:54 -0400
Subject: [PATCH] Merge readmes.

---
 README.md | 104 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 85 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 252f4d1..dc879ab 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,11 @@
-# Run LLM chat in realtime on an 8GB NVIDIA GPU
+# Alpaca Lora 4bit
+Made some adjustments to the code in peft and GPTQ-for-LLaMa to make LoRA finetuning possible with a 4-bit base model. The same adjustment can be made for 2, 3 and 8 bits.

-## Dockerfile for alpaca_lora_4bit
-This repo is a Dockerfile wrapper for https://github.com/johnsmith0031/alpaca_lora_4bit
-
-## Use
-Can run real-time LLM chat using alpaca on a 8GB NVIDIA/CUDA GPU (ie 3070 Ti mobile)
-
-## Requirements
-- Linux
-- Docker
-- NVIDIA GPU with driver version that supports CUDA 11.7+ (e.g. 525)
-
-## Installation
+## Quick start for running the chat UI
```
git clone https://github.com/andybarry/alpaca_lora_4bit_docker.git
-docker build -t alpaca_lora_4bit .
+DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit .
docker run --gpus=all -p 7860:7860 alpaca_lora_4bit
```
Point your browser to http://localhost:7860

@@ -25,12 +15,88 @@ It's fast on a 3070 Ti mobile. Uses 5-6 GB of GPU RAM.

![](alpaca_lora_4bit_penguin_fact.gif)

-The model isn't all that good, sometimes it goes crazy. But hey, as I always say, "when 4-bits _you reach_ look as good, you will not."
+# Development
+* Install manual by s4rduk4r: https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md (**NOTE:** don't use the install script; use requirements.txt instead.)
+* Also remember to create a venv if you do not want the packages to be overwritten.
+
+# Update Logs
+* Resolved a numerical instability issue.
+* Reconstructing the fp16 matrix from the 4-bit data and calling torch.matmul greatly increased inference speed.
+* Added install scripts for Windows and Linux.
+* Added gradient checkpointing. A 30B model can now be finetuned in 4-bit on a single GPU with 24 GB of VRAM with gradient checkpointing enabled (finetune.py updated). It reduces training speed, so it is not needed if you have enough VRAM.
+* Added install manual by s4rduk4r.
+* Added pip install support by sterlind, preparing to merge changes upstream.
+* Added V2 model support (with groupsize, both inference and finetune).
+* Added some finetune options: use eos_token instead of padding by default, and resume_checkpoint to continue training.
+* Added offload support. The load_llama_model_4bit_low_ram_and_offload_to_cpu function can be used.
+* Added a monkey patch for the text generation webui to fix the initial eos token issue.
+* Added Flash Attention support. (Use --flash-attention)
+* Added a Triton backend to support models using groupsize and act-order. (Use --backend=triton)

-## References
+# Requirements
+* gptq-for-llama
+* peft
+
+The specific versions are listed in requirements.txt.
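+
+For illustration only, git-pinned entries of that kind generally take the following shape; the fork URLs and commits below are placeholders, not the actual contents of requirements.txt:
+
+```
+# hypothetical example of git-pinned requirements (placeholders, not the real file)
+git+https://github.com/<fork>/GPTQ-for-LLaMa.git@<commit>
+git+https://github.com/<fork>/peft.git@<commit>
+```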
-- https://github.com/johnsmith0031/alpaca_lora_4bit
-- https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md
-- https://github.com/tloen/alpaca-lora
+
+# Install
+~~copy files from GPTQ-for-LLaMa into GPTQ-for-LLaMa path and re-compile cuda extension~~
+~~copy files from peft/tuners/lora.py to peft path, replace it~~
+
+**NOTE:** Install scripts are no longer needed! requirements.txt now pulls from forks with the necessary patches.
+
+```
+pip install -r requirements.txt
+```
+
+# Finetune
+~~The same finetune script from https://github.com/tloen/alpaca-lora can be used.~~
+
+After installation, this script can be used:
+GPTQv1:
+
+```
+python finetune.py
+```
+or
+```
+GPTQ_VERSION=1 python finetune.py
+```
+
+GPTQv2:
+```
+GPTQ_VERSION=2 python finetune.py
+```
+
+# Inference
+
+After installation, this script can be used:
+
+```
+python inference.py
+```
+
+# Text Generation Webui Monkey Patch
+
+Clone the latest version of the text generation webui and copy all the files into ./text-generation-webui/:
+```
+git clone https://github.com/oobabooga/text-generation-webui.git
+```
+
+Open server.py and insert a line at the beginning:
+```
+import custom_monkey_patch # apply monkey patch
+import gc
+import io
+...
+```
+
+Use this command to run it:
+
+```
+python server.py
+```
+
+# Flash Attention
+
+It seems that a monkey patch can be applied to the llama model. To use it, simply download the file from [MonkeyPatch](https://github.com/lm-sys/FastChat/blob/daa9c11080ceced2bd52c3e0027e4f64b1512683/fastchat/train/llama_flash_attn_monkey_patch.py). The flash-attention package is also needed, and it currently does not support pytorch 2.0.
+Just add --flash-attention to use it for finetuning.
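+
+For reference, a minimal sketch of what applying the patch by hand could look like. This assumes the downloaded file is saved next to finetune.py as llama_flash_attn_monkey_patch.py and that it exposes a replace_llama_attn_with_flash_attn() helper; in this repo, the supported route is simply passing --flash-attention.
+
+```
+# Hypothetical manual use of the flash-attention monkey patch (a sketch, not this repo's code path).
+# Assumes the FastChat file linked above was saved locally as llama_flash_attn_monkey_patch.py
+# and that the flash-attn package is installed.
+from llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
+
+# Patch LLaMA attention before the model is instantiated, so the flash-attention forward is used.
+replace_llama_attn_with_flash_attn()
+
+# ...then load the 4-bit model and run finetuning as usual.
+```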