# Run LLM chat in real time on an 8GB NVIDIA GPU

Dockerfile for alpaca_lora_4bit, based on https://github.com/johnsmith0031/alpaca_lora_4bit.
## Usage

Run real-time LLM chat with an Alpaca model on an 8GB NVIDIA/CUDA GPU (e.g., a 3070 Ti mobile).
## Requirements
- Linux
- Docker
- NVIDIA GPU with a driver version that supports CUDA 11.7+ (e.g., 525)
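Before building, you can sanity-check that Docker can see your GPU. This is a minimal check, assuming the NVIDIA Container Toolkit is installed and that the `nvidia/cuda:11.7.1-base-ubuntu22.04` image tag is still published:

```bash
# Confirm the host driver is new enough for CUDA 11.7 (>= 515 on Linux).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Confirm Docker can pass the GPU through to a CUDA container.
docker run --rm --gpus=all nvidia/cuda:11.7.1-base-ubuntu22.04 nvidia-smi
```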
## Installation

```bash
git clone https://github.com/andybarry/alpaca_lora_4bit_docker.git
cd alpaca_lora_4bit_docker
docker build -t alpaca_lora_4bit .
docker run --gpus=all -p 7860:7860 alpaca_lora_4bit
```
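If you'd rather keep the container running in the background, a detached variant works too (the container name here is just an example):

```bash
# Run detached with a name, then follow the startup logs.
docker run -d --name alpaca_chat --gpus=all -p 7860:7860 alpaca_lora_4bit
docker logs -f alpaca_chat
```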
Then point your browser to http://localhost:7860.
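To check from a terminal that the web UI is up (the first response can take a while, since the model loads at startup), something like this should eventually print an HTTP 200:

```bash
# Expect 200 once the Gradio server is serving the chat UI.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860
```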
## Results
Generation is fast on a 3070 Ti mobile GPU and uses 5-6 GB of GPU RAM.
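If you want to watch the memory usage yourself while chatting, nvidia-smi can poll it once per second:

```bash
# Poll GPU memory usage every second while the chat is running.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```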
The model isn't all that good; sometimes it goes crazy. But hey, as I always say, "when 4-bits you reach, look this good you will not."
