oobabooga/text-generation-webui is a front end for running Large Language Models on local hardware.
- GPTQ (GPU)
- llama.cpp (GGML, now superseded by GGUF - CPU)
https://www.youtube.com/watch?v=k2FHUP0krqg - walkthrough with Matthew Berman
I used the start_linux.sh script to get it working, rather than his more manual python server.py route, although I could run server.py straight from VSCode just fine.
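For reference, a sketch of what the manual launch looks like once the dependencies are installed (the --cpu flag mirrors the CPU-only option I picked in the installer):

```bash
# launch the server directly instead of via start_linux.sh
conda activate textgen
python server.py --cpu
```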
```bash
# cd ~ - doing it from the WSL side, as I had strange filesystem errors from c:/dev/test
# note: the installer version refers to the version of Python you have
# https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.5.2-0-Linux-x86_64.sh
conda update -n base -c defaults conda
# conda create -n textgen python=3.10.9

git clone https://github.com/oobabooga/text-generation-webui.git

# this worked in ~/, without conda started; I selected CPU only
./start_linux.sh

# this may just tick the box in the UI to run CPU only
./start_linux.sh --cpu

# to update
./update_linux.sh

# to delete and start again, remove the `installer_files` directory
```
And here is the manual install method, which didn't work for me:
```bash
# manual instructions:
# https://github.com/oobabooga/text-generation-webui#manual-installation-using-conda
conda create -n textgen python=3.11
conda activate textgen
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install -r requirements_cpu_only.txt

# not working - getting the error below
python3 server.py
# Traceback (most recent call last):
#   File "/mnt/c/dev/textgen/server.py", line 5, in <module>
#     from modules.block_requests import OpenMonkeyPatch, RequestBlocker
#   File "/mnt/c/dev/textgen/modules/block_requests.py", line 4, in <module>
#     import requests
# ModuleNotFoundError: No module named 'requests'
```
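The ModuleNotFoundError suggests the pip installs went into a different environment (or Python) than the one server.py ran under - and note the traceback is under /mnt/c, where I'd already seen filesystem weirdness. A hedged fix sketch:

```bash
# possible fix: make sure the textgen env is active, then reinstall the deps into it
conda activate textgen
pip3 install -r requirements_cpu_only.txt
# or, at minimum, install the missing module
pip3 install requests
```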
```bash
# to clone models from Hugging Face (or just use the TextGenWebUI downloader)
sudo apt-get install git-lfs
git lfs install
```
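For example, cloning the whole repo for the chat model noted below (this pulls every quant variant, which is many gigabytes; grabbing a single .gguf file via the web UI is lighter):

```bash
# clone a model repo from Hugging Face (needs git-lfs, installed above)
git clone https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
```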
Since I'm running on CPU, I have to look for CPU-compatible models only:
- GGML (now superseded by GGUF) - good for CPU-only use. Only GGUF works for me.
- GPTQ - GPU only.
TheBloke/Llama-2-7b-Chat-GGUF - works!
https://huggingface.co/TheBloke/Llama-2-7B-GGUF - what is the difference between the Chat variant above and this? (The Chat variant is fine-tuned for dialogue; this is the raw base completion model.)
https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF - what is the Instruct variant of this? (Mistral-7B-Instruct is the same base model fine-tuned to follow instructions.)
Quantization, aka Language Model Quantization - compressing or reducing the size of an LLM by storing its weights at lower precision.
Go for at least Q4 (4-bit) quantization.
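Back-of-envelope arithmetic (my numbers, not measured): a 7B model at FP16 is about 7e9 × 2 bytes ≈ 14 GB, while Q4_0 stores roughly 4.5 bits per weight including scales, so about 7e9 × 4.5 / 8 ≈ 3.9 GB.

```bash
# sanity-check the estimate against the actual download
# (path assumes the model landed in text-generation-webui's models/ folder)
ls -lh models/llama-2-7b-chat.Q4_0.gguf
```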
Test prompt: "please summarise:" followed by chapter 1 of Moby Dick.
llama-2-7b-chat.Q4_0.gguf - 192 seconds at 0.05 tokens/sec, and it failed. Shorter pieces of text work. ChatGPT-4 did it in a few seconds!
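To separate raw model speed from any UI overhead, the same file can be timed with llama.cpp's own CLI, which prints tokens/sec timings at the end of a run (a sketch; the path and thread count are assumptions for my machine):

```bash
# benchmark CPU generation directly with llama.cpp
./main -m models/llama-2-7b-chat.Q4_0.gguf \
  -p "please summarise: <paste chapter 1 here>" \
  -n 256 -t 8   # generate 256 tokens using 8 CPU threads
# llama.cpp prints eval timings (ms per token, tokens per second) when it finishes
```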
- explore different models
- look at the video from Matthew Berman - what are his tests?
- clean out all models from the directory?
- how to train a model?
https://huggingface.co/ehartford/dolphin-2.0-mistral-7b - a model based on Mistral-7B, fine-tuned on the Dolphin dataset (an uncensored dataset).
RunPod to rent a GPU.
Let's see if I can run this model locally (sketch below).
It apparently performs better than LLaMA 2.
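To try it on CPU I'd want a GGUF conversion rather than the full-precision weights. A hedged sketch using text-generation-webui's download-model.py helper (the TheBloke/dolphin-2.0-mistral-7B-GGUF repo name is my assumption; check Hugging Face first):

```bash
# download an (assumed) GGUF conversion into text-generation-webui's models dir
python download-model.py TheBloke/dolphin-2.0-mistral-7B-GGUF
```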
https://ollama.ai/ to run models locally
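A quick sketch of the ollama route (the install one-liner is the one from their site, from memory; the model's availability in their library is an assumption):

```bash
# install ollama on Linux, then pull and run a model in one step
curl -fsSL https://ollama.ai/install.sh | sh
ollama run mistral "Summarise chapter 1 of Moby Dick in three sentences."
```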
LangChain too - Python.
labs.perplexity.ai - very fast implementations of Llama 2.
https://github.com/ggerganov/llama.cpp - 43k stars.
It has bindings for Python, C#, Go, etc., and UIs like the one below.
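Building it from source is just a clone and a make, which produces the main binary used in the benchmark sketch above (CPU-only build; assumes gcc and make are present):

```bash
# build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```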
OOBABOOGA - Gradio web UI for LLMs (text generation)
https://github.com/oobabooga/text-generation-webui, which supports llama.cpp.
Easy to download new models from TheBloke on Hugging Face… it can load quantised versions too, which need less GPU RAM, or use the CPU instead (slower).
A Gradio web UI for Stable Diffusion.