*A working LLM on my local machine.*

oobabooga/text-generation-webui is a front end for running Large Language Models on local hardware.

An LLM is a type of AI system designed to understand, generate, and interact with human language, e.g. OpenAI's GPT-4, Google's PaLM (used in Bard), Meta's LLaMA, and Anthropic's Claude 2.

TextGenWebUI supports:

  • Transformers
  • AWQ
  • EXL2
  • llama.cpp (GGML, now superseded by GGUF - runs on CPU)
  • LLaMA - set up following Matthew Berman's video

I used the one-click script to get it working rather than his more manual Python install, although running it from VSCode worked just fine.

# cd ~ - doing it from the WSL side, as I had strange filesystem errors from c:/dev/test

# note: the version refers to the version of Python you have


conda update -n base -c defaults conda

# conda create -n textgen python=3.10.9

git clone https://github.com/oobabooga/text-generation-webui

# this worked in ~/, without conda activated
# selected CPU only

# this may just tick the CPU-only box in the UI
./start_linux.sh --cpu

# to update, run the repo's update script

# to delete and start again,
# remove the `installer_files` directory

And here is the manual install method, which didn't work for me.

# use manual instructions

conda create -n textgen python=3.11
conda activate textgen

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

pip3 install -r requirements_cpu_only.txt

# not working - getting the error below
# Traceback (most recent call last):
#  File "/mnt/c/dev/textgen/", line 5, in <module>
#    from modules.block_requests import OpenMonkeyPatch, RequestBlocker
#  File "/mnt/c/dev/textgen/modules/", line 4, in <module>
#    import requests
#ModuleNotFoundError: No module named 'requests'
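A `ModuleNotFoundError` like this usually means the requirements were installed into a different Python environment than the one launching server.py. A quick sanity check (a sketch, not specific to this repo):

```shell
# Print which interpreter is actually running -- if this isn't the textgen
# conda env's Python, the earlier pip install went to the wrong place.
python3 -c "import sys; print(sys.executable)"

# Then reinstall into that exact interpreter (uncomment to run):
# python3 -m pip install -r requirements_cpu_only.txt
```

Using `python3 -m pip` rather than bare `pip3` guarantees the packages land in the interpreter you just checked.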


# to clone models from Hugging Face manually, or just download them through the TextGenWebUI
sudo apt-get install git-lfs
git lfs install

Since I'm running on CPU, I have to look for CPU-compatible models only:

  • GGML (now superseded by GGUF) - good for CPU only. Only GGUF works for me.
  • GPTQ - GPU only


aka Language Model Quantization - compressing an LLM by storing its weights at lower precision, which reduces its file size and memory use.

Go for at least Q4 quantization
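Rough arithmetic shows why Q4 is attractive for CPU use. Assuming roughly 4.5 bits per weight for Q4_0 (4-bit weights plus per-block scaling overhead - an approximation, not an exact figure), a 7B-parameter model shrinks to about 4 GB:

```shell
# Back-of-envelope size of a 7B-parameter model at Q4_0 (~4.5 bits/weight assumed)
awk 'BEGIN { params = 7e9; bits = 4.5; printf "%.1f GB\n", params * bits / 8 / 1e9 }'
```

That lines up with the ~4 GB download size of 7B Q4_0 GGUF files; the same model at FP16 (16 bits/weight) would be around 14 GB.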

good notes here


please summarise: (chapter 1 of Moby Dick)

llama-2-7b-chat.Q4_0.gguf - 192 secs at 0.05 tokens/sec, and it failed. Shorter pieces of text work. ChatGPT-4 did it in a few seconds!
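That throughput figure explains the failure: at 0.05 tokens/sec, 192 seconds only produces about 10 tokens - nowhere near enough output for a chapter summary:

```shell
# tokens generated = tokens/sec * elapsed seconds
awk 'BEGIN { printf "%.0f tokens in %d secs\n", 0.05 * 192, 192 }'
```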


explore different models

look at Matthew Berman's video - what are his tests?

clean out all models from directory?
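On the clean-out question: downloaded models sit under the repo's `models/` directory (path assumed from the default layout), so listing them by size shows what's worth deleting:

```shell
# List downloaded models by size, biggest first
du -sh models/* 2>/dev/null | sort -rh
```

`rm -rf models/<name>` then removes a single model.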

how to train a model?

xx - a model based on the text-generation model Mistral-7B, fine-tuned on the Dolphin dataset (an uncensored dataset).

RunPod to rent a GPU



Mistral 7B

Let's see if I can run this model locally.

Apparently it performs better than LLaMA 2 and is practical to run locally.

LLaMA.cpp (43k stars) - a very fast C/C++ implementation of LLaMA inference; there is LangChain integration for Python too.

It has bindings for Python, C#, Go, etc., and UIs like the ones below:

OOBABOOGA - a Gradio web UI for LLM text generation; 25k stars, and it supports llama.cpp.

Easy to download new models from TheBloke on Hugging Face… it can load quantised versions too for less GPU RAM, or use the CPU instead (slower).
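The memory trade-off behind those quantised versions, as rough weights-only arithmetic for a 7B model (ignoring KV cache and runtime overhead):

```shell
# Approximate weights-only memory for a 7B model at different precisions:
# FP16, 8-bit, and ~4.5 bits/weight for Q4_0 (assumed figure)
for bits in 16 8 4.5; do
  awk -v b="$bits" 'BEGIN { printf "%4.1f bits/weight -> %4.1f GB\n", b, 7e9 * b / 8 / 1e9 }'
done
```

So a card with 8 GB of VRAM that can't hold the FP16 weights can comfortably fit the Q4 version.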

A Gradio web UI for Stable Diffusion - 100k stars.