LLM Open Source Image Analysis - LLaVA

Previously I’ve looked at running an LLM locally on my CPU with TextGenerationWebUI

Also I’ve looked at ChatGPT-4 vision for my use case of:

give a traumatic rating of 1 to 5 (so human rights investigators are warned of graphic images)
describe the image in 5 words
describe the image in 1 sentence in non emotional language

This is an exploration into whether open source models can achieve a similar (good!) performance.

Update 18th Jan 2024 - https://github.com/djhmateer/gpt-vision-api. llava_and_mixtral.py works, but inconsistent results eg json not coming back. Used LMStudio locally. ChatGPT-4 more consistent and higher quality results.

LM Studio

https://lmstudio.ai/ is an easy way to run models locally.

Version 0.2.9 as of 14th Dec 23.

Supports any ggml now called gguf models ie for CPUS. Look HuggingFace eg

LLaVA - Large Language and Vision Assistant

https://llava-vl.github.io/ - LLaVA-v1.5 was trained in Sept 23.

https://huggingface.co/jartine/llava-v1.5-7B-GGUF - updated 14th Dec. Main files 3 weeks ago.

Notice - need to download the vision adapter too

Trying the Q4 model first (smaller) at 4GB. There is a Q8 (8GB) and f16 (13GB)

LLaVA successfully working on CPU with vision adapter.

https://llava.hliu.cc/ live demo which runs on a GPU - very useful for prompt testing.

Python - Q4 - 4GB

Lets try to connect to my locally running model using Python

Notice that I’m referring to the ip address of my windows machine eg 192.168.1.191 as I’m calling Windows from WSL2.

from openai import OpenAI
import base64
import requests
import time

start_time = time.time()

# Point to the local server
# client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# using windows IP as I'm calling Python from WSL2 side
client = OpenAI(base_url="http://192.168.1.191:1234/v1", api_key="not-needed")

# image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Adelie_penguins_in_the_South_Shetland_Islands.jpg/640px-Adelie_penguins_in_the_South_Shetland_Islands.jpg"

# 1.Download the image and encode it to base64
# response = requests.get(image_url)
# base64_image = base64.b64encode(response.content).decode('utf-8')

# 2.Get image locally
image_path = f'pics/hchestnut.jpg'

def encode_image(image_path):
      with open(image_path, "rb") as image_file:
          return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image(image_path)

completion = client.chat.completions.create(
  model="local-model", # not used
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          },
        },
      ],
    }
  ],
  max_tokens=1000,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="",flush=True)

end_time = time.time()
print(f"Time elapsed: {end_time - start_time} seconds")

1:03 to run it from the UI 0:40 to run from python

292seconds - using E(fficiency) cores… not good. Need to maximise LM Studio app to run on P cores too.

Notice I’m thermal throttling.

Core temp and Power Draw graphs are useful.

0:35 is best time.

The image depicts four penguins standing on top of a snow-covered iceberg. They are scattered across the scene, with two close to each other in the center and two others standing apart at different corners of the iceberg. The background features mountains covered in white snow. It appears to be a peaceful moment as the penguins enjoy their frosty habitat

Q8 Model - 7GB

0:58 to run

Q13 Model - 13GB

89s to run

1.Giving a trauma rating

This is where I’m finding ChatGPT4 is much better than LLaVA:

This image is from a war torn area. give a rating from 1 to 5 as to how sensitive it is please.

https://llava.hliu.cc/ this works really well. It uses llava-v1.5-13b on a GPU.

PsiPi/liuhaotian_llava-v1.5-13b-GGUF - lets try this model locally

I want to give a 1 to 5 rating of an image on how traumatic it is to look at. eg if it shows human suffering, upsetting images,

This is to mitigate trauma for human rights researchers

ChatGPT4 was resonable with this prompt: Is this a sensitive picture? please give 1 word summary. give a rating from 1 to 5 as to how sensitive it is please. summerise in 1 sentence

It gave a godo summery like:

Summary: Nature Rating: 1 Sentence: The image shows a close-up of a flowering plant with a cluster of white flowers featuring hints of pink, set against a backdrop of green foliage, and appears to be non-sensitive content.

LLaVA not so good

The image features a tree with little white flowers in the middle of its branches. The flowers are clustered together and look quite attractive. The tree is situated near some green leaves, which add to the overall visual appeal of the scene.

2.Summerise Image

summerise this image seems to work eg

A group of purple flowers blooms in the branches of a tree.

The image showcases a white flowering tree with pink blossoms, creating a beautiful scene. The tree is filled with tiny white flowers and draped in pink flora. These delicate petals are scattered throughout the tree, with some near the top, middle, and lower branches. The vibrant display of flowers creates an appealing and serene atmosphere.

Parameters

Nothing working that well so far.

temp - reduced to 0

Other Vision Models

llava v1.5 7B Q4, Q8, f16

llava v1.5 13B Q2, Q4, Q5

BakLLaVA 7B Q5

obsidian-3B-f16 gave nothing

Mixtral 8x7B - not vision but Apache 2.0 license. https://erichartford.com/dolphin-25-mixtral-8x7b

Prompt Engineering

Could it be that I can get better, and more reliable results with better prompting?

https://promptperfect.jina.ai/

https://www.youtube.com/watch?v=jC4v5AS4RIM

This is in order of importance:

task - mandatory
context - important
exemplar - important
persona - nice
format - nice
tone - nice

I'm a 87kg male, give me a 3-month training program

Task

Always start Task sentence with Action Verb eg Generate, Give, Write, Analyze

Generate a 3 month training program

Or multi task request

Analyse the collected feedbackj Summerize the top 3 takeaways Categorize the rest based on team responsible...

Context

Give just enough information to constrain the endless possibilities

What is the users background?
What does success look like?
What environment are they in?

I'm a 87kg male looking to put on 5 kg of muscle mass over the next 3 months

Exemplars

Gives

Here is the task:

Re-write this bullet point using this structure: I accomplised X by the measure Y that resulted in Z

Exemplar

For example: "I lowered the hospital mortality rate by 10% by educating nurses in new protocols which translates to 200 lives saved per year.

Exemplar 2

Persona

Think of an expert who you would like to ask this question

physiotherapist
hiring manager
senior product marketing manager who is good at story telling

or famous invididuals

act like a legendary investor Warren Buffet
you are the master storyteller Steve Jobs

Format

Close eyes and imagine what you would like the end result to look like.

output in table format with column headers: Feedback, Team, and Priority
output in json format
output in markdown
ouutput in paragraphs

I just received an industry report from my director first give me the 3 key takeaways them summerize based on topic use h2 as section headers here’s the report:

Proofread

Proofread this email below and correct all typos and grammer mistakes. Bold all changes you make.

Tone

Use a casual tone of voice Use a format tone of voice Give me a witty output Show enthusiasm Sound pessimistic

But try and convey tone by describing

I'm writing an email and I want to be taken seriously without coming off as too stuck up and cringy. Ca you please give me a list of 5 tone keywords I can include in a prompt for ChatGPT?

Final example

I’ve found that for giving a trauma rating that ChatGPT4 is very good and is consistently the best.

For open source I’ve found this approach to work well:

LLaVA for image analysis to output a detailed description (jartine/llava 7B Q8_0)
Mixtral 7B for giving a trauma rating (TheBloke/Mixtral 7B Q4_0)

And prompt engineering you can see:

Results are not consistent, but are generally good.

LLaVA

`You are a senior human rights investigator.

You are investigating war crimes and human rights violations.

Analyse this photo and give a long detailed description of it. Give at least 100 words.

You should focus on anything to do with human rights eg war, guns, blood, destruction

Use clear and concise non emotive language in formal tone.`

Mixtral

` Given this text below in quotes which describes a photo

Give a rating from 1 to 5 as to how sensitive / traumatic / violence level for a person to look at it would be.

show me step by step why you got this number

“{gpt_vision_result}”

please output as json `

LLaVA/Mixtral run - notice it thought a crane image (in the UK) pointed to suspicious activity, as I’d pointed it towards this type of analysis

ChatGPT4 run - more consistent results even though my prompt was:

` You are a senior human rights investigator.

You are investigating war crimes and human rights violations in myanmar.

Analyse this photo and give a long summary of it. You should focus on anything to do with human rights eg war, guns, blood, destruction

Give a rating from 1 to 5 as to how sensitive / traumatic / violence level for a person to look at it would be.

For example an image with human deaths or graphic human suffering would be a 5 (very severe), burned out village with property damage would be a 3, destroyed buildings 3, whereas an landscape scene / nothing traumatic would be a 1 (no violence), civilian equipment eg a crane would be a 1. Only give a high rating if it is truly horrible or traumatic - ie a 5 should only be given for human remains etc..

Use clear and concise non emotive language in formal tone.

Output in a valid json format with header: rating, summary, shortsummary `

LM Studio

LLaVA - Large Language and Vision Assistant

Python - Q4 - 4GB

Q8 Model - 7GB

Q13 Model - 13GB

1.Giving a trauma rating

2.Summerise Image

Parameters

Other Vision Models

Prompt Engineering

Task

Context

Exemplars

Exemplar 2

Persona

Format

Proofread

Tone

Final example

Multi-Modal Image Analysis

LLaVA

Mixtral