ChatGPT-4 Vision
Update 15th Jan 2024 - this is a great API and describes images very well for my use case. I’m comparing it with open-source models - LLaVA and Mixtral 7B.
Describe an Image
I’m working in human rights investigations. To mitigate trauma for myself and other investigators, I want the AI to describe in non-emotional terms what is in the image.
could you give a 1 sentence general description of this image please. Also identify objects, people, scenes. Can you also tell me if this would be classified as a traumatic picture for someone to look at
Simple Description in 5 words
describe this image in 5 words
describe this image in 1 sentence using non emotional language
Traumatic Rating
give traumatic rating on a scale of 1 - 5
Even these 2 very simple queries are super useful.
ChatGPT-4 API
See https://platform.openai.com/ and the vision documentation.
# I couldn't get conda working, so I'm using pipenv (see previous blog article on virtual envs for more detail)
# created a new Pipfile defaulting to python 3.11.6 (my base is 3.10.12)
pipenv install
# to explore inside the shell
pipenv shell
# run the program
pipenv run python foo.py
# add to the Pipfile under [packages]
openai = "*"
# get new dependencies
pipenv update
# remember in VSCode to select the interpreter inside the pipenv environment
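For reference, a minimal Pipfile for this setup might look something like this (illustrative, not copied from the project - it just lists the packages the examples below use):

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
openai = "*"
python-dotenv = "*"
requests = "*"
gspread = "*"
loguru = "*"

[requires]
python_version = "3.11"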
First program:
from openai import OpenAI

# no explicit key needed here: pipenv loads the .env file automatically,
# so OpenAI() picks up OPENAI_API_KEY from the environment
client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
    ]
)

message = completion.choices[0].message
content = message.content

# print renders the \n characters in the content as real line feeds
print(content)
https://platform.openai.com/usage - I can see that running this a few times costs a tiny amount of money, e.g. $0.01.
Vision
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
Output is
The image shows a wooden boardwalk pathway extending into the distance through a lush green wetland area or meadow. The grass and vegetation on either side are tall and dense, indicating a healthy, natural habitat that could be home to a variety of wildlife. The sky is predominantly blue with some scattered white clouds. The lighting suggests it could be late afternoon or early evening when the photo was taken, due to the soft glow and long shadows. The scenery looks peaceful and serene, ideal for a nature walk or to simply enjoy the outdoor environment
The cost was around $0.015 - it took 1,226 tokens for this one request.
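Most of those tokens are the image itself. The image_url object also accepts an optional detail field; setting it to "low" makes the model work from a scaled-down copy of the image at a much smaller fixed token cost. A minimal variation of the request above:

from openai import OpenAI

client = OpenAI()

# same request as before, but asking for low-detail image processing
# to cut the per-image token cost (at the expense of fine detail)
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "low",  # "low", "high" or "auto" (the default)
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)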
Send an image
Rather than a public URL, we can send the image as a base64-encoded version.
import base64
import requests
from dotenv import load_dotenv
import os

# load the API key from the .env file using python-dotenv (which is in the Pipfile)
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')

# Function to encode the image as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "pics/hchestnut.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # a data URL embeds the image bytes directly in the request
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
print(response.json())

The output was:
The image shows a cluster of white flowers with prominent pink spots and numerous stamens extending outward, creating a frilly appearance. The flowers are likely part of a single inflorescence, which is a cluster of flowers arranged on a stem that is composed of a main branch or a complicated arrangement of branches. The inflorescence is surrounded by green leaves with serrated margins. The flowers could possibly belong to a tree or large shrub, and the structure suggests that they might be part of the family Rosaceae, which includes many flowering trees and shrubs, though without more specific information or context, it is difficult to identify the exact species. The overall setting appears to be an outdoor area with abundant foliage, indicative of a garden or natural area.
Specific questions
This question sometimes didn’t work on ChatGPT-4
Is this a sensitive picture? please give 1 word summary. give a rating from 1 to 5 as to how sensitive it is please. summarise in 1 sentence
Errors like:
I’m sorry, I can’t assist with this request.
This seemed to happen on more sensitive photos, though it could also be the max_tokens limit cutting the response short.
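One way to tell the two cases apart is to check finish_reason on the returned choice (a sketch, reusing the response object from the earlier non-streaming example): "length" means the reply was truncated by max_tokens, while a refusal normally comes back as an ordinary "stop" completion with the apology as the message text.

choice = response.choices[0]

# "length" = the reply was cut off by max_tokens
# "content_filter" = the reply was blocked
# "stop" = a normal completion, so a refusal shows up in the text itself
if choice.finish_reason == "length":
    print("Hit the max_tokens limit - try increasing it")
elif choice.finish_reason == "content_filter":
    print("Response was blocked by the content filter")
else:
    print(choice.message.content)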
Final Code
Here is what I used in my final production code, with the prompt engineering thought out.
The code is in gpt-vision-api/7spreadsheet-chatgpt.
Structure of test project
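Reconstructed from the paths the script below uses, the test project layout looks roughly like this (the script filename is an assumption):

7spreadsheet-chatgpt/
├── main.py                      # the script below
├── Pipfile
├── .env                         # OPENAI_API_KEY=...
├── secrets/
│   └── service_account.json     # Google service account for gspread
├── pics/
│   ├── 001/                     # one subdirectory per Entry Number, holding .jpg files
│   └── 002/
└── logs/
    └── 0trace.log               # loguru output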
import base64
from dotenv import load_dotenv
import os
import gspread
from loguru import logger
from openai import OpenAI
import json


def call_gpt_vision(image_path, text):
    load_dotenv()
    api_key = os.getenv('OPENAI_API_KEY')

    # encode the image as a base64 string
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    base64_image = encode_image(image_path)

    # do I even need to pass the api key explicitly? OpenAI() would pick it up from the environment
    client = OpenAI(api_key=api_key)

    completion = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
        # https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format
        # response_format="json_object",
        # max_tokens=2000,
        max_tokens=600,
        stream=True
    )

    # stream the reply: print each chunk as it arrives and accumulate the full text
    content = ""
    for chunk in completion:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
            content = content + chunk.choices[0].delta.content
    logger.debug(content)
    return content


def main():
    # note: column indexes are 0-based when using get_all_values (we're indexing a python list);
    # for writing with gspread I have to add 1, as gspread cells are 1-based
    entry_number_column_index = 0
    llm_violence_column_index = 4
    llm_5_words_column_index = 5
    llm_1_sentence_column_index = 6
    llm_full_column_index = 7
    archive_status_column_index = 9

    # read Entry Number from spreadsheet
    # look for directory with same name
    # send image to be analysed
    # write result to spreadsheet

    # 1. spreadsheet
    # Authenticate using the JSON key file
    gc = gspread.service_account(filename='secrets/service_account.json')

    # Open the spreadsheet by its title
    spreadsheet_title = 'AA Demo Main'
    sh = gc.open(spreadsheet_title)
    worksheet = sh.sheet1

    # Read all values from the worksheet in one call - so we don't make an API call per row
    all_values = worksheet.get_all_values()

    # iterate over each Entry Number
    row_index = 1
    # for entry_number in entry_numbers[1:6]:  # skip first row which is the header, and slice.. gives up to 004
    for row in all_values[1:]:  # skip first row, which is the header
        entry_number = row[entry_number_column_index]
        row_index = row_index + 1

        if entry_number is None or entry_number == "":
            # logger.debug(f'ignoring row as no entry number: {row_index}')
            continue

        # only process if archiver status is blank
        archive_status = row[archive_status_column_index]
        if archive_status:
            # logger.debug(f'ignoring row: {row_index}')
            continue

        logger.info(f'row_index: {row_index} entry_number: {entry_number}')

        # does a directory exist with this name?
        subdirectory = f'pics/{entry_number}'
        if not os.path.exists(subdirectory) or not os.path.isdir(subdirectory):
            # logger.debug(f"The specified path is not a valid directory: {subdirectory}")
            continue  # the for loop

        # collect the .jpg files in the subdirectory
        files = []
        for f in os.listdir(subdirectory):
            if os.path.isfile(os.path.join(subdirectory, f)):
                if f.lower().endswith('.jpg'):
                    files.append(f)

        # Iterate over each image file in the subdirectory
        for file in files:
            logger.debug(f'Calling ChatGPT-4 to analyse this image {file} for Entry Number {entry_number}')
            image_path = subdirectory + '/' + file

            # 1. Traumatic Violence Level
            # Foo1
            text = """
            You are a senior human rights investigator.
            You are investigating war crimes and human rights violations in Myanmar.
            Analyse this photo and give a long summary of it. You should focus on anything to do with human rights, e.g. war, guns, blood, destruction.
            Give a rating from 1 to 5 as to how sensitive / traumatic / violent it would be for a person to look at.
            For example an image with human deaths or graphic human suffering would be a 5 (very severe), a burned-out village with property damage would be a 3,
            destroyed buildings a 3, whereas a landscape scene / nothing traumatic would be a 1 (no violence), civilian equipment e.g. a crane would be a 1.
            Only give a high rating if it is truly horrible or traumatic - i.e. a 5 should only be given for human remains etc.
            Use clear and concise non-emotive language in a formal tone.
            Output in valid json format with the keys: rating, summary, shortsummary
            """
            # Let's think step by step, and give a detailed reason why you gave a rating. Put this into the stepbystep output.

            # Foo2 - alternative prompt used for testing
            # text = """
            # Describe in detail what you see in this image. You are a human rights investigator. This image is from a war torn part of the world
            # """

            gpt_result = call_gpt_vision(image_path, text)
            # logger.debug(f'llm violence result is {gpt_result}')
            # # put into LLM full column for testing
            # # maybe another model is better at doing 1 - 5 violence level eg mixtral 7b
            # worksheet.update_cell(row_index, llm_full_column_index + 1, gpt_result)
            # continue
            # Foo2 end

            # Convert the GPT-4 output to a Python dictionary.
            # For the conversion to work I need to strip off the ```json fence
            # at the start and the ``` fence at the end.
            cleaned = gpt_result.replace("```json\n", "", 1)
            cleaned = cleaned.replace("```", "", 1)
            try:
                data = json.loads(cleaned)
            except json.JSONDecodeError:
                logger.error("Can't decode json")
                continue

            rating = str(data["rating"])
            summary = data["summary"]
            # shortsummary = data["shortsummary"]
            shortsummary = data.get('shortsummary', '')

            # 1. violence level - append to any existing cell value
            current_value = worksheet.cell(row_index, llm_violence_column_index + 1).value
            if current_value:
                new_value = current_value + '\n\n' + rating
            else:
                new_value = rating
            worksheet.update_cell(row_index, llm_violence_column_index + 1, new_value)

            # 2. Describe in 5 words
            current_value = worksheet.cell(row_index, llm_5_words_column_index + 1).value
            if current_value:
                new_value = current_value + '\n\n' + shortsummary
            else:
                new_value = shortsummary
            worksheet.update_cell(row_index, llm_5_words_column_index + 1, new_value)

            # 3. Describe in 1 sentence
            current_value = worksheet.cell(row_index, llm_1_sentence_column_index + 1).value
            if current_value:
                new_value = current_value + '\n\n' + summary
            else:
                new_value = summary
            worksheet.update_cell(row_index, llm_1_sentence_column_index + 1, new_value)


if __name__ == "__main__":
    logger.add("logs/0trace.log", level="TRACE", rotation="00:00")
    main()
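To run it (the filename main.py is my assumption): pipenv run python main.py. Each GPT-4 reply streams to the terminal as it arrives, and after the markdown fences are stripped, a well-formed reply parses into JSON shaped like this illustrative example:

{
    "rating": 3,
    "summary": "A damaged residential building with collapsed walls and debris; no people are visible.",
    "shortsummary": "Damaged building, debris"
}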