
// COOKEDBAKEDWORKED.COM

Tools · Wed, May 6, 2026 · 4 min read

Ollama 0.5 adds vision support — local models can now see images

The local model runner ships native vision for llava, moondream, and llama3.2-vision. Drop an image in the prompt, get a response. No cloud, no API key.

Ollama 0.5 landed this week with multimodal support baked in. From the CLI, you include the image path in the prompt itself: `ollama run llava "what is in this image? ./photo.jpg"`. From the API, you send the image data with the request. Either way, the model runs entirely on your machine.

Which models support vision

llava — the original Ollama vision model, solid at describing images.
llava:34b — larger, slower, noticeably better at diagrams and charts.
moondream — tiny (1.6B), designed for edge devices and fast iteration.
llama3.2-vision — Meta’s multimodal Llama 3.2, strong on document understanding.
minicpm-v — efficient Chinese-developed vision model, good at dense text in images.

The API is unchanged. If you were already calling `POST /api/generate`, you now add an `images` field containing an array of base64-encoded images. Same endpoint, same authentication (none, it’s local), same JSON response.
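As a minimal sketch of that request, the Python below builds the JSON body for `POST /api/generate` with an `images` field and sends it using only the standard library. The address `http://localhost:11434` (Ollama's default), the helper names, and the `"stream": false` setting are assumptions that the article does not spell out, so treat this as an illustration and not the canonical client.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for /api/generate with one base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [encoded],  # list of base64 strings, one per image
        "stream": False,      # ask for a single JSON object, not a stream
    }


def describe_image(model: str, prompt: str, image_path: str) -> str:
    """Send an image file to a local vision model and return its text reply."""
    with open(image_path, "rb") as f:
        payload = build_vision_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_image("llava", "what is in this image?", "photo.jpg")` needs the Ollama server running and the model already pulled; `build_vision_payload` on its own has no such requirement and is handy for inspecting what gets sent.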

Why this matters for builders

Before 0.5, running local vision models meant standing up your own inference stack — llama.cpp with custom patches or a manual transformers setup. Now it’s `ollama pull llava` and you’re done. The latency is roughly 2–5 seconds per image on an M-series Mac, which is acceptable for most non-realtime use cases.
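To check the latency figure on your own hardware, one option is to read the timing fields that Ollama includes in a non-streaming `/api/generate` response. The sketch below assumes those fields (`total_duration`, `load_duration`, `eval_count`, `eval_duration`, all durations in nanoseconds) are present in the response; the helper name and the sample numbers are made up for illustration and are not measurements.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize_timing(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/sec."""
    total_s = response.get("total_duration", 0) / NANOSECONDS_PER_SECOND
    load_s = response.get("load_duration", 0) / NANOSECONDS_PER_SECOND
    eval_s = response.get("eval_duration", 0) / NANOSECONDS_PER_SECOND
    tokens = response.get("eval_count", 0)
    return {
        "total_seconds": total_s,
        "load_seconds": load_s,  # time spent loading the model into memory
        "tokens_per_second": tokens / eval_s if eval_s > 0 else 0.0,
    }


# Illustrative values only, not measurements.
sample = {
    "total_duration": 3_500_000_000,
    "load_duration": 500_000_000,
    "eval_count": 120,
    "eval_duration": 2_000_000_000,
}
print(summarize_timing(sample))
```

Separating `load_seconds` from the total is useful because the first request after a model is loaded tends to be slower than the ones that follow.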

For sensitive document processing — medical records, financial statements, anything you’d never send to OpenAI’s servers — this is the obvious answer. Describe the image locally, keep the data local.

// How to use this

Three things to try with local vision

01
Screenshot-to-text pipeline
Feed screenshots of documentation, error messages, or diagrams directly into a local model by putting the file path in the prompt: `ollama run llava "extract all text from this screenshot: ./error.png"`. No OCR library needed.
See the Ollama guide →
02
Local receipt / invoice reader
Drop a photo of a receipt into llava:34b and ask it to extract line items as JSON. It handles printed text well enough for structured extraction — and never leaves your machine.
03
Pair it with Open WebUI for a UI
Open WebUI 0.4 and later support drag-and-drop images for vision-capable Ollama models. You get a ChatGPT Vision-like interface, running 100% locally.
Set up Open WebUI →
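The receipt reader in item 02 could look roughly like this. It is a sketch, assuming the `/api/generate` request accepts `"format": "json"` to constrain the reply to JSON, and using a hypothetical schema (`items`, `description`, `price`) that the prompt asks for. The model may still return something unexpected, so the parser validates each item before trusting it.

```python
import base64
import json

# Hypothetical prompt and schema; adjust to the receipts you actually process.
RECEIPT_PROMPT = (
    "Extract the line items from this receipt. Respond with JSON of the form "
    '{"items": [{"description": "...", "price": 0.0}]}'
)


def build_receipt_request(image_bytes: bytes, model: str = "llava:34b") -> dict:
    """Build an /api/generate body that asks for JSON-only output."""
    return {
        "model": model,
        "prompt": RECEIPT_PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "format": "json",
        "stream": False,
    }


def parse_line_items(model_text: str) -> list[dict]:
    """Parse the model's reply, keeping only well-formed line items."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(data, dict):
        return []
    items = data.get("items")
    if not isinstance(items, list):
        return []
    cleaned = []
    for item in items:
        if not isinstance(item, dict):
            continue
        description = item.get("description")
        price = item.get("price")
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if isinstance(description, str) and is_number:
            cleaned.append({"description": description, "price": float(price)})
    return cleaned
```

The request body can be posted to the same local endpoint as any other vision call, and the `response` string from the reply goes to `parse_line_items`. Dropping malformed items silently is a design choice for the sketch; a real pipeline might log or flag them for review.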

// what we actually tested

What we actually tested

We ran llava and llava:34b on an M3 MacBook Pro and on a Windows machine with an RTX 3080. The claims above about those two models match what we saw. We did not test moondream on edge hardware or minicpm-v in non-English contexts.

Accuracy on complex diagrams and handwriting is still noticeably behind GPT-4o Vision. Use this for structured document extraction and clear printed text — for ambiguous handwriting, cloud still wins.
