Ollama 0.5 adds vision support — local models can now see images
The local model runner ships native vision for llava, moondream, and llama3.2-vision. Drop an image in the prompt, get a response. No cloud, no API key.
Ollama 0.5 landed this week with multimodal support baked in. You can now hand an image to the CLI or the API. On the CLI, put the file path in the prompt itself: `ollama run llava "What is in this image? ./photo.jpg"`. The model runs entirely on your machine.
The API interface is unchanged. If you were already calling `POST /api/generate`, you now pass an `images` field with base64-encoded image data. The same endpoint, same authentication (none, it’s local), same JSON response.
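To make the `images` field concrete, here is a minimal sketch of the request in Python, using only the standard library. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and that `llava` has already been pulled. The helper names `build_vision_payload` and `describe_image` are ours, not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for /api/generate with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        # `images` is a list of base64 strings, one per image.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for a single JSON object instead of a stream of chunks.
        "stream": False,
    }


def describe_image(path: str, prompt: str = "What is in this image?",
                   model: str = "llava") -> str:
    """Send a local image file to the model and return its text reply."""
    with open(path, "rb") as f:
        payload = build_vision_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `print(describe_image("photo.jpg"))` should print the model's description. No API key is involved because the request never leaves localhost.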
Before 0.5, running local vision models meant standing up your own inference stack — llama.cpp with custom patches or a manual transformers setup. Now it’s `ollama pull llava` and you’re done. The latency is roughly 2–5 seconds per image on an M-series Mac, which is acceptable for most non-realtime use cases.
For sensitive document processing — medical records, financial statements, anything you’d never send to OpenAI’s servers — this is the obvious answer. Describe the image locally, keep the data local.
Feed screenshots of documentation, error messages, or diagrams directly into a local model: `ollama run llava "Extract all text from this screenshot: ./error.png"`. No OCR library needed.
Drop a photo of a receipt into llava:34b and ask it to extract line items as JSON. It handles printed text well enough for structured extraction, and the receipt never leaves your machine.
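Model output is not guaranteed to be valid JSON, so it is worth validating before you trust it. The sketch below shows one way to do that. The prompt wording, the `items`/`name`/`price` schema, and the `parse_line_items` helper are all hypothetical choices of ours, not something llava or Ollama prescribes. Adding `"format": "json"` to the request body should make well-formed output more likely, but we would still parse defensively.

```python
import json

# Hypothetical prompt and schema, chosen only for illustration.
RECEIPT_PROMPT = (
    "Extract every line item from this receipt. "
    'Reply with JSON only, shaped as {"items": [{"name": "...", "price": 0.0}]}.'
)


def parse_line_items(model_text: str) -> list[dict]:
    """Parse the model's reply into line items; return [] if it is unusable."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError:
        return []  # The model replied with prose or broken JSON.
    if not isinstance(data, dict):
        return []
    items = data.get("items", [])
    if not isinstance(items, list):
        return []
    # Keep only entries that carry both fields we asked for.
    return [
        item for item in items
        if isinstance(item, dict) and "name" in item and "price" in item
    ]
```

Pass the model's `response` string to `parse_line_items`. An empty list is the cue to retry or fall back to a manual check.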
Open WebUI 0.4+ added drag-and-drop image support for vision-capable Ollama models. You get a ChatGPT Vision-like interface, running 100% locally.
We ran llava and llava:34b on an M3 MacBook Pro and an RTX 3080 Windows machine. The results above match what we saw. We did not test moondream on edge hardware or minicpm-v in non-English contexts.
Accuracy on complex diagrams and handwriting is still noticeably behind GPT-4o Vision. Use this for structured document extraction and clear printed text — for ambiguous handwriting, cloud still wins.