LIVE | Reading: News | Updated: 10 min ago | Subscribers: 23,400
CBW

Ollama 0.5 adds vision support — local models can now see images

The local model runner ships native vision for llava, moondream, and llama3.2-vision. Drop an image in the prompt, get a response. No cloud, no API key.

Ollama 0.5 landed this week with multimodal support baked in. You can now hand an image to the CLI by including its file path in the prompt, for example `ollama run llava "what is in this image? ./photo.jpg"`, or send the image data through the API. The model runs entirely on your machine.

Which models support vision

The API is unchanged. If you were already calling `POST /api/generate`, you now add an `images` field containing base64-encoded image data. Same endpoint, same authentication (none, it’s local), same JSON response.
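
As a rough illustration of the request described above, here is a minimal Python sketch using only the standard library. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and that the `llava` model has already been pulled. The helper names `build_generate_payload` and `describe_image` are ours, not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for POST /api/generate with one attached image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [encoded],  # list of base64-encoded images
        "stream": False,      # ask for a single JSON object, not a stream
    }


def describe_image(path: str, prompt: str, model: str = "llava") -> str:
    """Send a local image file to the model and return its text reply."""
    with open(path, "rb") as f:
        payload = build_generate_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # data present, so this is a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]
```

With a server running, `describe_image("photo.jpg", "what is in this image?")` should return the model's description as a string. Keeping the payload builder separate from the network call makes it easy to inspect or test the request without a model loaded.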

Why this matters for builders

Before 0.5, running local vision models meant standing up your own inference stack — llama.cpp with custom patches or a manual transformers setup. Now it’s `ollama pull llava` and you’re done. The latency is roughly 2–5 seconds per image on an M-series Mac, which is acceptable for most non-realtime use cases.
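
If you want to check the latency figure on your own hardware, one option is to read the timing fields Ollama includes in a non-streaming `/api/generate` reply. The sketch below assumes the reply has already been parsed into a dict and relies on the `total_duration` and `load_duration` fields, which are reported in nanoseconds. The function names and the 5-second budget are illustrative choices, not anything the release prescribes.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def seconds_per_image(reply: dict, include_load: bool = True) -> float:
    """Return the request duration in seconds from a parsed /api/generate reply.

    With include_load=False, the time spent loading the model into memory
    (load_duration) is subtracted, which is closer to steady-state latency.
    """
    total = reply["total_duration"]
    if not include_load:
        total -= reply.get("load_duration", 0)
    return total / NANOSECONDS_PER_SECOND


def within_budget(reply: dict, budget_seconds: float = 5.0) -> bool:
    """True if the steady-state latency fits the given budget."""
    return seconds_per_image(reply, include_load=False) <= budget_seconds
```

The first request after a pull usually includes model load time, so comparing against the budget with load time excluded gives a fairer picture of per-image cost.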

For sensitive document processing — medical records, financial statements, anything you’d never send to OpenAI’s servers — this is the obvious answer. Describe the image locally, keep the data local.

// How to use this

Three things to try with local vision

  1. Screenshot-to-text pipeline

    Feed screenshots of documentation, error messages, or diagrams directly into a local model: `ollama run llava "extract all text from this screenshot: ./error.png"`. No OCR library needed.

    See the Ollama guide →
  2. Local receipt / invoice reader

    Drop a photo of a receipt into llava:34b and ask it to extract line items as JSON. It handles printed text well enough for structured extraction, and the image never leaves your machine.

  3. Pair it with Open WebUI for a UI

    Open WebUI 0.4 and later support drag-and-drop images for vision-capable Ollama models. You get a ChatGPT Vision-like interface, running 100% locally.

    Set up Open WebUI →
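
For the receipt idea, a small sketch of the request and the parsing step might look like the following. It assumes the `format: "json"` option on `/api/generate` to nudge the model toward valid JSON, and it assumes a reply schema of our own invention (`{"items": [{"name": ..., "price": ...}]}`) that the prompt asks for. The model is not guaranteed to follow that schema, so the parser drops anything that does not fit. The names `RECEIPT_PROMPT`, `build_receipt_request`, and `parse_line_items` are hypothetical.

```python
import base64
import json

# Hypothetical prompt; the schema it asks for is our own convention.
RECEIPT_PROMPT = (
    "Extract the line items from this receipt. Reply with JSON shaped like "
    '{"items": [{"name": "string", "price": 0.0}]}'
)


def build_receipt_request(image_bytes: bytes, model: str = "llava:34b") -> dict:
    """Build a /api/generate body that asks for JSON output for one receipt image."""
    return {
        "model": model,
        "prompt": RECEIPT_PROMPT,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "format": "json",  # request JSON-formatted output
        "stream": False,
    }


def parse_line_items(reply: dict) -> list:
    """Pull well-formed line items out of a parsed /api/generate reply."""
    try:
        data = json.loads(reply["response"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return []
    if not isinstance(data, dict):
        return []
    cleaned = []
    for item in data.get("items", []):
        if not isinstance(item, dict):
            continue
        name, price = item.get("name"), item.get("price")
        price_ok = isinstance(price, (int, float)) and not isinstance(price, bool)
        if isinstance(name, str) and price_ok:
            cleaned.append({"name": name.strip(), "price": float(price)})
    return cleaned
```

Validating the output before using it matters here because a vision model can misread digits or return partial structures. Treat the parsed items as a draft to review, not as ground truth.
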
// what we actually tested

What we actually tested

We ran llava and llava:34b on an M3 MacBook Pro and an RTX 3080 Windows machine. The results above match what we saw. We did not test moondream on edge hardware or minicpm-v in non-English contexts.

Accuracy on complex diagrams and handwriting is still noticeably behind GPT-4o Vision. Use this for structured document extraction and clear printed text — for ambiguous handwriting, cloud still wins.
