PATTERN · 01 · LAB

Vision LLMs classifying damaged pallet blocks.

When the dataset is small and the failure modes are describable in words, a vision LLM with a tight prompt beats a fine-tune, with a fraction of the eval overhead. A note on when to reach for which.

Vision LLM · Few-shot · Industrial QC · After: Anthropic / OpenAI vision docs
THE PROBLEM
Sort a stack of returned pallets. Most are fine. Some are cracked, splintered, or rotted.

Pallet pooling depots receive thousands of returned pallets a day. A small fraction have damaged blocks (the four wooden cubes that bear the load) and have to be pulled before they go back into circulation. Block damage isn't visually subtle: cracks, missing chunks, deep splinters. A human can call it in a second. The question is how to do it at depot throughput without hiring twenty more humans.

The instinct is to train a small CNN: take 50,000 images, label each block as OK/damaged, train a model that runs on cheap hardware. This works. It's also a six-month project with a labeling vendor, an MLOps pipeline, and a retraining cadence the depot doesn't want to own.

THE PATTERN
Vision LLM + tight prompt + structured output.

Skip the training. Feed an image of each block face to a vision LLM. State in the prompt what counts as damage: visible crack longer than X cm, missing chunk larger than Y, etc. Provide three or four few-shot examples, labeled images of borderline cases, directly in the prompt. Get back structured JSON: { status: "ok" | "damaged", confidence: 0-1, reason: string }.
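The instruction text and the reply schema described above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the names `build_instructions`, `parse_verdict`, and `BlockVerdict` are hypothetical, the example criteria are illustrative stand-ins for a depot's real thresholds, and the actual call to a vision API (including how the few-shot images are attached) is left out because it depends on the provider's SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockVerdict:
    status: str        # "ok" or "damaged"
    confidence: float  # between 0.0 and 1.0
    reason: str


def build_instructions(criteria: list[str]) -> str:
    """Render one depot's damage criteria as the text part of the prompt."""
    rules = "\n".join(f"- {rule}" for rule in criteria)
    return (
        "You are inspecting one face of a wooden pallet block.\n"
        "Mark the block as damaged if any of these apply:\n"
        f"{rules}\n"
        'Reply with JSON only: {"status": "ok" | "damaged", '
        '"confidence": <number from 0 to 1>, "reason": <string>}'
    )


def parse_verdict(raw: str) -> BlockVerdict:
    """Validate the model's reply; raise ValueError on anything off-schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    status = data.get("status")
    confidence = data.get("confidence")
    reason = data.get("reason")
    if status not in ("ok", "damaged"):
        raise ValueError(f"unexpected status: {status!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return BlockVerdict(status, float(confidence), reason)


# Illustrative criteria only; each depot supplies its own wording and limits.
instructions = build_instructions([
    "a visible crack runs most of the way across the face",
    "a chunk is missing from a corner or edge",
])
```

Rejecting off-schema replies at the boundary keeps everything downstream (routing, logging, evaluation) working on a fixed three-field record, and swapping the criteria list is all it takes to adapt the prompt to another depot.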

Two things matter for this to work cheaply at depot throughput. First, the prompt itself is the model. Every depot's damage criteria are slightly different, and the prompt can encode that without retraining. Second, the structured output makes the eval loop trivial: run the model on a held-out set of human-labeled images, count agreements, and iterate on the prompt where it disagrees.
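The eval loop described here amounts to a comparison of two label sets. A minimal sketch, assuming model statuses and human labels are both stored as dictionaries keyed by an image identifier (the function name `evaluate` and the data layout are hypothetical):

```python
from collections import Counter


def evaluate(predicted: dict[str, str], human: dict[str, str]):
    """Compare model statuses against human labels on a held-out set.

    Returns the agreement rate, a Counter of (human, model) label pairs,
    and the image ids where the two disagree.
    """
    if not human:
        raise ValueError("held-out set is empty")
    pairs = Counter()
    disagreements = []
    for image_id, truth in human.items():
        guess = predicted[image_id]  # KeyError if an image was never classified
        pairs[(truth, guess)] += 1
        if guess != truth:
            disagreements.append(image_id)
    agreement = (len(human) - len(disagreements)) / len(human)
    return agreement, pairs, disagreements


# Toy data for illustration only.
human = {"a": "ok", "b": "damaged", "c": "damaged", "d": "ok"}
model = {"a": "ok", "b": "damaged", "c": "ok", "d": "ok"}
agreement, pairs, disagreements = evaluate(model, human)
print(agreement, disagreements)  # prints: 0.75 ['c']
```

The list of disagreeing image ids is the useful output: those are the images to look at when revising the prompt wording or choosing new few-shot examples. The pair counts separate missed damage, `("damaged", "ok")`, from false alarms, `("ok", "damaged")`, which usually carry different costs.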

WHEN TO REACH FOR THIS
Three conditions.
01
Small dataset
You have hundreds of labeled examples, not millions. Fine-tuning is over-engineering.
02
Describable criteria
A human can explain the rule in two sentences. The prompt can encode it.
03
Per-call cost < per-call value
Each classification costs ~€0.005–€0.02. The math works from thousands of decisions per day upward, provided each error carries a non-trivial cost.
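The third condition can be checked with back-of-the-envelope arithmetic. The sketch below uses the per-call cost range quoted above; the daily volume, the damage rate, and the cost of a missed damaged block are hypothetical placeholders, and the value model (every damaged block is caught) is a deliberate simplification.

```python
def daily_spend(decisions_per_day: int, cost_per_call: float) -> float:
    """Total classification spend per day, in euros."""
    return decisions_per_day * cost_per_call


def value_per_call(damage_rate: float, cost_per_missed_block: float) -> float:
    """Expected loss avoided per classification.

    Simplifying assumption: every damaged block is caught.
    """
    return damage_rate * cost_per_missed_block


def worth_it(cost_per_call: float, damage_rate: float,
             cost_per_missed_block: float) -> bool:
    """Condition 03: per-call cost must be below per-call value."""
    return cost_per_call < value_per_call(damage_rate, cost_per_missed_block)


# Cost range from the text; every other number is a hypothetical placeholder.
LOW, HIGH = 0.005, 0.02
decisions = 5_000
print(f"daily spend: EUR {daily_spend(decisions, LOW):.2f} "
      f"to EUR {daily_spend(decisions, HIGH):.2f}")
print(worth_it(HIGH, damage_rate=0.02, cost_per_missed_block=5.0))
```

Plugging in a depot's own damage rate and cost per missed block turns the condition into a single yes/no answer before any architecture work starts.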
WHEN NOT TO
Three failure modes.

Sub-100ms latency requirements (vision LLMs are slower than a CNN running on-device). Criteria that can't be verbalized (subtle defects only a trained eye catches). Massive throughput where the per-image cost stacks up faster than a one-time training investment would.
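The throughput failure mode has a simple break-even form: the point at which cumulative per-image spend on the vision LLM overtakes a one-time training investment plus the cheaper per-image cost of a trained model. A rough sketch, with a hypothetical function name and entirely hypothetical example figures:

```python
import math


def break_even_images(training_cost: float, llm_cost_per_image: float,
                      cnn_cost_per_image: float) -> int:
    """Number of images after which the trained model is the cheaper option."""
    saving_per_image = llm_cost_per_image - cnn_cost_per_image
    if saving_per_image <= 0:
        raise ValueError("the trained model never pays back its training cost")
    return math.ceil(training_cost / saving_per_image)


# Hypothetical figures: EUR 60,000 to train, EUR 0.01 vs EUR 0.001 per image.
images = break_even_images(60_000, 0.01, 0.001)
print(f"break-even after about {images / 1e6:.1f} million images")
```

Dividing the break-even count by the depot's daily image volume gives the payback period in days, which is the number to compare against how long the damage criteria are expected to stay stable.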

WHAT WE'D BRING TO A CLIENT ENGAGEMENT
A working prototype in week one. A real eval against your labeled images in week two. A cost-per-decision number you can show the CFO. The unit-economics conversation comes before the architecture conversation.
PRIOR ART · STANDING ON SHOULDERS
  • Anthropic. Vision capabilities docs & cookbook
  • OpenAI. GPT-4 vision few-shot patterns
  • Industrial CV literature on pallet-block defect classification (early-2020s)