Vision

Enable AI agents to interpret images alongside text for richer understanding and multimodal interactions.

Overview

Vision capabilities allow your agents to analyze images, understand visual content, and respond to questions about what they see. This is useful for image analysis, UI review, document processing, and more.

Note

Vision is supported by most modern models, including GPT-5, Claude Sonnet 4.5, and Claude Opus 4.5.

Basic Usage

Send images as parts of a message's content array, either as image URLs or as base64 data:

vision-basic.ts
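A minimal sketch of the content array shape, assuming a message whose `content` is a list of typed parts. The type names (`TextPart`, `ImagePart`, `UserMessage`), the field names (`type`, `text`, `image`), the helper `buildVisionMessage`, and the example URL are all illustrative assumptions, not the framework's confirmed API; check them against the message types your agent library actually exports.

```typescript
// Hypothetical part types: one for text, one for an image reference.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string }; // URL or data URL
type ContentPart = TextPart | ImagePart;
type UserMessage = { role: "user"; content: ContentPart[] };

// Build a user message that pairs a question with a single image URL.
function buildVisionMessage(prompt: string, imageUrl: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: prompt },
      { type: "image", image: imageUrl },
    ],
  };
}

// Placeholder URL; replace with an image your model provider can fetch.
const message = buildVisionMessage(
  "What is shown in this image?",
  "https://example.com/screenshot.png",
);
console.log(JSON.stringify(message, null, 2));
```

The resulting object would then be passed to whatever call your agent uses to send messages; that call is left out here because its signature depends on the library.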

Using Base64 Images

For local images or when you need to embed the image data directly:

vision-base64.ts
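One common way to embed image data directly is a base64 data URL. The sketch below assumes the image part accepts such a URL in its `image` field; that field name and the helper `toDataUrl` are hypothetical. It takes raw bytes rather than a file path so that it stays runtime-neutral; in Node.js you would typically obtain the bytes with `fs.readFileSync`.

```typescript
// Encode raw image bytes as a data URL: data:<mime>;base64,<payload>
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    // Map each byte to a single character so btoa can encode it.
    binary += String.fromCharCode(bytes[i] as number);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Stand-in bytes (the first four bytes of a PNG signature); a real
// image would be read from disk or received from an upload.
const pngBytes = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);

// Hypothetical image part carrying the embedded data.
const imagePart = { type: "image" as const, image: toDataUrl(pngBytes, "image/png") };
console.log(imagePart.image);
```

Base64 inflates the payload by roughly a third, so image URLs are usually the lighter option when the image is already hosted somewhere the model provider can reach.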

Multiple Images

You can include multiple images in a single request:

vision-multiple.ts
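As a sketch of a multi-image request, the content array can simply hold one text part followed by one image part per image. As before, the part types and the helper `buildMultiImageMessage` are assumptions made for illustration, and the URLs are placeholders.

```typescript
// Hypothetical part types, matching a typed content array.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string }; // URL or data URL
type ContentPart = TextPart | ImagePart;
type UserMessage = { role: "user"; content: ContentPart[] };

// Build one user message: the prompt first, then every image in order.
function buildMultiImageMessage(prompt: string, images: string[]): UserMessage {
  const imageParts = images.map(
    (image): ImagePart => ({ type: "image", image }),
  );
  return {
    role: "user",
    content: [{ type: "text", text: prompt }, ...imageParts],
  };
}

const comparison = buildMultiImageMessage(
  "Compare these two designs and list the differences.",
  ["https://example.com/design-a.png", "https://example.com/design-b.png"],
);
console.log(comparison.content.length);
```

Keeping the images in a stable order matters when the prompt refers to them by position, for example "the first screenshot".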

Common Use Cases

🎨 UI/UX Review

Analyze screenshots for accessibility issues, design inconsistencies, or improvement suggestions.

📄 Document Processing

Extract information from scanned documents, receipts, or handwritten notes.

🔍 Code Review

Analyze architecture diagrams or flowcharts to understand system design as part of a code review.

📊 Data Extraction

Extract data from charts, graphs, or tables in images.

Next Steps

Parallel Execution

Run multiple agents concurrently

Workflows

Compose sequential and parallel task execution