read_file tool, and return multimodal content from custom tools.
Built-in context compression is primarily text-oriented. Plan multimodal workloads accordingly: store large media in a backend and pass references when possible.
Multimodal user input
Pass multimodal content in themessages you send to the agent, using the same standard content blocks as LangChain chat models:
Built-in read_file tool
The harness read_file tool returns standard content blocks for supported multimodal files instead of plain text. The agent can inspect images, documents, and media stored in its filesystem when the selected model supports the corresponding modality. Check the provider’s documentation for your model’s supported MIME types.
Supported multimodal file extensions
Supported multimodal file extensions
Custom tool outputs
Custom tools can contain multimodal files, such as images:ToolMessage the model reads on the next turn. Access the normalized representation with content_blocks on the resulting message. For return-type options, serialization behavior, and MCP examples, see Tool return values and Multimodal tool content.
Context compression and multimodal content
Built-in offloading and summarization are optimized for text and message history:- Offloading measures text tokens only. Non-text blocks (including images) are preserved in replacement messages rather than compressed. A message that contains only an image is not offloaded based on image size alone.
-
Summarization compacts older messages into a text-only summary. Image, audio, video, and file blocks in that range are not carried forward—the model only sees what the summarizer writes about them. Recent messages below the keep threshold stay unchanged.
When summarization runs, media blocks in older turns drop out of the active context:
The original conversation is still written to the filesystem as text. See Summarization for triggers, keep thresholds, and the full flow.
- Store images, screenshots, and charts in a filesystem backend or external object store, then pass file paths or URLs through messages.
- Prefer references over base64-encoded image blocks in long-running conversations.
- Use subagents for image-heavy inspection so the main agent receives a compact text result.
- Tune summarization thresholds or provide a custom token counter when your provider charges many tokens for images.
Connect these docs to Claude, VSCode, and more via MCP for real-time answers.

