Qwen: Qwen3 VL 32B Instruct

qwen/qwen3-vl-32b-instruct

VisionTool useJSONStreaming

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...

Pricing

Input

$0.104 / 1M

Output

$0.416 / 1M

Specs

Context

262,144 tokens

Input

text, image

Output

text

Released: 2025-10

Supported parameters

max_tokenspresence_penaltyresponse_formatseedtemperaturetool_choicetoolstop_p

Open weights · HuggingFace

1,803,347 downloads/mo

202 likes

apache-2.0 image-text-to-text

arXiv:2505.09388 arXiv:2502.13923 arXiv:2409.12191 arXiv:2308.12966

View on HuggingFace →

Use this model →