← All models View on HuggingFace →
Qwen: Qwen3 VL 32B Instruct
qwen/qwen3-vl-32b-instruct
VisionTool useJSONStreaming
Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...
Pricing
Input
$0.104 / 1M
Output
$0.416 / 1M
Specs
Context
262,144 tokens
Input
text, image
Output
text
Released: 2025-10
Supported parameters
max_tokenspresence_penaltyresponse_formatseedtemperaturetool_choicetoolstop_p
Open weights · HuggingFace
1,803,347 downloads/mo
202 likes
apache-2.0 image-text-to-text