ByteDance: UI-TARS 7B

bytedance/ui-tars-1.5-7b

VisionStreaming

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement...

Pricing

Input

$0.1 / 1M

Output

$0.2 / 1M

Specs

Context

128,000 tokens

Input

image, text

Output

text

Knowledge cutoff: 2025-01-31

Released: 2025-07

Supported parameters

frequency_penaltylogit_biasmax_tokenspresence_penaltyrepetition_penaltyseedstoptemperaturetop_ktop_p

Open weights · HuggingFace

442,785 downloads/mo

554 likes

apache-2.0 image-text-to-text

arXiv:2501.12326 arXiv:2404.07972 arXiv:2409.08264 arXiv:2401.13919 arXiv:2504.01382 arXiv:2405.14573 arXiv:2410.23218 arXiv:2504.07981

View on HuggingFace →

Use this model →