Molmo2-8B
multimodal model optimized for image and video understanding with strong grounding and reasoning capabilities
allenai/Molmo2-8B is an 8B multimodal vision-language model from the Allen Institute for AI, built on Qwen3-8B with a SigLIP-2 vision backbone, designed for image, multi-image, and video understanding tasks such as captioning, counting, tracking, and visual grounding, delivering state-of-the-art performance among open-weight models while remaining efficient for production deployments.
Metadata
Provider
allenai
Modality
image
API type
image2text
Source
huggingface /
allenai/Molmo2-8B
Created
2026-04-03 01:08:45 UTC
Updated
2026-04-03 20:29:28 UTC
Catalog version
2
Visibility
Published
Specifications
Parameters
3.00B
MoE
No
Max model length
10000
Image
vllm/vllm-openai:v0.15.0
Default Deploy Config
GPU count
1
vRAM
30000 MB
Summary
1xGPU 30000 MB
Recommended Use Cases
- Image captioning
- Image classification
- Image understanding