InferX Catalog | Molmo2-8B

Molmo2-8B

multimodal model optimized for image and video understanding with strong grounding and reasoning capabilities

allenai image image2text multimodal

allenai/Molmo2-8B is an 8B multimodal vision-language model from the Allen Institute for AI, built on Qwen3-8B with a SigLIP-2 vision backbone, designed for image, multi-image, and video understanding tasks such as captioning, counting, tracking, and visual grounding, delivering state-of-the-art performance among open-weight models while remaining efficient for production deployments.

Metadata

Provider

allenai

Modality

image

API type

image2text

Source

huggingface / allenai/Molmo2-8B

Created

2026-04-03 01:08:45 UTC

Updated

2026-04-03 20:29:28 UTC

Catalog version

Visibility

Published

Specifications

Parameters

3.00B

MoE

Max model length

10000

Image

vllm/vllm-openai:v0.15.0

Default Deploy Config

GPU count

vRAM

30000 MB

Summary

1xGPU 30000 MB

Recommended Use Cases

Image captioning
Image classification
Image understanding

Model Spec