InferX Beta Serverless GPU Inference Platform, Built for Agent-Native Workloads

Molmo2-8B

multimodal model optimized for image and video understanding with strong grounding and reasoning capabilities
allenai image image2text multimodal

allenai/Molmo2-8B is an 8B multimodal vision-language model from the Allen Institute for AI, built on Qwen3-8B with a SigLIP-2 vision backbone, designed for image, multi-image, and video understanding tasks such as captioning, counting, tracking, and visual grounding, delivering state-of-the-art performance among open-weight models while remaining efficient for production deployments.

Log in to deploy: this public page shows the catalog model details, but deployment and customization stay behind login.
Log in to deploy

Metadata

Provider
allenai
Modality
image
API type
image2text
Source
huggingface / allenai/Molmo2-8B
Created
2026-04-03 01:08:45 UTC
Updated
2026-04-03 20:29:28 UTC
Catalog version
2
Visibility
Published

Specifications

Parameters
3.00B
MoE
No
Max model length
10000
Image
vllm/vllm-openai:v0.15.0

Default Deploy Config

GPU count
1
vRAM
30000 MB
Summary
1xGPU 30000 MB

Recommended Use Cases

  • Image captioning
  • Image classification
  • Image understanding

Model Spec