InferX Beta Serverless GPU Inference Platform, Built for Agent-Native Workloads

Qwen3.5-9B-NVFP4

A 9B Qwen3.5 model quantized to NVFP4 for ultra-efficient, low-memory inference.
ykarout multimodal text2text reasoning coding chat

ykarout/Qwen3.5-9B-NVFP4 is a 9B Qwen3.5 model optimized using NVIDIA NVFP4 4-bit floating-point quantization, significantly reducing GPU memory usage and improving throughput while maintaining strong reasoning, coding, and chat capabilities, making it ideal for cost-efficient deployments and high-concurrency inference. NVFP4 quantization can reduce memory usage by roughly 4× compared to BF16 while preserving useful numerical properties for LLM inference.

Log in to deploy: this public page shows the catalog model details, but deployment and customization stay behind login.
Log in to deploy

Metadata

Provider
ykarout
Modality
multimodal
API type
text2text
Source
huggingface / ykarout/Qwen3.5-9B-NVFP4
Created
2026-04-03 18:53:31 UTC
Updated
2026-04-20 11:19:21 UTC
Catalog version
5
Visibility
Published

Specifications

Parameters
7.00B
MoE
No
Max model length
65536
Image
inferx/vllm-openai:v0.19.0

Default Deploy Config

GPU count
1
vRAM
50000 MB
Summary
1xGPU 50000 MB

Recommended Use Cases

  • Chatbot
  • Coding assistant

Model Spec