InferX Beta Serverless GPU Inference Platform, Built for Agent-Native Workloads

Browse published catalog models for agent-native workloads and serverless GPU inference. Log in when you want to customize one or deploy it into your own tenant.
Model Intro Tags Action
chatterbox-tts
Chatterbox TTS is a state-of-the-art, open-source text-to-speech system developed by Resemble AI. It supports zero-shot voice cloning, emotion control, and high-quality audio generation, all MIT-licensed and fully production-ready. View on Hugging Face
cogito-v1-preview-llama-3B
Cogito-v1-preview-llama-3B is a high-performance "hybrid reasoning" model released by Deep Cogito, built on the Llama 3.2 3B architecture. View on Hugging Face
cogito-v1-preview-llama-8B
Cogito-v1-preview-llama-8B is an 8-billion parameter hybrid reasoning model released in April 2025 by San Francisco-based startup Deep Cogito. View on Hugging Face
cogito-v1-preview-qwen-14B
Cogito-v1-preview-qwen-14B is a hybrid reasoning model developed by Deep Cogito, a San Francisco-based startup that emerged from stealth in April 2025. It is built on the Qwen 2.5 architecture but heavily modified to include self-reflection and "deep thinking" capabilities similar to OpenAI’s o1 or DeepSeek-R1. View on Hugging Face
cogito-v1-preview-qwen-32B
Cogito-v1-preview-qwen-32B (often referred to as Cogito v1 Preview) is a high-performance hybrid reasoning model developed by Deep Cogito. View on Hugging Face
context-1
Context-1 is a 20B parameter agentic search model trained to retrieve supporting documents for complex, multi-hop queries. It is designed to be used as a retrieval subagent alongside a frontier reasoning model View on Hugging Face
Cydonia-24B-v4.3
A general-purpose model optimized for strong reasoning, coding, and chat performance View on Hugging Face
coding reasoning
DeepCoder-14B-Preview
DeepCoder-14B-Preview is a high-performance, open-source code reasoning model View on Hugging Face
DeepCoder-1.5B-Preview
DeepCoder-1.5B-Preview is a lightweight yet powerful code-reasoning model released in April 2025 as part of the DeepCoder series by the Agentica team and Together AI. View on Hugging Face
DeepSeek-Coder-V2-Lite-Instruct
A lightweight coding model designed for efficient code generation and reasoning. View on Hugging Face
reasoning coding
DeepSeek-OCR
DeepSeek-OCR (released in late 2025, with v2 arriving in January 2026) is a specialized multimodal model designed to solve the "token explosion" problem in traditional Document AI. While standard Vision-Language Models (VLMs) often convert a single page into thousands of tokens, DeepSeek-OCR treats OCR as a multimodal compression task, achieving high accuracy with a fraction of the computational cost. View on Hugging Face
DeepSeek-Prover-V2-7B
DeepSeek-Prover-V2-7B is a specialized, open-source language model released in 2025 that focuses on formal theorem proving in Lean 4. View on Hugging Face
DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Qwen-14B is a premier reasoning model from the DeepSeek-R1 family, specifically engineered to deliver frontier-level logic within a compact 14.7-billion parameter frame. View on Hugging Face
Devstral-Small-2-24B-Instruct-2512
Devstral is an agentic LLM for software engineering tasks. Devstral Small 2 excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-bench. View on Hugging Face
Devstral-Small-2505
Devstral-Small-2505 is a specialized, agentic large language model released in May 2025 through a collaboration between Mistral AI and All Hands AI. View on Hugging Face
ERNIE-4.5-21B-A3B-PT
ERNIE-4.5-21B-A3B-PT is a high-efficiency Large Language Model (LLM) developed by Baidu, released as part of their ERNIE 4.5 family in late 2025. It is specifically designed to balance high-level reasoning capabilities with low computational costs. View on Hugging Face
FLUX.1-dev
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. View on Hugging Face
FLUX.2-klein-9B
FLUX.2-klein-9B is a high-performance, mid-sized text-to-image model that belongs to the next generation of the FLUX family (developed by Black Forest Labs). View on Hugging Face
gemma-3-12b-it
Gemma-3-12B-IT is a mid-sized, instruction-tuned multimodal model from Google’s Gemma 3 family. View on Hugging Face
gemma-3-1b-it
A compact instruction-tuned model designed for fast and efficient general-purpose tasks View on Hugging Face
low-latency
gemma-3-27b-it
An instruction-tuned model designed for strong reasoning, coding, and chat performance View on Hugging Face
reasoning coding chat
gemma-3-4b-it
The Gemma-3-4B-IT (Instruction Tuned) is a mid-sized, multimodal model from Google’s latest open-weights family, released in March 2025. It represents a significant architectural shift from the Gemma 2 series, moving from a text-only focus to a native vision-language (multimodal) design. View on Hugging Face
gemma-3n-E4B-it
Gemma-3n-E4B-it is an instruction-tuned multimodal model from Google’s Gemma 3n family, released in 2025 and designed for efficient execution on resource-constrained devices. The "E4B" in the name refers to an effective parameter count of roughly 4 billion. View on Hugging Face
gemma-4-31B-it
gemma-4-31B-it View on Hugging Face
gemma-4-31B-it-uncensored-heretic
A 31B Gemma-4 model modified for reduced safety restrictions and more open responses View on Hugging Face
reasoning coding chat
gemma-4-31B-it-uncensored-heretic-2GPU
gemma-4-31B-it-uncensored-heretic-2GPU View on Hugging Face
gemma-4-E2B-it
An efficient Gemma 4 model optimized for strong performance with lower resource usage View on Hugging Face
low-latency
gemma-4-E2B-it_sound
gemma-4-E2B-it with sound support View on Hugging Face
gemma-4-E4B-it
gemma-4-E4B-it View on Hugging Face
Glistening-Gem-31B-v1.0
Glistening-Gem-31B-v1.0 View on Hugging Face
GLM-OCR
GLM-OCR is a compact, high-performance multimodal model released in February 2026 by Zhipu AI (Z.ai). It is specifically designed to bridge the gap between traditional OCR (character recognition) and full "Document Understanding" (layout, tables, and reasoning). View on Hugging Face
GLM-Z1-32B-0414
GLM-Z1-32B-0414 is a high-performance, open-source reasoning model with 32 billion parameters, released by the zai-org group View on Hugging Face
gpt-oss-20b
gpt-oss-20b is an open-weight language model released by OpenAI in August 2025 under the Apache 2.0 license. It is the smaller of the two gpt-oss models, alongside gpt-oss-120b, and targets lower-latency and local deployments. View on Hugging Face
gpt-oss-safeguard-20b
GPT-OSS-Safeguard-20B is an open-weight, safety-focused reasoning model View on Hugging Face
Holo3-35B-A3B
Holo3 is our latest generation of large-scale Vision-Language Models (VLMs) specifically optimized for GUI Agents. View on Hugging Face
Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated
Model tuned for Claude-style reasoning with reduced safety restrictions View on Hugging Face
reasoning
Huihui-Qwen3.5-35B-A3B-abliterated
This is an uncensored version of Qwen/Qwen3.5-35B-A3B created with abliteration View on Hugging Face
Huihui-Qwen3.6-35B-A3B-abliterated
Huihui-Qwen3.6-35B-A3B-abliterated (released April 19, 2026) is a specialized variant of Alibaba's latest Qwen3.6 MoE model View on Hugging Face
Inferx-bundle-Qwen3.6-35B-A3B-FP8-Qwen3-Embedding-0.6B-Qwen3-Reranker-0.6B
A bundle of Qwen3.6-35B-A3B-FP8, Qwen3-Embedding-0.6B, and Qwen3-Reranker-0.6B View on Hugging Face
InnerVerse-GLM47Flash-v1
A fast, reasoning-focused model optimized for efficient inference and strong instruction following View on Hugging Face
coding reasoning
IntelliAsk-Qwen3-32B-450-Merged
IntelliAsk-Qwen3-32B-450-Merged View on Hugging Face
InternVL3_5-38B-FP8-Dynamic
InternVL3.5-38B-FP8-Dynamic is a state-of-the-art multimodal large language model (MLLM) optimized for high-efficiency inference View on Hugging Face
InternVL3_5-38B-Instruct
InternVL3.5-38B-Instruct is an advanced multimodal large language model (MLLM) released in late 2025 by Shanghai AI Laboratory. View on Hugging Face
InternVL3_5-8B
InternVL3.5-8B-Instruct is the latest state-of-the-art multimodal large language model (MLLM) from OpenGVLab (Shanghai AI Lab). View on Hugging Face
Kimi-Linear-48B-A3B-Instruct-AWQ-8bit
Kimi-Linear-48B-A3B-Instruct-AWQ-8bit is a high-efficiency, long-context model released by Moonshot AI in late 2025. It represents a significant departure from standard Transformer architectures, specifically designed to eliminate the "quadratic bottleneck" that usually slows down long-context processing. View on Hugging Face
Kimi-VL-A3B-Thinking-2506
Kimi-VL-A3B-Thinking-2506 is a state-of-the-art vision-language model (VLM) released by Moonshot AI in mid-2025. View on Hugging Face
L3.3-70B-Loki-V2.0
Llama-based model tuned for immersive roleplay, storytelling, and strong narrative consistency View on Hugging Face
long-context
Llama-3.1-8B-Instruct
Llama-3.1-8B-Instruct is the lightweight, instruction-tuned variant of Meta’s Llama 3.1 family. View on Hugging Face
Llama-3.3-70B-Instruct-AWQ
Llama-3.3-70B-Instruct-AWQ is the 4-bit quantized version of Meta's December 2024 flagship "efficiency" model. View on Hugging Face
Magistral-Small-2509-AWQ-4bit
Magistral-Small-2509-AWQ-4bit is the 4-bit quantized version of Mistral AI's Magistral Small 1.2 View on Hugging Face
medgemma-27b-text-it-FP8-Dynamic
MedGemma-27B-Text-IT-FP8-Dynamic is an FP8 Dynamic–quantized derivative of Google’s MedGemma-27B-Text-IT model, optimized for high-throughput inference while preserving strong performance on medical and biomedical instruction-tuned text-only tasks. View on Hugging Face
Midnight-Miqu-70B-v1.5-FP8-Dynamic
Midnight-Miqu-70B-v1.5 is a high-performance 70B parameter model specifically engineered for creative writing, long-form roleplay, and complex character interactions. It is a "DARE Linear" merge of Midnight-Miqu-v1.0 and Tess-v1.6, designed to retain the legendary prose quality of the original "Miqu" (the leaked Mistral-70B weights) while improving instruction following and world-state tracking. View on Hugging Face
Ministral-3-14B-Reasoning-2512
The Ministral-3-14B-Reasoning-2512 (often referred to as part of the "Les Ministraux" family) is one of Mistral AI's most sophisticated "mid-weight" models. It is specifically engineered to bridge the gap between low-latency edge computing and the deep reasoning capabilities typically reserved for massive 70B+ parameter models. View on Hugging Face
Ministral-3-8B-Instruct-2512-BF16
Ministral-3-8B-Instruct-2512-BF16 (released in December 2025/January 2026) is the newest "edge-sovereign" multimodal model from Mistral AI. View on Hugging Face
Mistral-Small-24B-Instruct-2501
Mistral-Small-24B-Instruct-2501 (often referred to as Mistral Small 3) is a high-efficiency language model released in late January 2025. View on Hugging Face
Mixtral-8x7B-Instruct-v0.1
Mixtral-8x7B-Instruct-v0.1 is a high-performance Sparse Mixture-of-Experts (SMoE) model released by Mistral AI. View on Hugging Face
Molmo2-4B
Molmo2-4B is a highly efficient, small-scale Vision-Language Model (VLM) View on Hugging Face
Molmo2-8B
A multimodal model optimized for image and video understanding with strong grounding and reasoning capabilities View on Hugging Face
multimodal
Moonlight-16B-A3B
Moonlight-16B-A3B is a high-efficiency Mixture-of-Experts (MoE) language model released in February 2025 by Moonshot AI (the creators of Kimi). It was designed to push the "Pareto frontier"—delivering the reasoning power of much larger models while maintaining the inference speed and VRAM footprint of a small model View on Hugging Face
NextCoder-14B
NextCoder-14B is a specialized large language model designed for code editing and modification View on Hugging Face
NextCoder-7B
NextCoder-7B is a specialized, open-weights large language model (LLM) developed by Microsoft View on Hugging Face
notux-8x7b-v1-AWQ
Notux-8x7b-v1-AWQ is a high-performance, 4-bit quantized version of the Notux-8x7b-v1 model. It combines a state-of-the-art Mixture-of-Experts (MoE) architecture with Activation-aware Weight Quantization (AWQ) for efficient deployment on NVIDIA GPUs. View on Hugging Face
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Optimized with FP8 for fast, efficient reasoning and chat workloads. View on Hugging Face
reasoning high-throughput agentic
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8-temp
test View on Hugging Face
NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 is a cutting-edge multimodal model released by NVIDIA in late 2025. It is specifically engineered for high-throughput, low-latency applications like document intelligence and long-form video understanding. View on Hugging Face
NVIDIA-Nemotron-Nano-9B-v2
NVIDIA-Nemotron-Nano-9B-v2 is a 9-billion-parameter hybrid language model designed for high-efficiency reasoning and agentic workflows. View on Hugging Face
Olmo-3.1-32B-Think-AWQ-4bit
OLMo-3.1-32B-Think-AWQ-4bit is a high-efficiency, reasoning-optimized version of the OLMo 3.1 family View on Hugging Face
OpenEuroLLM-Czech-vLLM-GGUF
A Czech-language model optimized for local inference using the GGUF format View on Hugging Face
OpenThinker3-7B
OpenThinker3-7B is a state-of-the-art open-source reasoning model View on Hugging Face
OpenThinker-Agent-v1
OpenThinker-Agent-v1 is a state-of-the-art, 8-billion parameter open-source model specifically engineered for terminal automation and software engineering tasks. View on Hugging Face
Phi-3.5-vision-instruct
Phi-3.5-vision-instruct is a lightweight, multimodal small language model (SLM) released by Microsoft. View on Hugging Face
Phi-4-mini-reasoning
Phi-4-mini-reasoning is a compact, open-weight reasoning model from Microsoft, designed to bring high-level logical and mathematical "thinking" to small-scale hardware. View on Hugging Face
Qianfan-OCR
Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture. View on Hugging Face
Qwen2.5-14B-Instruct
A 14B instruction-tuned model for strong reasoning, coding, and chat tasks. View on Hugging Face
coding reasoning chat
Qwen2.5-32B-Instruct-AWQ
High-quality instruction-tuned model quantized with AWQ for efficient, lower-memory inference. View on Hugging Face
reasoning coding chat
Qwen2.5-7B-Instruct
Qwen2.5-7B-Instruct is part of Alibaba Cloud’s latest generation of large language models, released as an evolution of the Qwen2 series. View on Hugging Face
Qwen2.5-7B-Instruct-Test
A test model for a special image; not for public use. View on Hugging Face
Qwen2.5-Coder-0.5B
A lightweight coding model for fast, efficient code generation and debugging View on Hugging Face
coding
Qwen2.5-Coder-1.5B-Instruct
A lightweight coding model for fast, low-cost code generation and editing. View on Hugging Face
coding agentic
Qwen2.5-VL-32B-Instruct-AWQ
The Qwen2.5-VL-32B-Instruct-AWQ is a high-performance, vision-language model optimized for efficient inference. It represents a significant step up in complexity and reasoning from the 7B models, sitting in the "heavyweight" class that typically requires multi-GPU setups or advanced quantization to run at interactive speeds. View on Hugging Face
Qwen2.5-VL-7B-Instruct
Qwen2.5-VL-7B-Instruct is the latest iteration of Alibaba Cloud’s vision-language models, released in early 2025. View on Hugging Face
Qwen2-VL-7B-Instruct
The Qwen2-VL-7B-Instruct is a cornerstone model in the second generation of Alibaba's Vision-Language (VL) series. It was released as a major upgrade to the original Qwen-VL View on Hugging Face
Qwen3-14B
A 14B general-purpose model designed for strong reasoning, coding, and chat performance. View on Hugging Face
reasoning coding chat
Qwen3-32B
A general-purpose model designed for strong reasoning, coding, and conversational performance. View on Hugging Face
reasoning chat coding
Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a specialized reasoning model released in March 2026. It is built on Alibaba's Qwen3.5-27B architecture and fine-tuned using high-density Chain-of-Thought (CoT) distillation from Anthropic’s Claude 4.6 Opus. View on Hugging Face
Qwen3.5-27B-FP8
Optimized with FP8 quantization for efficient, high-performance inference View on Hugging Face
reasoning coding chat
Qwen3.5-35B-A3B
A model designed for strong reasoning, coding, and agent workloads with improved efficiency. View on Hugging Face
reasoning coding chat
Qwen3.5-35B-A3B-FP8
Optimized with FP8 for efficient, high-performance reasoning and chat workloads View on Hugging Face
high-throughput reasoning coding agentic
Qwen3.5-35B-A3B-GPTQ-Int4
The Qwen3.5-35B-A3B-GPTQ-Int4 is a specialized, highly optimized version of the Qwen3.5 model family. View on Hugging Face
Qwen3.5-4B
Qwen3.5-4B is a compact, natively multimodal model released by Alibaba Cloud in February 2026. View on Hugging Face
Qwen3.5-9B
A 9B general-purpose model designed for strong reasoning, coding, and chat performance View on Hugging Face
reasoning coding chat
Qwen3.5-9B-AWQ
A 9B Qwen3.5 model quantized with AWQ for efficient, low-memory inference. View on Hugging Face
reasoning coding chat
Qwen3.5-9B-NVFP4
A 9B Qwen3.5 model quantized to NVFP4 for ultra-efficient, low-memory inference. View on Hugging Face
reasoning coding chat
Qwen3.6-27B
Qwen3.6-27B View on Hugging Face
Qwen3.6-27B-FP8
Qwen3.6-27B-FP8 View on Hugging Face
Qwen3.6-35B-A3B
Qwen3.6-35B-A3B (released April 14, 2026) is the first open-weight model of the Qwen3.6 series. View on Hugging Face
Qwen3.6-35B-A3B-AWQ
The QuantTrio/Qwen3.6-35B-A3B-AWQ is a high-performance, 4-bit quantized version of the Qwen3.6-35B-A3B model View on Hugging Face
Qwen3.6-35B-A3B-FP8
Qwen3.6-35B-A3B-FP8 (officially released on April 16, 2026) is the first natively quantized FP8 variant of the Qwen3.6 series. View on Hugging Face
Qwen3-ASR-1.7B
The Qwen3-ASR family includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding capability of their foundation model, Qwen3-Omni. View on Hugging Face
Qwen3-Coder-30B-A3B-Instruct-1M-GGUF
Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency. View on Hugging Face
Qwen3-Coder-30B-A3B-Instruct-FP8
Qwen3-Coder-30B-A3B-Instruct-FP8 is a specialized, high-efficiency model released in late 2025/early 2026. It is designed for agentic coding—tasks where the AI acts as an autonomous developer, interacting with environments and tools View on Hugging Face
Qwen3-Coder-Next-AWQ-4bit
This is a 4-bit AWQ quantized version of Qwen3-Coder-Next, an open-weight language model designed specifically for coding agents and local development. View on Hugging Face
Qwen3-Coder-Next-FP8
Qwen3-Coder-Next-FP8 is an open-weight language model designed specifically for coding agents and local development. View on Hugging Face
Qwen3-TTS-12Hz-1.7B-Base
A lightweight text-to-speech model designed for efficient, high-quality speech synthesis View on Hugging Face
audio speech
Qwen3-TTS-12Hz-1.7B-CustomVoice
The Qwen3-TTS-12Hz-1.7B-CustomVoice represents a specialized, highly efficient iteration of Alibaba’s Qwen (Tongyi Qianwen) ecosystem, specifically tuned for Neural Text-to-Speech (TTS). At 1.7 billion parameters, it sits in the "Edge-AI" category—powerful enough to capture human-like prosody and emotion, but small enough to run with extremely low latency on local hardware or mobile devices. View on Hugging Face
Qwen3-TTS-12Hz-1.7B-VoiceDesign
Lightweight text-to-speech model designed for customizable voice generation View on Hugging Face
speech
Qwen3-VL-30B-A3B-Instruct
Qwen3-VL-30B-A3B-Instruct is a state-of-the-art multimodal Large Vision-Language Model (LVLM) released by the Qwen team (Alibaba Cloud) in late 2025. View on Hugging Face
Qwen3-VL-32B-Instruct-FP8
Qwen3-VL-32B-Instruct-FP8 represents the pinnacle of mid-sized multimodal intelligence from Alibaba Cloud (Qwen Team). View on Hugging Face
Qwen-Image-2512
Enhanced Human Realism: Qwen-Image-2512 significantly reduces the “AI-generated” look and substantially enhances overall image realism, especially for human subjects. View on Hugging Face
Qwen/Qwen3-32B-AWQ
A 32B Qwen3 model quantized with AWQ for efficient, high-performance inference View on Hugging Face
reasoning coding chat
Qwopus3.5-27B-v3
Jackrong/Qwopus3.5-27B-v3 is a highly specialized, reasoning-distilled version of the Qwen3.5-27B base model. View on Hugging Face
rnj-1-instruct-AWQ-8bit
rnj-1-instruct-AWQ-8bit is the 8-bit quantized version of Rnj-1 Instruct, an elite 8-billion parameter agentic coding model released by Essential AI in late 2025. View on Hugging Face
RolmOCR
RolmOCR is an open-source, high-performance document OCR model developed by Reducto AI as a lighter and faster alternative to Allen Institute for AI's olmOCR. View on Hugging Face
Seed-OSS-36B-Instruct-AWQ
Seed-OSS-36B-Instruct-AWQ is a 4-bit quantized version of ByteDance’s Seed-OSS-36B, a mid-sized but extremely powerful open-source model released in August 2025. View on Hugging Face
stable-diffusion-3.5-medium
Stable Diffusion 3.5 Medium (SD 3.5 Medium) is a state-of-the-art text-to-image model released by Stability AI View on Hugging Face
Step3-VL-10B
Step3-VL-10B is an open-source multimodal large language model (MLLM) released in January 2026 by StepFun (Stepwise Star). View on Hugging Face
Strand-Rust-Coder-14B-v1
The model fine-tunes Qwen2.5-Coder-14B for Rust-specific programming tasks using a 191K-example synthetic dataset built via multi-model generation and peer-reviewed validation. View on Hugging Face
translategemma-27b-it-FP8-Dynamic
Multilingual translation model optimized with FP8 for fast, memory-efficient inference View on Hugging Face
low-latency
VibeVoice-ASR
A speech recognition model designed for accurate and efficient audio-to-text transcription View on Hugging Face
speech audio low-latency agentic
Z-Image-Turbo
Z-Image-Turbo is a 6-billion parameter text-to-image model released by Alibaba's Tongyi Lab (the team behind Qwen) in late 2025. It was specifically engineered to challenge the dominance of larger models like FLUX.1 by prioritizing extreme inference speed and bilingual text rendering without sacrificing photorealism. View on Hugging Face