
Gradient
Parallax
Gradient Network launches Parallax, a decentralized inference engine for AI models
Parallax introduces a distributed approach to large language model inference, enabling global collaboration and decentralized compute across heterogeneous devices. Gradient Network positions it as the foundation for a new open, community-driven AI infrastructure.
Georg S. Kuklick
•
November 9, 2025
Gradient Network has unveiled Parallax, a fully distributed inference engine designed to transform how large language models (LLMs) are served and scaled. The system reimagines inference as a global, collaborative process in which models are executed across a mesh of interconnected devices rather than centralized data centers.
Parallax aims to address the growing demand for compute power as AI models expand in size and complexity. By distributing inference tasks among consumer GPUs, Apple Silicon systems, and other edge devices, Parallax reduces reliance on enterprise-grade infrastructure. The company says this approach improves scalability and cost-effectiveness, strengthens data sovereignty, and lowers barriers to entry for developers and organizations.
The platform introduces three core shifts: intelligence sovereignty, which allows individuals to run advanced models locally without centralized control; composable collaborative inference, enabling shared execution across multiple machines; and latent compute utilization, which turns idle devices into active nodes in a global serving network.
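The "composable collaborative inference" idea can be illustrated with a toy pipeline in which a model's layers are partitioned into contiguous shards, each served by a different machine, with activations streamed from node to node. This is a minimal sketch of the general technique only; the class and function names (`Node`, `run_pipeline`) and the three-node split are hypothetical and do not reflect Parallax's actual API.

```python
# Illustrative sketch of layer-sharded, multi-machine inference.
# Names and the 24-layer / 3-node split are hypothetical assumptions.

from dataclasses import dataclass

@dataclass
class Node:
    """One participating device holding a contiguous slice of model layers."""
    name: str
    layer_range: tuple  # (start, end) layer indices this node serves

    def forward(self, activations: list) -> list:
        # Stand-in for running layers [start, end) on local hardware;
        # here we simply tag the activations with the work performed.
        start, end = self.layer_range
        return activations + [f"layers {start}-{end} on {self.name}"]

def run_pipeline(nodes: list, token_embedding: list) -> list:
    """Pass activations node-to-node, each node applying its layer shard."""
    acts = token_embedding
    for node in nodes:  # in a real system: peer-to-peer tensor streaming
        acts = node.forward(acts)
    return acts

# A hypothetical 3-node swarm splitting a 24-layer model:
swarm = [
    Node("consumer-gpu", (0, 8)),
    Node("apple-silicon", (8, 16)),
    Node("edge-box", (16, 24)),
]
trace = run_pipeline(swarm, ["<embedded prompt>"])
```

Because each node only needs memory for its own shard, devices too small to host the full model can still contribute, which is what lets idle consumer hardware join the serving network.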
Architecturally, Parallax combines NVIDIA GPU and Apple Silicon support within a unified serving runtime. It employs continuous batching and paged key–value cache management to maximize throughput and concurrency. The system’s communication layer relies on peer-to-peer tensor streaming via a distributed hash table (DHT), while a worker layer coordinates heterogeneous compute through a dual-platform design based on SGLang and a custom MLX-compatible runtime.
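Paged key–value cache management is the idea popularized by vLLM-style PagedAttention: instead of reserving one contiguous memory slab per request, the KV cache is split into small fixed-size blocks handed out on demand, which reduces fragmentation and lets more sequences be batched concurrently. The sketch below shows only the bookkeeping side of that technique; the class name, block size, and methods are illustrative assumptions, not Parallax internals.

```python
# Minimal bookkeeping sketch of a paged KV cache allocator.
# Block size and all names are illustrative assumptions.

class PagedKVCache:
    """Hands out fixed-size cache blocks to sequences on demand, so
    memory fragments less and more requests fit in one batch."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id: str) -> int:
        """Reserve KV space for one new token, grabbing a fresh block
        only when the sequence's last block is full."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # last block full, or no block yet
            if not self.free_blocks:
                raise MemoryError("cache exhausted; request must wait")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1]  # physical block holding this token's KV entry

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):  # caching 6 tokens needs ceil(6/4) = 2 blocks
    cache.append_token("req-1")
```

The payoff for continuous batching is that a finished request's blocks return to the pool immediately, so a newly arriving request can join the running batch without waiting for a large contiguous allocation.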
In benchmarks against the Petals distributed inference framework, Parallax achieved up to 5.3× lower inter-token latency and up to 3.1× higher overall throughput using the Qwen2.5-72B-Instruct-GPTQ-Int4 model. Performance remained stable across varying input lengths and batch sizes, suggesting the design can scale to real-world workloads.
The company also launched a closed beta chatbot powered by Parallax to demonstrate real-time decentralized inference. Each response is generated by a swarm of participating nodes rather than a centralized server. Gradient Network plans to open-source Parallax after production readiness, integrating it with its existing Lattica communication layer to form what it calls a “fully open, decentralized AI stack.”