Overview
Pluralis Research is pioneering Protocol Learning, a fully decentralised way to train and deploy AI models that opens this layer to individuals rather than only well‑resourced corporations. By pooling compute from many participants, incentivising their efforts, and preventing any single party from controlling a model’s full weights, we’re creating a genuinely open, collaborative path to frontier‑scale AI. We’re looking for a Senior Platform Engineer with startup experience, or senior DevOps experience in big tech, who is passionate about ML and can help scale and own our systems infrastructure, orchestration, and services integration.
Responsibilities
- Multi‑Cloud Infrastructure: Design resource management systems that provision and orchestrate compute across AWS, GCP, and Azure using infrastructure‑as‑code (Pulumi / Terraform). Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.
- Distributed Training Systems: Architect fault‑tolerant infrastructure for distributed ML: GPU clusters, NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.
- Real‑World Networking: Build systems that simulate and handle real‑world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity. Our training runs on consumer nodes and non‑co‑located infrastructure, not in a datacenter.
What You’ll Bring
- Infrastructure‑as‑Code: Production Pulumi / Terraform / CloudFormation managing multi‑cloud deployments, lifecycle orchestration, automated provisioning, self‑healing systems at scale.
- Python Engineering: Idiomatic async Python with error handling, retry logic, concurrent execution. asyncio, SSH libraries, cloud SDKs, CLI tools.
- Container & GPU: Docker, Kubernetes / EKS, GPU workloads, heterogeneous clusters, multi‑GPU optimisation, resource scheduling.
- Networking: Decentralised topologies and routing, NAT hole punching, P2P multi‑address coordination, traffic shaping, real‑world bandwidth constraints.
- ML Infrastructure: Distributed training workflows, checkpoint management, data sharding, model versioning, long‑running job operations.
- Observability & SRE: Monitoring systems (Prometheus / Grafana), logging, SLOs, incident response, bottleneck profiling, performance optimisation.
What We’re Looking For
- Experience in a startup environment with an emphasis on micro‑services orchestration, or a big‑tech background.
- Deep understanding of multi‑cloud infrastructure and distributed training systems.
- A team player with high attention to detail.
- A strong passion for working at the intersection of AI and decentralised systems.
Backed by Union Square Ventures and other tier‑1 investors, we’re a world‑class, deeply technical team of ML researchers. Pluralis is unapologetically ideological. We believe the world will be a better place if we succeed in what we are attempting, and we see Protocol Learning as the only plausible approach to preventing a handful of massive corporations from monopolising model development, access and release, and achieving massive economic capture. If this resonates, please apply.
Seniority level: Mid‑Senior. Employment type: Full‑time. Job function: Engineering and Information Technology.