Paddler: Scaling Open-Source LLMs
Learn how Paddler replaces llama-server with a Rust-based infrastructure layer, offering load-balanced token generation, dynamic model swapping, and production‑ready stability for large-scale deployments.
I’m working on several products built on open-source LLMs, and I needed something that would let me run inference, host the models on my own servers, and handle user traffic correctly. To achieve that, I created and open-sourced a tool called Paddler. The tool relies on llama.cpp, but only for generating tokens and embeddings. I built Paddler’s infrastructure layer myself to make it distributed and usable in production.
In the demo, I would like to focus on the technical aspects of how I created that infrastructure layer, particularly on how and why I rewrote llama-server in Rust and created my own implementation of llama slots. I needed this to control the traffic correctly and to support load balancing and request buffering. My own infrastructure layer also gives me better stability: I’m not dependent on changes in llama.cpp, and I can add semantic versioning (which llama-server doesn’t have). I can also add an inference engine other than llama.cpp in the future, e.g., Candle or vLLM.
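To make the idea of slots, load balancing, and request buffering concrete, here is a minimal Rust sketch. It is not Paddler's actual code: the names `Agent`, `Balancer`, and `Dispatch`, the "most idle slots wins" policy, and the bounded buffer are all assumptions chosen for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical worker that wraps an inference engine and exposes a fixed number of slots.
#[derive(Debug)]
struct Agent {
    name: String,
    slots_total: usize,
    slots_busy: usize,
}

impl Agent {
    fn new(name: &str, slots_total: usize) -> Self {
        Agent { name: name.to_string(), slots_total, slots_busy: 0 }
    }

    fn slots_idle(&self) -> usize {
        self.slots_total - self.slots_busy
    }
}

/// Outcome of handing one request to the balancer.
#[derive(Debug, PartialEq)]
enum Dispatch {
    Assigned { request_id: u64, agent: String },
    Buffered { request_id: u64, queue_position: usize },
    Rejected { request_id: u64 },
}

/// Assumed policy: route to the agent with the most idle slots,
/// buffer when every slot is busy, reject when the buffer is full.
struct Balancer {
    agents: Vec<Agent>,
    buffer: VecDeque<u64>,
    max_buffered: usize,
}

impl Balancer {
    fn new(agents: Vec<Agent>, max_buffered: usize) -> Self {
        Balancer { agents, buffer: VecDeque::new(), max_buffered }
    }

    fn submit(&mut self, request_id: u64) -> Dispatch {
        let idlest = self
            .agents
            .iter_mut()
            .filter(|a| a.slots_idle() > 0)
            .max_by_key(|a| a.slots_idle());
        if let Some(agent) = idlest {
            agent.slots_busy += 1;
            return Dispatch::Assigned { request_id, agent: agent.name.clone() };
        }
        if self.buffer.len() < self.max_buffered {
            self.buffer.push_back(request_id);
            Dispatch::Buffered { request_id, queue_position: self.buffer.len() }
        } else {
            Dispatch::Rejected { request_id }
        }
    }

    /// Frees one slot on `agent_name` and, if a request is waiting, assigns
    /// the oldest buffered request to that slot. Returns `None` when the
    /// agent is unknown or nothing is buffered.
    fn release(&mut self, agent_name: &str) -> Option<Dispatch> {
        let agent = self.agents.iter_mut().find(|a| a.name == agent_name)?;
        agent.slots_busy = agent.slots_busy.saturating_sub(1);
        let request_id = self.buffer.pop_front()?;
        agent.slots_busy += 1;
        Some(Dispatch::Assigned { request_id, agent: agent.name.clone() })
    }
}
```

With two agents holding two slots and one slot, and `max_buffered` set to 1, the first three calls to `submit` are assigned, the fourth is buffered, and the fifth is rejected. Calling `release` on an agent then hands the buffered request to the freed slot. A real balancer would also need timeouts, health checks, and concurrency control, which this sketch leaves out.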
Another aspect of the tool is that it lets you swap models dynamically, without restarting the setup. This matters when using open-source models because you can pick the right model for a given task and save resources that way.
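One common way to swap a model without a restart is to keep the active model behind a shared, replaceable pointer. The sketch below shows that pattern using only the Rust standard library. It is an assumption about how such a swap could work, not a description of Paddler's internals; `LoadedModel`, `ModelHandle`, and the model names are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for model weights loaded into memory.
#[derive(Debug)]
struct LoadedModel {
    name: String,
}

/// Holds the currently active model. Requests clone the `Arc`, so a swap
/// never invalidates a model that an in-flight request is still using.
struct ModelHandle {
    current: RwLock<Arc<LoadedModel>>,
}

impl ModelHandle {
    fn new(name: &str) -> Self {
        let model = Arc::new(LoadedModel { name: name.to_string() });
        ModelHandle { current: RwLock::new(model) }
    }

    /// Called at the start of a request: returns a shared reference to
    /// whichever model is active right now.
    fn acquire(&self) -> Arc<LoadedModel> {
        // `unwrap` panics if another thread panicked while holding the lock.
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replaces the active model and returns the previous one. The old model
    /// is dropped once the last in-flight request releases its `Arc`.
    fn swap(&self, name: &str) -> Arc<LoadedModel> {
        let next = Arc::new(LoadedModel { name: name.to_string() });
        let mut guard = self.current.write().unwrap();
        std::mem::replace(&mut *guard, next)
    }
}
```

A request that called `acquire` before the swap keeps generating with the old model, while every later `acquire` sees the new one. The old model's memory is released only when its reference count reaches zero, so the process needs enough headroom to hold both models briefly.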
Paddler: LLMOps platform for self-hosting and scaling `llama.cpp` LLMs.
Paddler: Open-source LLMOps platform enabling self-hosted AI model scaling in your infrastructure.