Show HN: How We Run 60 Hugging Face Models on 2 GPUs https://ift.tt/rVzBu9g

Most open-source LLM deployments assume one model per GPU. That works if traffic is steady. In practice, many workloads are long-tail or intermittent, which means GPUs sit idle most of the time.

We experimented with a different approach. Instead of pinning one model to one GPU, we:

• Stage model weights on fast local disk
• Load models into GPU memory only when requested
• Keep a small working set resident
• Evict inactive models aggressively
• Route everything through a single OpenAI-compatible endpoint

In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed. Cold starts still exist, and larger models take seconds to restore, but by avoiding warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.

Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk

Live demo to play with: https://ift.tt/WXKzONw

If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I'm happy to provide temporary access for testing.
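To make the load-on-demand and eviction idea concrete, here is a rough sketch of what such a resident-model cache can look like, using Hugging Face transformers and a simple LRU policy. This is illustrative only: the ModelCache class, the max_resident budget, and the example model id are placeholders, not our actual gateway code.

    # Illustrative sketch only: a simple LRU cache of resident models using
    # Hugging Face transformers. Class names, the max_resident budget, and
    # the example model id are placeholders, not production gateway code.
    from collections import OrderedDict

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    class ModelCache:
        """Keep at most max_resident models in VRAM; evict least recently used."""

        def __init__(self, max_resident=3, device="cuda:0"):
            self.max_resident = max_resident
            self.device = device
            self._resident = OrderedDict()  # model_id -> (model, tokenizer)

        def get(self, model_id):
            # Hit: mark as most recently used and return immediately.
            if model_id in self._resident:
                self._resident.move_to_end(model_id)
                return self._resident[model_id]

            # Miss: evict least-recently-used models until there is room.
            while len(self._resident) >= self.max_resident:
                _, (evicted_model, _) = self._resident.popitem(last=False)
                del evicted_model            # drop the last reference
                torch.cuda.empty_cache()     # hand freed VRAM back to the allocator

            # Cold start: restore weights from the local disk cache into VRAM.
            tokenizer = AutoTokenizer.from_pretrained(model_id)
            model = AutoModelForCausalLM.from_pretrained(
                model_id, torch_dtype=torch.float16
            ).to(self.device)
            self._resident[model_id] = (model, tokenizer)
            return self._resident[model_id]

    cache = ModelCache(max_resident=3)
    model, tokenizer = cache.get("Qwen/Qwen2.5-0.5B-Instruct")  # any HF causal LM id

Staging weights on fast local disk matters here because the from_pretrained call on a cache miss is exactly the cold-start cost mentioned above.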
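On the client side, everything goes through one OpenAI-compatible endpoint, so switching models is just a matter of changing the model field. A minimal example with the openai Python client follows; the base URL, API key, and model id are placeholders for whatever the gateway actually exposes.

    # Illustrative client-side sketch, assuming the openai Python package.
    # The base URL, API key, and model id are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # The gateway maps the model field to one of the staged Hugging Face
    # models, restoring it into VRAM first if it is not currently resident.
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)

The first request to a non-resident model pays the restore cost; subsequent requests hit the warm copy until it is evicted.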

