vLLM

Name: vLLM
Author: Sky Computing Lab at UC Berkeley

Install with PodWarden Learn how to deploy with PodWarden

High-throughput LLM inference engine with PagedAttention optimization, OpenAI-compatible API, and multi-GPU support for efficient model serving.

AI / Machine LearningFree·14.1M24212d ago2 deploys

#machine-learning #tensor-parallel #gpu #llm #model-serving #high-throughput #serving #transformers #vllm #inference #openai-compatible #quantization

Learn how to self-host

Learn how to deploy with PodWarden

About

vLLM is a state-of-the-art, open-source inference engine developed by the UC Berkeley Sky Computing Lab that specializes in high-throughput, memory-efficient serving of large language models. Unlike traditional LLM serving systems that struggle with memory fragmentation and laten…