vLLM

High-throughput LLM inference engine with PagedAttention optimization, OpenAI-compatible API, and multi-GPU support for efficient model serving.

AI / Machine Learning · Free · 14.1M · 242 · 19d ago

About

vLLM is a state-of-the-art, open-source inference engine developed by the UC Berkeley Sky Computing Lab that specializes in high-throughput, memory-efficient serving of large language models. Unlike traditional LLM serving systems that struggle with memory fragmentation and latency…
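The core of vLLM's memory efficiency is PagedAttention, which manages the KV cache in fixed-size blocks rather than reserving one contiguous region per request. A minimal sketch of that block-table idea follows; the names (`BlockAllocator`, `Sequence`) and the block size are illustrative assumptions, not vLLM's actual internals.

```python
# Illustrative sketch of PagedAttention-style KV-cache paging.
# Class and variable names are hypothetical, not vLLM's real API.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (assumed value)

class BlockAllocator:
    """Pool of physical KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the current one fills,
        # so memory grows in BLOCK_SIZE chunks instead of being reserved
        # up front for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # decode 40 tokens
    seq.append_token()
print(len(seq.block_table))  # 40 tokens occupy ceil(40/16) = 3 blocks
```

Because unused blocks stay in the shared pool, many concurrent requests can be packed into the same GPU memory, which is what enables the engine's high serving throughput.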

Deployment Options

1 stack
