Session Abstracts
10:00 - 10:30
Registration
10:30 - 10:40
Welcome Speech
10:40 - 11:10
Getting started with vLLM: The leading open‑source LLM inference engine for Private AI
Whether you are new to vLLM or already have experience working with it, this session will provide both an introduction and a refresher on vLLM. You will learn what vLLM is and why it has emerged as one of the most successful LLM inference and serving engines in the world.
The session will cover the fundamentals of vLLM, along with an overview of the LLM optimization techniques it uses to significantly improve inference performance. Finally, it will highlight the current state of the vLLM community and ecosystem.
By the end of this session, attendees will have a solid foundational understanding of vLLM, preparing them to better grasp more advanced topics in the subsequent meetup sessions.
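For attendees who want to experiment before the session, a minimal offline-inference sketch using vLLM's Python API looks roughly like the following; the model name and prompts are illustrative examples, not material from the talk.

```python
# Minimal vLLM offline-inference sketch; model and prompts are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "What is vLLM?",
    "Explain continuous batching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Load the model once; vLLM manages GPU memory and batching internally.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```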
11:10 - 11:40
Multi-modal inference and deployment using vLLM
Multi-modality is the next frontier for language models, enabling them to see, hear, and interact with our world. But how do we serve these complex models efficiently at scale? This talk examines the unique inference demands of multi-modal large language models (MLLMs) and how vLLM presents a compelling solution. We'll move from theory to practice, showcasing how vLLM's integrated tooling enables developers to benchmark and fine-tune their deployments to meet real-world service level agreements (SLAs). Discover how to turn cutting-edge MLLM research into robust, production-ready applications.
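As a rough preview of what a multi-modal deployment can look like, the sketch below sends an image-plus-text request to a model exposed through vLLM's OpenAI-compatible API (for example, one started with `vllm serve`); the model name, endpoint, and image URL are assumptions for illustration, not details from the talk.

```python
# Hypothetical client-side sketch: querying a vision-language model served by
# vLLM's OpenAI-compatible server; model name, endpoint, and image URL are
# placeholders for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # example multi-modal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```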
11:40 - 12:10
LLM inference on AMD GPUs: A Technical Deep Dive
This session explores the deployment of LLM inference on AMD GPUs, covering optimization strategies across the stack—from kernel-level techniques to framework integrations.
12:10 - 12:15
Group Photo
12:15 - 13:15
AMD Hands-on Workshop: Minimax M2 Agent Tutorial
A hands-on workshop guiding participants through deploying the Minimax M2 AI agent on AMD GPUs. The workshop covers hardware setup, model configuration, performance optimization, and real-world deployment scenarios, and is ideal for developers and ML engineers seeking practical experience with AMD GPU-based AI agent implementations.
(The hands-on cloud environment will be provided by AMD; participants are required to bring their own fully charged laptops.)
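For participants who want a feel for the workflow beforehand, a heavily simplified sketch of serving a large model with vLLM across multiple AMD GPUs is shown below; the Hugging Face repo name and parallelism settings are assumptions, and the actual workshop environment may differ.

```python
# Hypothetical sketch of multi-GPU serving with vLLM (ROCm build on AMD GPUs);
# the model repo name and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",  # assumed repo name, for illustration only
    tensor_parallel_size=8,        # split weights across 8 GPUs (example value)
    trust_remote_code=True,
)

out = llm.generate(
    ["Plan the steps an agent should take to book a flight."],
    SamplingParams(max_tokens=256),
)
print(out[0].outputs[0].text)
```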
13:15 - 14:15
Lunch & Networking
14:15 - 14:45
From Offline to Online Inference: Why Serving Is Hard—and How vLLM Helps
As large language models (LLMs) transition from research prototypes to production systems, inference becomes a primary bottleneck, and the gap between “it runs” and “it serves” widens quickly. While offline inference can be optimized for throughput, online inference must handle concurrency, variable-length prompts, and unpredictable workloads, making strong performance much harder to achieve.
In this talk, we begin with a quick introduction to LLM inference, then contrast the offline and online settings, highlighting why online serving is challenging. We then compare Ollama and vLLM on a small benchmark of 50 requests sampled from the Alpaca dataset, showing how vLLM can achieve significantly faster runtime. We also explain two core features behind vLLM, PagedAttention and continuous batching, and relate them to hardware behavior such as GPU utilization and memory usage. Finally, we briefly explore additional techniques that accelerate inference in practice, such as prompt engineering and speculative decoding.
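To make the offline/online contrast concrete, here is a minimal sketch of the kind of measurement the talk describes: firing a batch of concurrent requests at an OpenAI-compatible endpoint (for example, one started with `vllm serve`) and timing the whole run. The endpoint, model name, prompt set, and concurrency level are placeholders, not the talk's actual benchmark.

```python
# Minimal online-serving benchmark sketch: 50 concurrent requests against an
# OpenAI-compatible endpoint. Endpoint, model, and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompts = [f"Summarize instruction #{i} in one sentence." for i in range(50)]

def one_request(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(one_request, prompts))
print(f"{len(results)} requests finished in {time.time() - start:.1f}s")
```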
15:15 - 15:45
KVCache Practices at MiniMax for Agentic Workloads: From Traffic Characteristics to Architectural Insights
As Large Language Model (LLM) applications evolve toward the Agent paradigm, the challenges of context reuse within inference systems have become increasingly complex. Drawing from MiniMax’s large-scale production experience, this session will provide an in-depth analysis of the distinct differences in temporal locality and request patterns between Agentic workloads and traditional chat traffic. We will explore how these traffic characteristics fundamentally impact KVCache system design and discuss the future evolution of KVCache management paradigms under the dual constraints of limited resources and high-performance requirements.
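As a small illustration of the context-reuse problem the session addresses, the sketch below enables vLLM's automatic prefix caching so that requests sharing a long agent prompt can reuse the KV cache computed for that prefix; the model name and prompts are examples, and MiniMax's production setup is certainly more involved.

```python
# Sketch of KV-cache reuse via vLLM's automatic prefix caching; model name
# and prompts are illustrative, not MiniMax's production configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)

shared_prefix = (
    "You are an autonomous agent. Tools: search(query), read_file(path), "
    "write_file(path, text). Think step by step.\n\n"
)
tasks = [
    "Find the latest vLLM release notes.",
    "Summarize today's meeting notes.",
]
params = SamplingParams(max_tokens=128)

# Later requests can hit the KV cache built for `shared_prefix` by earlier ones.
for task in tasks:
    out = llm.generate([shared_prefix + "Task: " + task], params)
    print(out[0].outputs[0].text)
```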
15:45 - 16:15
vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving
Previously, vLLM focused on text generation only. vLLM-Omni, released in December 2025, expands the horizon to omni-modality comprehension and generation, including image, video, and audio. This session will provide an overview of vLLM-Omni, show some basic usage examples, walk through the key system design, and discuss the future roadmap. It will provide helpful information for users, developers, and contributors.