
Design an LLM Serving Infrastructure

System Design · hard · Common

Tags: llm-serving · gpu-management · auto-scaling · distributed-systems

Reported: 8 times

Last seen: 2026-03-25

First seen: 2025-07-20

Active in: 2025, 2026

Description

Design a system to serve large language models at scale. Handle batching, GPU memory management, model versioning, and auto-scaling.
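The core scheduling idea worth knowing cold is continuous (iteration-level) batching: finished sequences leave the batch after every decode step and waiting requests fill the freed slots immediately, instead of the whole batch draining before new work is admitted. A toy sketch of that loop, assuming an illustrative scheduler (all class and field names here are hypothetical, not any particular serving framework):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0


class ContinuousBatcher:
    """Toy iteration-level scheduler: sequences join and leave the
    running batch at decode-step granularity."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list[Request]:
        # Admit waiting requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step: every running sequence emits one token.
        for req in self.running:
            req.generated += 1
        # Evict finished sequences now, freeing slots for the next step.
        finished = [r for r in self.running if r.generated >= r.max_new_tokens]
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished
```

In a real server the `step` loop would also reserve KV-cache blocks before admitting a request; the point of the sketch is only that admission and eviction happen per decode step, which is what keeps GPU utilization high under mixed output lengths.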

Approach Tips

Discuss continuous batching, KV-cache management, and how to handle different model sizes. Cover GPU memory fragmentation and model sharding.
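A back-of-envelope KV-cache sizing calculation usually anchors the memory and fragmentation discussion. A minimal sketch, assuming an illustrative 7B-class configuration (the layer/head numbers below are hypothetical, not a specific model):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache one token occupies: K and V tensors (factor 2),
    one per layer, each holding num_kv_heads * head_dim values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative 7B-class config: 32 layers, 32 KV heads,
# head dim 128, fp16 cache (2 bytes per value).
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(per_token)                 # 524288 bytes = 512 KiB per token
print(per_token * 4096 / 2**30)  # a full 4096-token context ≈ 2.0 GiB
```

At half a mebibyte per token, naively preallocating each sequence's maximum context fragments GPU memory quickly, which is the motivation for paged/block-based KV allocation and for architectures that shrink `num_kv_heads` (grouped-query attention).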

Sources

Blind·SDE-3·2026-03-25
Glassdoor·Staff·2025-12-15
Company: Anthropic (AI)

Typically appears in: Onsite - System Design

60 min — Design an ML infrastructure or AI-related system. Focus on scalability and reliability.