Abstract:
The rapid growth of Deep Learning (DL) has led to increasing demand for DL-as-a-Service. In this paradigm, DL inferences are served on demand through a serverless cloud provider, which scales hardware resources to satisfy dynamic workloads. This is enticing to businesses because infrastructure management costs are lower than with dedicated on-site hosting. However, current serverless systems suffer from long cold starts, where requests are queued until a server can be initialized with the DL model; this is especially problematic given the large sizes of DL models. In addition, low-latency applications such as real-time fraud detection and algorithmic trading impose tight deadlines that slow CPU-only inferences violate. To compensate, current systems over-provision expensive GPU resources to meet low-latency requirements, increasing the total cost of ownership for cloud service providers.
In this work, we characterize the cold start problem in GPU-accelerated serverless systems. We then design and evaluate novel solutions based on two main techniques: remote memory pooling, and hierarchical sourcing with locality-aware autoscaling. These techniques exploit underutilized memory and network resources to cache DL models and to source them preferentially from the local host's memory, then from remote hosts' pooled memory, and only then from cloud storage. We demonstrate through simulations that these techniques achieve up to 19.3× and 1.4× speedups in 99th-percentile and median end-to-end latency, respectively, compared to a baseline. Such speedups enable serverless systems to meet low-latency requirements despite dynamic workloads.
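To make the sourcing hierarchy concrete, the following is a minimal Python sketch of how a host might resolve a requested model across the three tiers described above (local host memory, the remote memory pool of peer hosts, then cloud storage). The function name and the dictionary-based tiers are illustrative assumptions for exposition, not the system's actual API or implementation.

```python
from typing import Dict, List


def fetch_model(
    model_id: str,
    local_memory: Dict[str, bytes],
    peer_memory_pools: List[Dict[str, bytes]],
    cloud_storage: Dict[str, bytes],
) -> bytes:
    """Source model weights from the fastest tier that currently holds them."""
    # Tier 1: local host memory -- the model is already resident, no transfer needed.
    if model_id in local_memory:
        return local_memory[model_id]

    # Tier 2: remote memory pool -- copy over the network from a peer host
    # that still holds the model in its otherwise underutilized memory.
    for pool in peer_memory_pools:
        if model_id in pool:
            weights = pool[model_id]
            local_memory[model_id] = weights  # warm the local tier for future requests
            return weights

    # Tier 3: cloud storage -- the slowest, always-available fallback.
    weights = cloud_storage[model_id]
    local_memory[model_id] = weights
    return weights


if __name__ == "__main__":
    cloud = {"resnet50": b"<weights>"}
    peers = [{"resnet50": b"<weights>"}]
    local: Dict[str, bytes] = {}
    fetch_model("resnet50", local, peers, cloud)  # served from a peer's memory pool
    fetch_model("resnet50", local, peers, cloud)  # now served from local memory
```

In this sketch, each successful fetch also populates the local tier, so subsequent requests for the same model avoid both the network transfer and the cloud download; how models are cached, evicted, and placed by the locality-aware autoscaler is determined by the actual system design.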