Live Session
Hall 406 D
Paper
21 Sep, 14:00 SGT
Session 10: Reinforcement Learning
2023-09-21, 14:00 to 15:20 (Asia/Singapore), on the RecSys Hub: https://recsyshub.org
Research

InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

Kabir Nagrecha (University of California, San Diego), Lingyi Liu (Netflix, Inc.), Pablo Delgado (Netflix, Inc.) and Prasanna Padmanabhan (Netflix, Inc.)

Abstract

Deep learning-based recommendation models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion.

In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into the specific bottlenecks and challenges of the DLRM training pipeline at scale. We study real-world DLRM data processing pipelines taken from our compute cluster, both to observe the performance impacts of online ingestion and to identify shortfalls in existing data pipeline optimizers. We find that current tooling yields sub-optimal performance, crashes frequently, or requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute CPU resources across a DLRM data pipeline to more effectively parallelize data-loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves significantly higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus current state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.
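
To make the idea of learning a CPU allocation concrete, here is a toy sketch. It is not InTune's agent or code, and the abstract does not specify the agent's design. Everything here is an assumption for illustration: the stage names, the per-worker rates in `STAGE_RATES`, the bottleneck throughput model, and the simple epsilon-greedy value learner that moves one worker at a time between stages.

```python
import random

# Hypothetical per-worker throughput (batches/sec) for each pipeline stage.
# These numbers are invented for illustration only.
STAGE_RATES = {"read": 4.0, "decode": 1.0, "shuffle": 8.0, "batch": 2.0}
TOTAL_CPUS = 16  # assumed CPU budget shared by all stages


def throughput(alloc):
    """Toy model: the pipeline runs at the speed of its slowest stage."""
    return min(alloc[s] * STAGE_RATES[s] for s in STAGE_RATES)


def tune(steps=2000, epsilon=0.2, lr=0.1, seed=0):
    """Epsilon-greedy search that moves one worker between stages per step."""
    rng = random.Random(seed)
    stages = list(STAGE_RATES)
    # Start from an even split of the CPU budget.
    alloc = {s: TOTAL_CPUS // len(stages) for s in stages}
    # An action is (donor stage, receiver stage); keep one value estimate each.
    actions = [(a, b) for a in stages for b in stages if a != b]
    q = {act: 0.0 for act in actions}
    best_alloc, best_tp = dict(alloc), throughput(alloc)

    for _ in range(steps):
        # A donor must keep at least one worker after giving one away.
        valid = [act for act in actions if alloc[act[0]] > 1]
        if rng.random() < epsilon:
            act = rng.choice(valid)                # explore
        else:
            act = max(valid, key=lambda a: q[a])   # exploit
        before = throughput(alloc)
        donor, receiver = act
        alloc[donor] -= 1
        alloc[receiver] += 1
        # Reward is the change in simulated throughput caused by the move.
        reward = throughput(alloc) - before
        q[act] += lr * (reward - q[act])
        if throughput(alloc) > best_tp:
            best_alloc, best_tp = dict(alloc), throughput(alloc)

    return best_alloc, best_tp


if __name__ == "__main__":
    alloc, tp = tune()
    print("best allocation:", alloc)
    print("simulated throughput:", tp)
```

In a real pipeline the reward would come from measured ingestion throughput rather than a formula, and the agent would also need to react to changing load; this sketch only shows the allocate, measure, and adjust loop in miniature.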
