Abstract
As the demand for GPU memory from applications such as machine learning continues to grow exponentially, maximizing GPU memory capacity has become increasingly important. Unified Virtual Memory (UVM), which combines host and GPU memory into a single address space, allows GPUs to use more memory than their physical capacity. However, this advantage comes at the cost of significant overhead when accessing host memory. Although existing prefetching techniques help alleviate this overhead, they still struggle with irregular workloads and dynamic mixed workloads. In this paper, we demonstrate that the regularity of a workload is strongly correlated with the sharing status of UVM memory blocks among the Streaming Multiprocessors (SMs) of the GPU, which in turn affects the effectiveness of prefetching. Building on this observation, we propose the Sharing-Aware preFEtching technique, SAFE, which dynamically adjusts prefetching strategies based on the sharing status of the accessed memory blocks. SAFE efficiently tracks the sharing status of memory blocks by leveraging unified TLBs (uTLBs) and enforces a tailored prefetching configuration for each block. This approach requires no hardware modifications and incurs negligible performance overhead. Our evaluation shows that, for workloads with predominantly irregular memory access patterns, SAFE achieves up to a 6.5× performance improvement over the default UVM prefetcher, with an average improvement of 3.6×.
Keywords
GPU, Unified Virtual Memory, Prefetching.