10:00 AM - 10:10 AM | Opening Remarks

10:10 AM - 11:10 AM | [Keynote] Building Performant and Portable Heterogeneous Code using GPU Compute Accelerators
Derek Bouius (AMD)
Abstract: This talk will describe the various methodologies used to offload computationally intensive workloads from CPUs to accelerators. Key topics will cover hardware architecture considerations and programming methodologies, along with debugging and profiling techniques.

11:10 AM - 11:30 AM | Break

Session 1 | Session Chair: Daniel Wong

11:30 AM - 11:50 AM | [Paper] Near LLC Versus Near Main Memory Processing
Hossein Bitalebi, Vahid Geraeinejad, Masoumeh Ebrahimi (KTH Royal Institute of Technology)
Abstract: Emerging advanced applications, such as deep learning and graph processing, with enormous processing demand and massive memory requests call for a comprehensive processing system or advanced solutions to address these requirements. Near data processing is one of the promising structures targeting this goal. However, most recent studies have focused on processing instructions near the main memory data banks while ignoring the benefits of processing instructions near other memory hierarchy levels such as the LLC. In this study, we investigate near LLC processing structures and compare them to the near main memory processing alternative, specifically in graphics processing units. We analyze these two structures on various applications in terms of performance and power. Results show a clear benefit of near LLC processing over near main memory processing in a class of applications. Further, we suggest a structure which could benefit from both approaches, requiring the applications to be characterized in advance or at runtime.

11:50 AM - 12:10 PM | [Paper] Accelerating Data Transfer between Host and Device using Idle GPU
Yuya Tatsugi, Akira Nukada (University of Tsukuba)
Abstract: When running single-GPU applications on multi-GPU compute nodes, the remaining GPU devices are kept idle. We propose a novel technique to accelerate these single-GPU applications using the idle GPU devices. Data transfers between host and device are performed not only by the first GPU but also by the second GPU, taking the alternative route through the PCI-Express and NVLink links attached to it. Our performance evaluations show that the proposed method achieves about twice the data transfer speed of the native single-GPU case for large data sizes.

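The general idea described above, splitting a host-to-device copy so that part of it is staged through a second, otherwise idle GPU and forwarded over the peer link, can be sketched with standard CUDA peer-to-peer calls. This is only an illustration of the approach, not the authors' implementation; the device IDs, buffer size, and even half/half split are arbitrary assumptions.

```
// Sketch: copy a large pinned host buffer to GPU 0, sending the second half
// through idle GPU 1 and forwarding it over the peer (NVLink/PCIe) link.
// Illustrative only; error checks omitted.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 1ul << 30;             // 1 GiB total payload (assumed)
    const size_t half = N / 2;

    char *h_buf;
    cudaMallocHost(&h_buf, N);              // pinned host memory

    // Destination buffer on GPU 0 and a staging buffer on the idle GPU 1.
    char *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, N);
    cudaSetDevice(1); cudaMalloc(&d1, half);

    // Enable direct peer access from GPU 1 to GPU 0 if the topology allows it.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) { cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0); }

    cudaStream_t s0, s1;
    cudaSetDevice(0); cudaStreamCreate(&s0);
    cudaSetDevice(1); cudaStreamCreate(&s1);

    // First half: ordinary host -> GPU 0 copy.
    cudaSetDevice(0);
    cudaMemcpyAsync(d0, h_buf, half, cudaMemcpyHostToDevice, s0);

    // Second half: host -> GPU 1, then GPU 1 -> GPU 0 over the peer link.
    cudaSetDevice(1);
    cudaMemcpyAsync(d1, h_buf + half, half, cudaMemcpyHostToDevice, s1);
    cudaMemcpyPeerAsync(d0 + half, 0, d1, 1, half, s1);

    cudaStreamSynchronize(s0);
    cudaSetDevice(1); cudaStreamSynchronize(s1);
    printf("transfer complete\n");
    return 0;
}
```
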
12:10 PM - 12:30 PM | [Invited Talk] Towards True Coherent Shared Memory for Next-generation Multi-GPU Systems
José L. Abellán (Universidad Católica de Murcia)
Abstract: Multi-GPU (MGPU) systems are commonly used today to accelerate a variety of workloads, including machine learning applications, graph applications, and large-scale simulations. However, the inefficiencies in terms of NUMA effects and the difficulties in programming due to the lack of hardware coherence in these MGPU systems call for new architectural solutions. To address these inefficiencies, in this talk I will present the first proposal of a true shared main memory (TSM) for MGPU systems, and then, to enable seamless sharing of data in an MGPU system with TSM (MGPU-TSM), I will introduce MGCC, a novel lightweight, scalable, timestamp-based coherence protocol. For standard benchmarks, an MGPU-TSM system (with 4 GPUs, using MGCC) implemented in the MGPUSim simulator performs on average 3.7x and 3.0x better with relaxed and sequential consistency, respectively, than non-coherent conventional MGPU systems with the same number of GPUs. In addition, compared to a coherent MGPU system using the state-of-the-art HMG coherence protocol, an MGPU system using MGCC achieves 2.4x higher performance. Finally, I will discuss some of the key challenges and open research directions to further optimize an MGPU-TSM system.

12:30 PM - 12:50 PM | [Invited Talk] Re-design GPU NoC and LLC System
Xia Zhao (Academy of Military Sciences)
Abstract: To provide high compute power, GPUs feature an increased number of SMs with a larger LLC size and higher memory bandwidth. Facing these new opportunities and challenges, how to design a scalable and high-performance NoC and LLC system becomes especially important. In this talk, I will introduce our recent work, including hierarchical NoC design for GPUs, adaptive memory-side last-level GPU caching, and selective replication in memory-side GPU caches. The hierarchical NoC provides a scalable and low-cost interconnection network connecting the SMs with the LLCs and memory controllers by fully exploiting the unique traffic pattern in GPUs. To address the contention caused by concurrent memory accesses to the same shared data, the adaptive LLC chooses between a shared and a private LLC organization based on application characteristics. Compared to the coarse-grained data replication of the adaptive LLC, selective replication chooses the replication degree so as to reduce data contention while avoiding the high LLC miss rate caused by duplicated data. All of these designs significantly increase GPU performance for data-intensive applications.

12:50 PM - 01:10 PM | Break

Session 2 | Session Chair: Hoda NaghibiJouybari

01:10 PM - 01:30 PM | [Paper] Systematically Extending a High-Level Code Generator with Support for Tensor Cores
Lukas Siefke, Bastian Köpcke (University of Münster), Michel Steuwer (University of Edinburgh), Sergei Gorlatch (University of Münster)
Abstract: High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially as the target hardware constantly evolves.
In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach that first exposes the imperative tensor core API to the code generator, then raises the abstractions to an internal low-level functional representation, and finally targets that representation with a rewrite process starting from a high-level functional program.
Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, being only up to 36%, and on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.

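The imperative tensor core API referred to in the abstract is exposed in CUDA through the warp-level WMMA primitives. As a point of reference for the kind of code a generator ultimately has to emit, here is a minimal hand-written WMMA kernel that computes one 16x16x16 tile of a matrix product; it is an independent illustration, not code produced by RISE.

```
// Minimal warp-level tensor core (WMMA) kernel: one warp computes a single
// 16x16 tile of C = A * B in mixed precision. Illustrative sketch only.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *A, const half *B, float *C,
                          int lda, int ldb, int ldc) {
    // Per-warp fragments for A, B, and the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);        // accumulator starts at zero

    wmma::load_matrix_sync(a_frag, A, lda);   // load a 16x16 tile of A
    wmma::load_matrix_sync(b_frag, B, ldb);   // load a 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor core MMA

    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```
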
01:30 PM - 01:50 PM | [Paper] Compiler-Assisted Scheduling for Multi-Instance GPUs
Chris Porter (Georgia Institute of Technology), Chao Chen (Amazon Web Services), Santosh Pande (Georgia Institute of Technology)
Abstract: NVIDIA's Multi-Instance GPU (MIG) feature allows users to partition a GPU's compute and memory into independent hardware instances. MIG guarantees full isolation among co-executing kernels on the device, which boosts security and prevents interference-related performance degradation. Despite the benefits of isolation, however, certain workloads do not necessarily need such guarantees, and in fact enforcing such isolation can negatively impact the throughput of a group of processes. In this work we aim to relax the isolation property for certain types of jobs, and to show how this can dramatically boost throughput across a mixed workload consisting of jobs that demand isolation and others that do not. The number of MIG partitions is hardware-limited but configurable, and state-of-the-art workload managers cannot safely take advantage of unused and wasted resources inside a given partition. We show how a compiler and runtime system working in tandem can be used to pack jobs into partitions when isolation is not necessary. Using this technique we improve overall utilization of the device while still reaping the benefits of MIG's isolation properties. Our experimental results on NVIDIA A30s with a throughput-oriented workload show an average of 1.45x throughput improvement and a 2.93x increase in GPU memory utilization over the Slurm workload manager. The presented framework is fully automatic and requires no changes to user code. Based on these results, we believe our scheme is a practical and strong advancement over the state-of-the-art techniques currently employed for MIG.

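Packing jobs into a single MIG partition essentially means letting their kernels share one device instance instead of giving each job its own slice. A minimal way to picture that co-location (entirely outside the paper's compiler/runtime framework) is two independent kernels launched on separate CUDA streams of the same device, so that they may execute concurrently when resources allow; the kernels and sizes below are made up for illustration.

```
// Sketch: co-locating two independent "jobs" on one device (or one MIG
// instance) by launching their kernels on separate streams. Hypothetical
// kernels; not the paper's scheduling framework.
#include <cuda_runtime.h>

__global__ void job_a(float *x, int n) {          // placeholder workload A
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void job_b(float *y, int n) {          // placeholder workload B
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // Both kernels are resident on the same device/partition; the hardware
    // scheduler can overlap them when enough SMs and memory are available.
    job_a<<<(n + 255) / 256, 256, 0, sa>>>(x, n);
    job_b<<<(n + 255) / 256, 256, 0, sb>>>(y, n);

    cudaDeviceSynchronize();
    return 0;
}
```
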
01:50 PM - 02:05 PM | [Work-in-Progress Presentation] PTXVM: Translating PTX to C
Sreepathi Pai, Benjamin Carleton, Benjamin Valpey, Amr Elhelw (University of Rochester)
Abstract: We describe our ongoing effort to translate CUDA PTX kernels to C. Our translator, PTXVM, generates single-threaded C code from existing PTX kernels and does not need an interpreter. PTXVM is distinguished by its expansive and faithful support of NVIDIA's PTX specification. This enables it to run many complex real-life programs such as CUB and ModernGPU, libraries such as cuRAND, and benchmarks such as the IrGL graph algorithms, Rodinia, and PolyBench. In this talk, I'll describe the architecture of PTXVM as well as an example tracing infrastructure we've built on top of it to gather execution statistics.

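To make "single-threaded C code from existing PTX kernels" concrete, the sketch below shows the general shape of such a translation for a trivial vector-add kernel: the data-parallel launch is replayed as a sequential loop over every (block, thread) index. This is a generic illustration under those assumptions, not PTXVM's actual output or translation strategy.

```
/* Generic illustration of turning a data-parallel GPU kernel into
 * single-threaded C by iterating over the launch grid sequentially.
 * NOT PTXVM's actual output; real translators must also handle barriers,
 * shared memory, and divergence, which this toy loop ignores. */
#include <stddef.h>

/* Body of the original kernel (conceptually: c[i] = a[i] + b[i],
 * one element per GPU thread). */
static void vecadd_body(const float *a, const float *b, float *c, size_t n,
                        size_t blockIdx_x, size_t threadIdx_x, size_t blockDim_x) {
    size_t i = blockIdx_x * blockDim_x + threadIdx_x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Single-threaded "launch": replay every (block, thread) pair in order. */
void vecadd_launch(const float *a, const float *b, float *c, size_t n,
                   size_t gridDim_x, size_t blockDim_x) {
    for (size_t bx = 0; bx < gridDim_x; ++bx)
        for (size_t tx = 0; tx < blockDim_x; ++tx)
            vecadd_body(a, b, c, n, bx, tx, blockDim_x);
}
```
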
02:05 PM - 02:20 PM | [Work-in-Progress Presentation] Understanding Wafer-Scale GPU Performance using an Architectural Simulator
Chris Thames, Yifan Sun (William & Mary)
Abstract: Wafer-scale chips have the potential to break the die-size limitation and provide extreme performance scalability. Existing solutions have demonstrated the possibility of integrating multi-CPU and multi-GPU systems at a significantly larger scale on a wafer. This increased capability results in an increase in the complexity of managing the memory and computing resources. To help the community study wafer-scale systems, this work develops an architectural simulator dedicated to modeling wafer-scale multi-device systems. It also demonstrates analysis of initial results from simulations of wafer-scale GPU systems, providing useful insights that can guide future system design.

02:20 PM - 02:35 PM | [Work-in-Progress Presentation] ScaleServe: A Scalable Multi-GPU Machine Learning Inference System and Benchmarking Suite
Ali Jahanshahi, Marcus Chow, Daniel Wong (UC Riverside)
Abstract: We present ScaleServe, a scalable multi-GPU inference system for a variety of machine learning tasks. The proposed suite is unique in that each component of ScaleServe provides the user with configuration knobs that can be fine-tuned to the specifications of the deployment platform to achieve maximum serving performance. ScaleServe also provides detailed performance metrics and statistics from the different components of the server, which designers can use to characterize the server's bottlenecks.
We evaluate ScaleServe's serving scalability on several machine learning tasks, including computer vision and natural language processing, on an 8-GPU server. We used the statistics provided by ScaleServe to fine-tune the inference server on our target platform to achieve maximum performance. The performance results for ResNet152 show that ScaleServe scales well on a multi-GPU platform.

02:35 PM - 02:40 PM | Closing Remarks