10:00 AM - 10:10 AM | Opening Remarks

10:10 AM - 11:10 AM | [Keynote] Building Performant and Portable Heterogeneous Code using GPU Compute Accelerators
Derek Bouius (AMD)
Abstract: This talk will describe the various methodologies used to offload computationally intensive workloads from CPUs to accelerators. Key topics will cover hardware architecture considerations and programming methodologies, along with debugging and profiling techniques.

11:10 AM - 11:30 AM | Break

Session 1 | Session Chair: Daniel Wong

11:30 AM - 11:50 AM | [Paper] Near LLC Versus Near Main Memory Processing
Hossein Bitalebi, Vahid Geraeinejad, Masoumeh Ebrahimi (KTH Royal Institute of Technology)
Abstract: Emerging advanced applications, such as deep learning and graph processing, with enormous processing demand and massive memory requests call for a comprehensive processing system or advanced solutions to address these requirements. Near data processing is one of the promising structures targeting this goal. However, most recent studies have focused on processing instructions near the main memory data banks while ignoring the benefits of processing instructions near other memory hierarchy levels such as the LLC. In this study, we investigate near LLC processing structures and compare them to the near main memory processing alternative, specifically in graphics processing units. We analyze these two structures on various applications in terms of performance and power. Results show a clear benefit of near LLC processing over near main memory processing in a class of applications. Further, we suggest a structure which could benefit from both approaches, requiring the applications to be characterized in advance or at runtime.

11:50 AM - 12:10 PM | [Paper] Accelerating Data Transfer between Host and Device using Idle GPU
Yuya Tatsugi, Akira Nukada (University of Tsukuba)
Abstract: When running single-GPU applications on multi-GPU compute nodes, the remaining GPU devices are kept idle. We propose a novel technique to accelerate these single-GPU applications using the idle GPU devices. Data transfers between host and device are performed not only by the first GPU but also by the second GPU, taking the alternative route through the PCI-Express and NVLink links attached to it. Our performance evaluations show that the proposed method achieves about twice the data transfer speed of the native single-GPU case for large data sizes.

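The general idea described above, splitting a host-to-device copy so that part of it is staged through a second, otherwise idle GPU and forwarded over the peer link, can be sketched with standard CUDA peer-to-peer calls. This is only an illustration of the approach, not the authors' implementation; the device IDs, buffer size, and even half/half split are arbitrary assumptions.

```
// Sketch: copy a large pinned host buffer to GPU 0, sending the second half
// through idle GPU 1 and forwarding it over the peer (NVLink/PCIe) link.
// Illustrative only; error checks omitted.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 1ul << 30;             // 1 GiB total payload (assumed)
    const size_t half = N / 2;

    char *h_buf;
    cudaMallocHost(&h_buf, N);              // pinned host memory

    // Destination buffer on GPU 0 and a staging buffer on the idle GPU 1.
    char *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, N);
    cudaSetDevice(1); cudaMalloc(&d1, half);

    // Enable direct peer access from GPU 1 to GPU 0 if the topology allows it.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) { cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0); }

    cudaStream_t s0, s1;
    cudaSetDevice(0); cudaStreamCreate(&s0);
    cudaSetDevice(1); cudaStreamCreate(&s1);

    // First half: ordinary host -> GPU 0 copy.
    cudaSetDevice(0);
    cudaMemcpyAsync(d0, h_buf, half, cudaMemcpyHostToDevice, s0);

    // Second half: host -> GPU 1, then GPU 1 -> GPU 0 over the peer link.
    cudaSetDevice(1);
    cudaMemcpyAsync(d1, h_buf + half, half, cudaMemcpyHostToDevice, s1);
    cudaMemcpyPeerAsync(d0 + half, 0, d1, 1, half, s1);

    cudaStreamSynchronize(s0);
    cudaSetDevice(1); cudaStreamSynchronize(s1);
    printf("transfer complete\n");
    return 0;
}
```
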
12:10 PM - 12:30 PM | [Invited Talk] Towards True Coherent Shared Memory for Next-generation Multi-GPU Systems
José L. Abellán (Universidad Católica de Murcia)
Abstract: Multi-GPU (MGPU) systems are commonly used today to accelerate a variety of workloads, including machine learning applications, graph applications, and large-scale simulations. However, the inefficiencies in terms of NUMA effects and the difficulties in programming due to the lack of hardware coherence in these MGPU systems call for new architectural solutions. To address these inefficiencies, in this talk I will present the first proposal of a true shared main memory (TSM) for MGPU systems, and then, to enable seamless sharing of data in an MGPU system with TSM (MGPU-TSM), I will introduce MGCC, a novel lightweight, scalable, timestamp-based coherence protocol. For standard benchmarks, an MGPU-TSM system (with 4 GPUs, using MGCC) implemented in the MGPUSim simulator performs on average 3.7x and 3.0x better with relaxed and sequential consistency, respectively, than non-coherent conventional MGPU systems with the same number of GPUs. In addition, compared to a coherent MGPU system using the state-of-the-art HMG coherence protocol, an MGPU system using MGCC achieves 2.4x higher performance. Finally, I will discuss some of the key challenges and open research directions to further optimize an MGPU-TSM system.

12:30 PM - 12:50 PM | [Invited Talk] Re-design GPU NoC and LLC System
Xia Zhao (Academy of Military Sciences)
Abstract: To provide high compute power, GPUs feature an increased number of SMs with a larger LLC size and higher memory bandwidth. Facing these new opportunities and challenges, how to design a scalable and high-performance NoC and LLC system becomes especially important. In this talk, I will introduce our recent work, including hierarchical NoC design for GPUs, adaptive memory-side last-level GPU caching, and selective replication in memory-side GPU caches. The hierarchical NoC provides a scalable and low-cost interconnection network connecting the SMs with the LLCs and memory controllers by fully exploiting the unique traffic pattern in GPUs. To address the contention caused by concurrent memory accesses to the same shared data, the adaptive LLC chooses between a shared and a private LLC organization based on application characteristics. Compared to the coarse-grained data replication of the adaptive LLC, selective replication chooses the replication degree so as to reduce data contention while avoiding the high LLC miss rate caused by duplicated data. All of these designs significantly increase GPU performance for data-intensive applications.

12:50 PM - 01:10 PM | Break

Session 2 | Session Chair: Hoda NaghibiJouybari

01:10 PM - 01:30 PM | [Paper] Systematically Extending a High-Level Code Generator with Support for Tensor Cores
Lukas Siefke, Bastian Köpcke (University of Münster), Michel Steuwer (University of Edinburgh), Sergei Gorlatch (University of Münster)
Abstract: High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially as the target hardware constantly evolves.
In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach that first exposes the imperative tensor core API to the code generator, then raises the abstractions to an internal low-level functional representation, and finally targets that representation with a rewrite process starting from a high-level functional program.
Our experimental evaluation shows that RISE with support for tensor cores generates code of competitive performance to manually optimized CUDA code, being only up to 36%, and on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and clearly outperforms any code that does not exploit tensor cores.

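The imperative tensor core API referred to in the abstract is exposed in CUDA through the warp-level WMMA primitives. As a point of reference for the kind of code a generator ultimately has to emit, here is a minimal hand-written WMMA kernel that computes one 16x16x16 tile of a matrix product; it is an independent illustration, not code produced by RISE.

```
// Minimal warp-level tensor core (WMMA) kernel: one warp computes a single
// 16x16 tile of C = A * B in mixed precision. Illustrative sketch only.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *A, const half *B, float *C,
                          int lda, int ldb, int ldc) {
    // Per-warp fragments for A, B, and the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);        // accumulator starts at zero

    wmma::load_matrix_sync(a_frag, A, lda);   // load a 16x16 tile of A
    wmma::load_matrix_sync(b_frag, B, ldb);   // load a 16x16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor core MMA

    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```
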
01:30 PM - 01:50 PM | [Paper] Compiler-Assisted Scheduling for Multi-Instance GPUs
Chris Porter (Georgia Institute of Technology), Chao Chen (Amazon Web Services), Santosh Pande (Georgia Institute of Technology)
Abstract: NVIDIA's Multi-Instance GPU (MIG) feature allows users to partition a GPU's compute and memory into independent hardware instances. MIG guarantees full isolation among co-executing kernels on the device, which boosts security and prevents interference-related performance degradation. Despite the benefits of isolation, however, certain workloads do not necessarily need such guarantees, and in fact enforcing such isolation can negatively impact the throughput of a group of processes. In this work we aim to relax the isolation property for certain types of jobs, and to show how this can dramatically boost throughput across a mixed workload consisting of jobs that demand isolation and others that do not. The number of MIG partitions is hardware-limited but configurable, and state-of-the-art workload managers cannot safely take advantage of unused and wasted resources inside a given partition. We show how a compiler and runtime system working in tandem can be used to pack jobs into partitions when isolation is not necessary. Using this technique we improve overall utilization of the device while still reaping the benefits of MIG's isolation properties. Our experimental results on NVIDIA A30s with a throughput-oriented workload show an average of 1.45x throughput improvement and a 2.93x increase in GPU memory utilization over the Slurm workload manager. The presented framework is fully automatic and requires no changes to user code. Based on these results, we believe our scheme is a practical and strong advancement over the state-of-the-art techniques currently employed for MIG.

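Packing jobs into a single MIG partition essentially means letting their kernels share one device instance instead of giving each job its own slice. A minimal way to picture that co-location (entirely outside the paper's compiler/runtime framework) is two independent kernels launched on separate CUDA streams of the same device, so that they may execute concurrently when resources allow; the kernels and sizes below are made up for illustration.

```
// Sketch: co-locating two independent "jobs" on one device (or one MIG
// instance) by launching their kernels on separate streams. Hypothetical
// kernels; not the paper's scheduling framework.
#include <cuda_runtime.h>

__global__ void job_a(float *x, int n) {          // placeholder workload A
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void job_b(float *y, int n) {          // placeholder workload B
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // Both kernels are resident on the same device/partition; the hardware
    // scheduler can overlap them when enough SMs and memory are available.
    job_a<<<(n + 255) / 256, 256, 0, sa>>>(x, n);
    job_b<<<(n + 255) / 256, 256, 0, sb>>>(y, n);

    cudaDeviceSynchronize();
    return 0;
}
```
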
01:50 PM - 02:05 PM | [Work-in-Progress Presentation] PTXVM: Translating PTX to C
Sreepathi Pai, Benjamin Carleton, Benjamin Valpey, Amr Elhelw (University of Rochester)
Abstract: We describe our ongoing effort to translate CUDA PTX kernels to C. Our translator, PTXVM, generates single-threaded C code from existing PTX kernels and does not need an interpreter. PTXVM is distinguished by its expansive and faithful support of NVIDIA's PTX specification. This enables it to run many complex real-life programs such as CUB and ModernGPU, libraries such as cuRAND, and benchmarks such as the IrGL graph algorithms, Rodinia, and PolyBench. In this talk, I'll describe the architecture of PTXVM as well as an example tracing infrastructure we've built on top of it to gather execution statistics.

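To make "single-threaded C code from existing PTX kernels" concrete, the sketch below shows the general shape of such a translation for a trivial vector-add kernel: the data-parallel launch is replayed as a sequential loop over every (block, thread) index. This is a generic illustration under those assumptions, not PTXVM's actual output or translation strategy.

```
/* Generic illustration of turning a data-parallel GPU kernel into
 * single-threaded C by iterating over the launch grid sequentially.
 * NOT PTXVM's actual output; real translators must also handle barriers,
 * shared memory, and divergence, which this toy loop ignores. */
#include <stddef.h>

/* Body of the original kernel (conceptually: c[i] = a[i] + b[i],
 * one element per GPU thread). */
static void vecadd_body(const float *a, const float *b, float *c, size_t n,
                        size_t blockIdx_x, size_t threadIdx_x, size_t blockDim_x) {
    size_t i = blockIdx_x * blockDim_x + threadIdx_x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Single-threaded "launch": replay every (block, thread) pair in order. */
void vecadd_launch(const float *a, const float *b, float *c, size_t n,
                   size_t gridDim_x, size_t blockDim_x) {
    for (size_t bx = 0; bx < gridDim_x; ++bx)
        for (size_t tx = 0; tx < blockDim_x; ++tx)
            vecadd_body(a, b, c, n, bx, tx, blockDim_x);
}
```
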
02:05 PM - 02:20 PM | [Work-in-Progress Presentation] Understanding Wafer-Scale GPU Performance using an Architectural Simulator
Chris Thames, Yifan Sun (William & Mary)
Abstract: Wafer-scale chips have the potential to break the die-size limitation and provide extreme performance scalability. Existing solutions have demonstrated the possibility of integrating multi-CPU and multi-GPU systems at a significantly larger scale on a wafer. This increased capability results in an increase in the complexity of managing the memory and computing resources. To help the community study wafer-scale systems, this work develops an architectural simulator dedicated to modeling wafer-scale multi-device systems. It also demonstrates analysis of initial results from simulations of wafer-scale GPU systems, providing useful insights that can guide future system design.

02:20 PM - 02:35 PM | [Work-in-Progress Presentation] ScaleServe: A Scalable Multi-GPU Machine Learning Inference System and Benchmarking Suite
Ali Jahanshahi, Marcus Chow, Daniel Wong (UC Riverside)
Abstract: We present ScaleServe, a scalable multi-GPU inference system for a variety of machine learning tasks. The proposed suite is unique in that each component of ScaleServe provides the user with configuration knobs that can be fine-tuned to the specifications of the deployment platform to achieve maximum serving performance. ScaleServe also provides detailed performance metrics and statistics from the different components of the server, which designers can use to characterize the server's bottlenecks.
We evaluate ScaleServe's serving scalability on several machine learning tasks, including computer vision and natural language processing, on an 8-GPU server. We used the statistics provided by ScaleServe to fine-tune the inference server on our target platform to achieve maximum performance. The performance results for ResNet152 show that ScaleServe scales well on a multi-GPU platform.

02:35 PM - 02:40 PM | Closing Remarks