WATERLOO / ENGINEERING
ELECTRICAL AND COMPUTER ENGINEERING
SEMINAR
Dr. Paolo Rech
Associate Professor
Federal University of Rio Grande do Sul
Porto Alegre, RS, Brazil
Reliability Issues in Current and Future Supercomputers
Invited by Associate Professor Sebastian Fischmeister
Abstract:
Modern supercomputers are composed of thousands of Graphics Processing Units (GPUs) that work in parallel. Titan, today's third-fastest supercomputer for open science, consists of more than 18,000 GPUs used by scientists from various domains such as astrophysics, fusion, climate, and combustion.
Because of their large scale and long duration, these scientific applications may encounter interruptions due to system failures as well as Silent Data Corruptions (SDCs). Therefore, while the performance improvement achieved via the inherent parallelism available in GPUs is necessary to expedite the scientific discovery process, it is equally critical that applications are able to cope with system failures during their execution without losing all of the work.
As we will show in the talk, the newest GPU cores are sensitive to radiation-induced errors, including those from the terrestrial neutron radiation environment.
Experimental data obtained during three years of radiation experiments on current GPUs and the analysis of Titan field data will be presented and discussed. A detailed analysis of the causes and effects of radiation-induced failures in supercomputers will be provided using a wide set of parallel applications as case studies.
Experimental data will be used to show the benefit of enabling ECC on the GPUs' main memory structures and to compare its efficiency with that of duplication and Algorithm-Based Fault Tolerance (ABFT).
Finally, novel code optimizations that reduce the time-to-solution of specific parallel algorithms are continuously being implemented. As experimentally demonstrated, code optimizations increase the code's sensitivity but may reduce the execution time in a way that increases the overall system reliability.
Biography:
Paolo Rech received his master's and Ph.D. degrees from Padova University, Padova, Italy, in 2006 and 2009, respectively.
His studies included radiation tests and the effect of neutrons, protons, and alpha particles on programmable devices like FPGAs and Systems on Chip.
He was a postdoctoral researcher at LIRMM, Montpellier, France, from 2010 to 2012, working on radiation effects on electronic devices at high altitudes.
He is currently an associate professor at the Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
Recently, he started collaborations with NVIDIA, AMD, and Los Alamos National Lab to evaluate and mitigate the radiation-induced effects in devices designed for large-scale HPC centers.