Candidate: Ali Hossein Abbasi Abyaneh
Title: Multi-agent Learning for Cooperative Scheduling of Microsecond-scale Services at Rack Scale
Date: January 11, 2022
Time: 12:00
Place: online
Supervisor(s): Zahedi, Seyed Majid
Abstract:
We consider the load-balancing problem in dense racks running microsecond-scale services. In such a system, the scheduler needs to make millions of scheduling decisions per second. Achieving this throughput while providing microsecond-scale tail latency and high availability is extremely challenging. To address this challenge, we design a fully distributed load-balancing framework. In this framework, servers cooperatively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. In this game, servers make scheduling decisions upon receiving and completing tasks. We propose a distributed multi-agent learning algorithm to find the game’s parametric Nash equilibrium. Our proposed algorithm enables servers to make scheduling decisions in tens of nanoseconds based on (possibly outdated) estimates of the load on other servers. We implement and deploy our distributed load-balancing algorithm on a rack-scale computer with 264 physical cores. Our proposed solution provides up to 20% more throughput at low tail latency than widely used load-balancing policies.
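
To make the idea of scheduling on possibly outdated load estimates concrete, the following is a minimal illustrative sketch only. It uses a simple fixed-threshold forwarding heuristic, not the learned policy or the stochastic-game formulation from the thesis. The names `choose_server` and `simulate`, the `threshold` and `refresh_every` parameters, and the random arrival and completion model are all assumptions introduced for illustration.

```python
import random


def choose_server(local_id, local_load, estimates, threshold=2):
    """Pick where to run a task that just arrived at `local_id`.

    `estimates` maps server id -> last known (possibly stale) load.
    The task is forwarded only if the least-loaded peer appears to be
    lighter than the local server by more than `threshold` (hypothetical rule).
    """
    best_peer = min(
        (peer for peer in estimates if peer != local_id),
        key=lambda peer: estimates[peer],
        default=None,  # no peers known: keep the task locally
    )
    if best_peer is not None and local_load - estimates[best_peer] > threshold:
        return best_peer
    return local_id


def simulate(num_servers=4, num_tasks=1000, refresh_every=10, seed=0):
    """Toy simulation: each server sees peer loads refreshed only periodically."""
    rng = random.Random(seed)
    loads = [0] * num_servers
    # views[i] is server i's (stale) picture of every server's load.
    views = [dict(enumerate(loads)) for _ in range(num_servers)]
    for step in range(num_tasks):
        if step % refresh_every == 0:
            for view in views:
                view.update(enumerate(loads))  # estimates go stale in between
        arrival = rng.randrange(num_servers)
        target = choose_server(arrival, loads[arrival], views[arrival])
        loads[target] += 1
        # Assumed completion model: one random server finishes a task, if it has any.
        finished = rng.randrange(num_servers)
        if loads[finished] > 0:
            loads[finished] -= 1
    return loads


if __name__ == "__main__":
    print(simulate())
```

Each decision here is a constant-time comparison against locally cached estimates, which is the property the abstract highlights. How those estimates are exchanged and how the decision rule is learned are left to the thesis itself.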