Archives Unleashed aims to make petabytes of historical Internet content accessible to scholars and others interested in researching the recent past. We are developing web archive search and data analysis tools to enable scholars, librarians, and archivists to access, share, and investigate recent history since the early days of the World Wide Web.
Bringing Order to Data develops tools and algorithms for mining rules and dependencies from large datasets, with applications to data profiling, knowledge discovery, and query optimization. We are particularly interested in mining streaming and ordered data, e.g., discovering order dependencies among columns.
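As an informal illustration of the kind of dependency we mine, the sketch below checks whether sorting a small table by one column also sorts it by another (a simple form of order dependency). The column names and data are invented for the example.

```python
# Minimal sketch: test whether column A orders column B, i.e., sorting the
# rows by A also sorts them by B (a simple form of order dependency).
# The table and column names here are illustrative only.

def satisfies_order_dependency(rows, a, b):
    """Return True if ordering rows by column `a` also orders column `b`."""
    ordered = sorted(rows, key=lambda r: r[a])
    return all(prev[b] <= curr[b] for prev, curr in zip(ordered, ordered[1:]))

table = [
    {"week": 1, "total_sales": 100},
    {"week": 2, "total_sales": 180},
    {"week": 3, "total_sales": 240},
]

# week orders total_sales in this toy table, so the check succeeds.
print(satisfies_order_dependency(table, "week", "total_sales"))  # True
```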
Data Science for Social Good applies machine learning, graph mining, and text mining techniques for social good. Examples include smart meter data mining to save energy, social network mining to characterize public health, and educational data mining to promote gender equity in science and engineering.
Data Stream Management: Social media streams and sensor measurements are continuously generated over time. Thus, in addition to volume, we must deal with data velocity, which motivates new techniques for real-time processing of streaming data. We are investigating new scheduling algorithms for data-intensive streaming tasks and new read- and write-optimized data structures for storing real-time and historical data.
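To make the velocity point concrete, here is a minimal sliding-window sketch over a simulated sensor stream; the readings and window size are invented for the example and are not tied to any particular system described here.

```python
from collections import deque

# Minimal sketch: a fixed-size sliding window over a simulated sensor stream,
# maintaining a running average in O(1) work per arriving reading.
# The readings and window size are illustrative only.

class SlidingAverage:
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.total = 0.0

    def add(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.window_size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

stream = [21.0, 21.5, 22.0, 35.0, 22.5, 21.8]  # simulated temperature readings
agg = SlidingAverage(window_size=3)
for reading in stream:
    print(f"reading={reading:5.1f}  window_avg={agg.add(reading):.2f}")
```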
Graphflow is an in-memory graph database that we are building from scratch to evaluate both one-time and continuous queries. We study the fundamental components of graph databases, such as storage, query optimization, query processing, and triggers, and build each of these components ourselves.
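As a rough illustration of what a continuous query does (our own sketch, not Graphflow's API or query language), the code below reports every new triangle formed as edges stream into an undirected graph.

```python
from collections import defaultdict

# Minimal sketch of a continuous query over a graph: as undirected edges
# arrive, report each new triangle the edge completes. This illustrates the
# idea of incremental evaluation; it is not Graphflow's actual interface.

adj = defaultdict(set)

def insert_edge(u, v):
    """Insert edge (u, v) and return the triangles it completes."""
    new_triangles = [(u, v, w) for w in adj[u] & adj[v]]
    adj[u].add(v)
    adj[v].add(u)
    return new_triangles

for edge in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("b", "d")]:
    for tri in insert_edge(*edge):
        print(f"edge {edge} completed triangle {tri}")
```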
gStore is an RDF graph database system that employs a native graph representation. gStore uses a subgraph matching-based query strategy, together with a series of query optimization techniques and a structure-aware index, to build an efficient graph-native SPARQL query engine. It supports SPARQL 1.1, the standard RDF query language. It can be deployed on a single machine or in a scale-out setting.
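To give a feel for subgraph matching-based evaluation (a simplified illustration of the idea, not gStore's implementation), the sketch below matches a tiny SPARQL-style basic graph pattern against an in-memory triple set by backtracking over triple patterns; the data is made up.

```python
# Minimal sketch of evaluating a SPARQL-style basic graph pattern by subgraph
# matching over an in-memory triple set. A simplified illustration only;
# not gStore's implementation. The triples below are invented.

triples = {
    ("alice", "worksAt", "uwaterloo"),
    ("bob", "worksAt", "uwaterloo"),
    ("alice", "knows", "bob"),
}

# Pattern: ?x knows ?y . ?x worksAt ?org . ?y worksAt ?org
pattern = [("?x", "knows", "?y"), ("?x", "worksAt", "?org"), ("?y", "worksAt", "?org")]

def is_var(term):
    return term.startswith("?")

def match(pattern, binding=None):
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    tp, rest = pattern[0], pattern[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(tp, triple):
            if is_var(term):
                if new.get(term, value) != value:
                    ok = False
                    break
                new[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, new)

for solution in match(pattern):
    print(solution)   # {'?x': 'alice', '?y': 'bob', '?org': 'uwaterloo'}
```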
HiCAL (High-Recall Retrieval with Continuous Active Learning™) is an open-source project that facilitates the efficient identification of all or nearly all relevant documents in a corpus. HiCAL allows users to judge documents as fast as possible with no perceptible interface lag.
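A bare-bones sketch of the continuous active learning loop behind high-recall retrieval (our simplification, not HiCAL's code): train on the judgments gathered so far, surface the highest-scoring unjudged documents, record the reviewer's judgments, and repeat. The corpus and the relevance "oracle" below are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bare-bones continuous active learning loop (a simplification for
# illustration, not HiCAL's code). A tiny invented corpus stands in for the
# collection, and `oracle` stands in for the human reviewer.

docs = [
    "contract dispute over software license terms",
    "invoice for office furniture delivery",
    "email thread about software licensing negotiation",
    "lunch menu for the cafeteria",
    "draft amendment to the license agreement",
    "holiday schedule announcement",
]
oracle = [1, 0, 1, 0, 1, 0]          # 1 = relevant (license-related)
judged = {0: 1, 1: 0}                # seed judgments: doc 0 relevant, doc 1 not

X = TfidfVectorizer().fit_transform(docs)

for round_no in range(3):
    unjudged = [i for i in range(len(docs)) if i not in judged]
    if not unjudged:
        break
    model = LogisticRegression().fit(X[list(judged)], list(judged.values()))
    scores = model.predict_proba(X[unjudged])[:, 1]
    best = unjudged[scores.argmax()]          # review one document per round
    judged[best] = oracle[best]               # the "reviewer" supplies a judgment
    print(f"round {round_no}: reviewed doc {best}, judged {judged[best]}")
```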
HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and other signals to build a probabilistic model that captures the data generation process, and uses the model in a variety of data curation tasks.
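As a loose illustration of combining signals for repair (a toy sketch of the idea, not HoloClean's model or API), the code below scores candidate values for a missing cell using a simple quality rule plus co-occurrence statistics from the clean part of the data; the table and rule are invented.

```python
from collections import Counter

# Toy sketch of weakly supervised data repair: score candidate values for a
# missing cell by combining (1) a quality rule and (2) co-occurrence
# statistics from clean rows. Illustrates the idea only; not HoloClean's
# model or API. The table and rule are invented.

rows = [
    {"city": "Waterloo", "province": "ON"},
    {"city": "Waterloo", "province": "ON"},
    {"city": "Toronto",  "province": "ON"},
    {"city": "Montreal", "province": "QC"},
    {"city": "Waterloo", "province": None},   # cell to repair
]

def score_candidates(dirty_row, candidates):
    clean = [r for r in rows if r["province"] is not None]
    cooc = Counter(r["province"] for r in clean if r["city"] == dirty_row["city"])
    total = sum(cooc.values()) or 1
    scores = {}
    for value in candidates:
        # Signal 1: a rule saying city functionally determines province.
        rule_ok = all(r["province"] == value for r in clean if r["city"] == dirty_row["city"])
        # Signal 2: how often this value co-occurs with the row's city.
        scores[value] = (1.0 if rule_ok else 0.0) + cooc[value] / total
    return scores

candidates = {r["province"] for r in rows if r["province"] is not None}
print(score_candidates(rows[-1], candidates))   # 'ON' scores highest
```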
Kùzu is an in-process property graph database management system (GDBMS) built for graph data science workloads. Kùzu is optimized for query speed and scalability, with the aim of handling complex, join-heavy analytical workloads on very large graph databases. We are building Kùzu as a feature-rich, usable GDBMS under a permissive license, and we design, implement, and study each component of the system.
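For a sense of the in-process usage model, here is a minimal example based on Kùzu's Python API as we understand it; the DDL and Cypher details may differ slightly between releases.

```python
import kuzu

# Minimal in-process usage sketch based on Kùzu's Python API as we understand
# it; exact DDL/Cypher details may vary slightly across releases.

db = kuzu.Database("./demo_kuzu_db")      # on-disk database directory
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY(name))")
conn.execute("CREATE REL TABLE Knows(FROM Person TO Person)")
conn.execute("CREATE (:Person {name: 'Alice', age: 30})")
conn.execute("CREATE (:Person {name: 'Bob', age: 27})")
conn.execute(
    "MATCH (a:Person), (b:Person) WHERE a.name = 'Alice' AND b.name = 'Bob' "
    "CREATE (a)-[:Knows]->(b)"
)

result = conn.execute("MATCH (a:Person)-[:Knows]->(b:Person) RETURN a.name, b.name")
while result.has_next():
    print(result.get_next())              # e.g., ['Alice', 'Bob']
```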
S-graffito is a Streaming Graph Management System that addresses the processing of OLTP and OLAP queries over very large graphs with high streaming rates. These graphs are increasingly being deployed to capture relationships between entities (e.g., customers and catalog items in an online retail environment), both for transactional processing and for analytics (e.g., recommendation systems).
Sirius: A High-Velocity Blockchain. Distributed ledgers such as blockchains are used to store transactions in a secure and verifiable manner without the need for a trusted third party. In the Sirius project, we are working on technologies to make blockchains more scalable, and we are investigating novel applications of high-velocity blockchains such as transactive energy and clean transportation.
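To illustrate the verifiability property in the plainest possible terms (a generic hash-chain sketch, unrelated to Sirius's actual design), each block below commits to its predecessor, so tampering with any stored transaction breaks the chain.

```python
import hashlib
import json

# Generic hash-chain sketch illustrating why blockchain contents are tamper
# evident; this is the textbook idea, not Sirius's design.

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def build_chain(transaction_batches):
    chain, prev = [], "0" * 64
    for txs in transaction_batches:
        block = {"prev_hash": prev, "transactions": txs}
        chain.append(block)
        prev = block_hash(block)
    return chain

def verify(chain):
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        prev = block_hash(block)
    return True

chain = build_chain([["alice pays bob 5"], ["bob pays carol 2"]])
print(verify(chain))                                  # True
chain[0]["transactions"][0] = "alice pays bob 500"    # tamper with history
print(verify(chain))                                  # False
```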
WatDiv is a benchmark designed to measure how an RDF data management system performs across a wide spectrum of SPARQL queries with varying structural characteristics and selectivity classes. It is a micro-benchmark for stress-testing the performance of systems across a wide variety of queries over data sets of varying sizes.