Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.
In short, the following are the main reasons why machine-generated data is important:
- Machines are capable of producing data at very high rates. In the time it took you to read this sentence, my three-year-old laptop could have produced the entire works of Shakespeare.
- The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile, transistor counts (and also hard-disk capacity, since it increases at roughly the same rate) have increased by over 2000%. If all data were closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” minuscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion.) No researcher dreams of writing papers that make only a temporary impact. With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).
- The predicted demise of the magnetic hard disk in favor of solid-state alternatives will not come as fast as some people think. As long as hard disk capacity keeps pace with the rate at which machine-generated data is produced, it will remain the most cost-efficient option for machine-generated “Big Data” (at least until race-track memory becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. This means that we must remain conscious of disk-speed limitations when it comes to complete data scans.
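To make the gap in the second reason concrete, here is a minimal sketch that converts the two per-decade figures quoted above into compound annual rates. The "over 2000%" figure is taken as exactly 2000% for illustration, and the function and constant names are my own.

```python
# Per-decade growth figures quoted in the list above.
POPULATION_GROWTH_PER_DECADE = 0.20  # "about 20%"
HARDWARE_GROWTH_PER_DECADE = 20.0    # "over 2000%", assumed to be exactly 2000% here


def annualized(decade_growth: float) -> float:
    """Convert fractional growth over ten years into a compound annual rate."""
    return (1.0 + decade_growth) ** (1.0 / 10.0) - 1.0


population_rate = annualized(POPULATION_GROWTH_PER_DECADE)
hardware_rate = annualized(HARDWARE_GROWTH_PER_DECADE)

print(f"population: {population_rate:.1%} per year")
print(f"hardware:   {hardware_rate:.1%} per year")
```

Under these assumptions, population grows at a bit under 2% a year while hardware grows at roughly 36% a year, which is why data tied to head count cannot keep up with capacity for long.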
Curt attempts to define “machine-generated data” in his post as the following:
Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.
He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.
If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point-of-sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g., this 5X number remains relatively constant over time), then these datasets are not always going to be “Big Data”, for the reasons described in the second point above.
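The linear-relationship argument can be sketched numerically. The model below is deliberately crude and rests on assumptions that are mine, not Monash's: human actions grow only with population (about 20% per decade), hardware capacity grows by a factor of 21 per decade, and every human action carries a fixed 5X of additional log data. All names are hypothetical.

```python
HUMAN_GROWTH_PER_DECADE = 1.2    # population factor: "about 20%" per decade
HARDWARE_GROWTH_PER_DECADE = 21  # capacity factor: "over 2000%" per decade
LOG_MULTIPLIER = 5               # extra log data per human action, assumed constant


def relative_size(decades: int) -> float:
    """Human-tied data divided by hardware capacity, normalized to 1.0 today."""
    human_actions = HUMAN_GROWTH_PER_DECADE ** decades
    tied_data = human_actions * (1 + LOG_MULTIPLIER)  # each action plus its logs
    baseline = 1 + LOG_MULTIPLIER                     # the same quantity today
    capacity = HARDWARE_GROWTH_PER_DECADE ** decades
    return (tied_data / baseline) / capacity


for d in range(4):
    print(f"after {d} decade(s): {relative_size(d):.6f}")
```

Because the multiplier is constant, it cancels out of the normalized ratio: whether each click drags along 5X or 50X of log data, the dataset shrinks relative to capacity by the same factor each decade. Only a multiplier that itself grows with hardware speed would change the picture.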
The basic source of confusion here is that clickstream datasets and social gaming datasets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well-known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine-generated. However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way, since the companies that broadcast the size of their datasets are getting larger and gaining users at a rapid pace, and these companies are deciding to throw away less data, but over the long term the rate of increase of these datasets must slow down due to the human limitation. This makes them less interesting for the future of “Big Data” research.
I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:
Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.
Machine-generated “Big Data” is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.
Under this definition, stock trade data (independent computational agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.