Blog

April 20th, 2018 marked the end of our first iteration of DAWNBench, the first deep learning benchmark and competition that measures end-to-end performance: the time/cost required to achieve a state-of-the-art accuracy level for common deep learning tasks, as well as the latency/cost of inference at this state-of-the-art accuracy level.

(Editor’s note: I was unaware that Kyle Kingsbury was doing a linearizability analysis of Hazelcast when I was writing this post. Kyle’s analysis led Greg Luck, Hazelcast’s CEO, to write a blog post in which he cited the PACELC theorem and came to some of the same conclusions that I reached in writing this post.

In my previous blog post, I discussed the relatively new Apache Arrow project, and compared it with two similar column-oriented storage formats in ORC and Parquet. In particular, I explained how storage formats targeted for main memory have fundamental differences from storage formats targeted for disk-resident data.

Introduction: In 2012, two research papers were published that described the design of geographically replicated, consistent, ACID-compliant, transactional database systems. Both papers criticized the proliferation of NoSQL database systems that compromise replication consistency and transactional support, and argued that it is very possible to build extremely scalable, geographically replicated systems without giving up on consistency and transactional support.

EMC announced today that they are acquiring Greenplum. Below are the first thoughts that crossed my mind when I heard about this deal.

Apache Parquet and Apache ORC have become popular file formats for storing data in the Hadoop ecosystem. Their primary value proposition revolves around their “columnar data representation format”. To quickly explain what this means: many people model their data in a set of two-dimensional tables where each row corresponds to an entity, and each column to an attribute of that entity. However, storage is one-dimensional --- you can only read data sequentially from memory or disk in one dimension.

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.

NoSQL systems such as MongoDB, Cassandra, HBase, DynamoDB, and Riak have made many things easier for application developers. They generally have extremely flexible data models that reduce the burden of predicting in advance how an application will change over time. They support a wide variety of data types, and allow nesting of data and dynamic addition of new attributes.

I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency (e.g. here, here, and here), due to their ability to store and access column families separately.

Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.

DAWNBench v1 Deep Learning Benchmark Results

Hazelcast and the Mythical PA/EC System

An analysis of the strengths and weaknesses of Apache Arrow

Distributed consistency at scale: Spanner vs. Calvin

Quick thoughts on EMC acquiring Greenplum

Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?

Replication and the latency-consistency tradeoff

Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item

Distinguishing Two Major Types of Column-Stores

Machine vs. human generated data