

# Steps towards HW/SW-DB-CoDesign

Wolfgang Lehner

### Modern Hardware – All over the place...



Fast Updates on Read-Optimized Databases

of a DC or key-value store that exploits these techniques. Indeed,

it is an instance of a paradigm for how to achieve latch-freedom and

log structuring more generally. In this paper, we describe a new

architecture where the latch-free and log-structure techniques of the

Bw-tree are implemented in a cache/storage subsystem capable of

supporting multiple access methods, in the same way that a

traditional cache/storage subsystem deals with latched access to

fixed size pages that are written back to disks as in-place updates.



of single-node, shared-memory DBMSs is even more important in the many-core era. But if the current DBMS technology does not shared memory adapt to this reality, all this computational power will be wasted on bottlenecks, and the extra cores will be rendered useless.

In this paper, we take a peek at this dire future and examine what happens with transaction processing at one thousand cores. Rather than looking at all possible scalability challenges, we limit our scope to concurrency control. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will become a major bottleneck to scalability, and will likely dwin-

number of cores increases, the problem of concurrency control be-

comes extremely challenging. With hundreds of threads running in

parallel, the complexity of coordinating competing accesses to data

To better understand just how unprepared current DBMSs are for

future CPU architectures, we performed an evaluation of concur-

rency control for on-line transaction processing (OLTP) workloads

on many-core chips. We implemented seven concurrency control

algorithms on a main-memory DBMS and using computer simula-

will likely diminish the gains from increased core counts.

UNIVE

1 INTRODUCTION

#### 1.1 Modern Architectures Modern computer platforms have changed sufficiently that it is

tree, which is known for good performance.

implementation to use LLAMA. The Bw-tree is a B-tree style

index. Layered on LLAMA, it has higher performance and

scalability using real workloads compared with BerkeleyDB's B-



### A Look at Hardware Trends - 2015



System Level



Dresden Database

# The Dresden Agenda ...

#### CULTURE

(inoffical) cultural capital of Germany (theaters, museums, etc.)

### SEMI CONDUCTOR INDUSTRY

- No.1 semiconductor site in Europe
- No.5 semiconductor site on the planet
  - Siltronic wafer production, Toppan / Dupont Photomasks
  - Infineon Global Foundries
  - ZMD, ATMEL, Applied Materials
  - Intel, Amazon etc.

#### LARGE TOP-NOTCH R&D ORGANIZATIONS

- TU Dresden with >38.000 students
- Fraunhofer with 11 institutes (1.200 employees)
- Max Planck with 3 institutes (900 employees)
- Leibnitz Gesellschaft with 4 institutes and 1.500 employees
- Helmholtz Institute, Rossendorf (800 employees)





### The "Dresden Data(base) System Group"



# The cfAED mission



Unique Window of Opportunity for shaping the next big technology waves

#### SOME NUMBERS

- FUNDING VOLUME // € 34 million
- FUNDING PERIOD // 1 Nov 2012 31 Oct 2017
- PARTICIPATING INSTITUTIONS // 11
- INVESTIGATORS // ~ 60





#### ...more shots on a goal



# The "HAEC-Box"



#### **K**EY CHARACTERISTICS

- Optical communication on board
- Adaptive wireless backplane communication
- 3D stacking
- Self-\* capabilities on SW-level



6 10 2 mm-wave processor / memory electro-optics E/O-PCB processor / memory mm-wave , , ,

Multiple layers of memory and processing units



Impact on processor design



Instruction set extensions



React on computer architecture



DB architecture for Scale-up

Dresden Database

Systems Group





# Database-specific Instruction Sets



# xPU Developments and Consequences





http://upload.wikimedia.org/wikipedia/en/c/ce/Clock\_CPU\_Scaling.jpg



TECHNISCH

UNIVERSITÄT

DRESDEN

Appears in the Proceedings of the 38<sup>th</sup> International Symposium on Computer Architecture (ISCA '11) **Dark Silicon and the End of Multicore Scaling** Hadi Esmaeilzadeh' Emily Blem' Renée St. Arnant<sup>a</sup> Karthikeyan Sankaralingam<sup>a</sup> Doug Burger' <sup>1</sup>University of Washington <sup>1</sup>University of Wesconsith Addison <sup>1</sup>The University of Texas at Austin <sup>1</sup>Microsoft Research hadianeh@cs.waschington.edu blem@cs.was.edu karu@cs.wis.edu douger@microsoft.com



#### ABSTRACT

Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core perture, and compiler advances, Moore's Law, coupled with Dennard scaling [11], has resulted in commensurate exponential performance increases. The recent shift to multicore designs has aimed to in-

## Motivation of "DB Processor"

#### TODAY'S DATABASE SERVERS

- Fat cores (area & power)
- Few HW adaptions
- CMOS scaling



#### OUR APPROACH

- HW/SW Co-design
- Customizable processor
- Application-specific ISA extensions
- Tool flow & short HW development cycles



#### DATABASE MACHINES (ANCIENT)

- Processors build from scratch
- Long development cycles
- High development costs



# Currently available HW: The Tomahawk2







INIVERSITÄ

#### Core manager (CM):

- Extended Xtensa-LX4
- Scheduling specific instruction set
- 32KB for code
- 64KB for data

#### **Processing Elements (PEs)**

- Xtensa-LX4 from Tensilica (now Cadence)
- 32KB for code
- 32KB for data

#### Application Core (App)

- 570T core from Tensilica (now Cadence)
- 16KB cache for code
- 16KB cache for data

#### 2 x 128MB DRAM



Memory 1

### Customizable Processor Model







## Overview: Tool Flow







### A selection of database primitives...

### SET OPERATIONS (FOR RID LISTS)

- Intersection
- Difference
- Union

#### SORTING

UNIVERSITÄT

Merge Sort

#### HASH OPERATIONS

- Integer Hashing
- String Hashing
- Hash Table Management

#### COMPRESSION (FOR BITMAP INDEXES)

- Word-Aligned Hybrid (WAH)
- Position List Word Aligned Hybrid (PLWAH)
- COMPressed Adaptive indeX (COMPAX)







### Sorted-Set Intersection



 $\rightarrow$  + 2x 128 bit data busses + explicit load-store + SIMD + ...

Dresden Database

Systems Group

## Selectivity: Intersection







### Throughput [Mib/s]



|             | Processor                  | Partial<br>Load–Store | f[MHz]                                    | Intersection                              | Union          | Difference                                 | Merge–Sort   | Selectivity: 50%      |
|-------------|----------------------------|-----------------------|-------------------------------------------|-------------------------------------------|----------------|--------------------------------------------|--------------|-----------------------|
| e unit      | $108 MINI^{1}$<br>DBA_1LSU | -                     | $\begin{array}{c} 442 \\ 435 \end{array}$ | $\begin{array}{c} 31.3\\ 50.7\end{array}$ | $26.4 \\ 47.7$ | $\begin{array}{c} 35.7\\ 50.4 \end{array}$ | $1.7 \\ 3.2$ | Data bus: 32->128 bit |
| Store       | <b>C</b> DBA_1LSU_EIS      | no                    | 424                                       | 513.4                                     | 665.0          | 658.8                                      | 29.3         |                       |
| d-S         | DBA_2LSU_EIS               | no                    | 410                                       | 693.0                                     | 643.0          | 637.0                                      | 28.3         |                       |
| Load-       | DBA_1LSU_EIS               | yes                   | 424                                       | 859.0                                     | 574.2          | 859.0                                      | 29.3         | Final processor       |
| -<br>-<br>- | ► DBA_2LSU_EIS             | yes                   | 410                                       | 1203.0                                    | 780.4          | 1192.6                                     | 28.3         |                       |
| Т           |                            |                       |                                           | Ļ                                         | Ļ              | ţ                                          | ţ            | _                     |
|             | → DBA_2LSU_E               | IS <i>vs.</i> 10      | 8MINI:                                    | 38x                                       | 30x            | 33x                                        | -            | 17x                   |





 $S\!ORTED\text{-}SET\,INTERSECTION$ 

|                         | INTEL 17-920        | DBA_2LSU_EIS         |                    |
|-------------------------|---------------------|----------------------|--------------------|
| Throughput (elements/s) | 1,100 mio           | 1,203 mio            |                    |
| Clock frequency         | $2.67 \mathrm{GHz}$ |                      | → 7x improvement   |
| Max. TDP                | $130 \mathrm{W}$    | 0.135 W              | → 963x improvement |
| Cores/Threads           | 4/8                 | 1/1                  |                    |
| Feature size            | 45 nm               | $65 \mathrm{nm}$     |                    |
| Area (logic & memory)   | $263 \text{ mm}^2$  | $1.5 \text{ mm}^2$ – | → 175x improvement |



### Timing and Area



|   | Technology        | Processor          | $A_{LOGIC}[mm^2]$ | $A_{MEM}[mm^2]$ | $f_{MAX}[ m MHz]$ | $\mathrm{P}[\mathrm{mW}]$ @ $f_{MAX}$ | Fi | nal processor    |         |
|---|-------------------|--------------------|-------------------|-----------------|-------------------|---------------------------------------|----|------------------|---------|
| • | 65  nm            | 108Mini            | $0.220^{1}$       | -               | $442^{1}$         | $27.4^{1}$                            |    | Part             | Area[%] |
|   |                   | DBA_1LSU           | 0.177             | 0.874           | 435               | 56.6                                  | [  | Basic Core       | 20.5    |
|   |                   | DBA_2LSU           | 0.177             | 0.870           | 429               | 57.1                                  |    | Decoding/Muxing  | 14.4    |
|   |                   | DBA_1LSU_EIS       | 0.523             | 0.874           | 424               | 123.5                                 |    | States           | 14.7    |
|   |                   | DBA_2LSU_EIS       | 0.645             | 0.870           | 410               | 135.1                                 |    | Op: All          | 11.3    |
|   | 28  nm            | DBA_2LSU_EIS       | 0.169             | 0.232           | 500               | 47.0                                  |    | Op: Intersection | 6.8     |
| • | <sup>1</sup> http | ://www.tensilica.  | com/uplo          | ads/pdf         | /108Min           | i ndf                                 |    | Op: Difference   | 9.0     |
|   | noop              | .,, www.censiiiea. | com, apro         | uus, pui,       | , 1001111         | ·····                                 | 1  | Op: Union        | 17.6    |

Relative Area Consumption(DBA\_2LSU\_EIS)

| Part             | Area[%] |
|------------------|---------|
| Basic Core       | 20.5    |
| Decoding/Muxing  | 14.4    |
| States           | 14.7    |
| Op: All          | 11.3    |
| Op: Intersection | 6.8     |
| Op: Difference   | 9.0     |
| Op: Union        | 17.6    |
| Op: Merge-Sort   | 5.7     |
| SUM              | 100     |

## Tomahawk2 Programming









# DB-Principles for "Mega-Core" Machines



### Main Driver: NUMA Awareness





(a) Intel Machine (Detailed).



(b) AMD Machine (Topology View).



(c) SGI Machine (Topology View).

| Intel mach  | ine                                                              |                 | AMD machine                                                                          | SGI machine                        |                 |                |                                                                |                 |
|-------------|------------------------------------------------------------------|-----------------|--------------------------------------------------------------------------------------|------------------------------------|-----------------|----------------|----------------------------------------------------------------|-----------------|
| distance    | $\begin{array}{c} \text{bandwidth} \\ \text{(GB/s)} \end{array}$ | latency<br>(ns) | distance (link width)                                                                | $\frac{\rm bandwidth}{\rm (GB/s)}$ | latency<br>(ns) | distance       | $\begin{array}{c} {\rm bandwidth} \\ {\rm (GB/s)} \end{array}$ | latency<br>(ns) |
| local       | 26.7                                                             | 129             | local                                                                                | 16.4                               | 85              | local          | 36.2                                                           | 81              |
| 1  hop  QPI | 10.7                                                             | 193             | $1 \operatorname{hop} \operatorname{HT}(\operatorname{full} \operatorname{link})$    | 5.8                                | 136             | 2nd processor  | 9.5                                                            | 400             |
|             |                                                                  |                 | 1 hop HT (split, single)                                                             | 4.2                                | 152             | 1 hop NUMALink | 7.5                                                            | 505 - 515       |
|             |                                                                  |                 | 1 hop HT (split,dual)                                                                | 2.9                                | 152             | 2 hop NUMALink | 7.5                                                            | 625 - 635       |
|             |                                                                  |                 | 2 hop HT (split, single)                                                             | 3.7                                | 196             | 3 hop NUMALink | 7.1                                                            | 745 - 755       |
|             |                                                                  |                 | $2 \operatorname{hop} \operatorname{HT} (\operatorname{split}, \operatorname{dual})$ | 1.8                                | 196             | 4 hop NUMALink | 6.5                                                            | 870             |



### TA versus Data-Oriented Architecture (DORA)

UNIVERSITÄT





### ERIS Data Management Core



... an academic playground for modern DB techniques





#### 26

# ERIS Overall Architecture

#### DATA-ORIENTED ARCHITECTURE

Follows MVCC principle

UNIVERSITÄT

- Distribution based on logical partitioning
- Aggressive re-partioning using copy as well as link strategies







# Evaluation: Some MicroBenchmarking



#### LOOKUP/UPSERT THROUGHPUT DEPENDING ON INDEX SIZE



#### AMD Machine

SGI Machine

## Evaluation: Scan Throughput



#### $S\!CAN\ PERFORMANCE$

- SGI Machine
- 488 cores parallel scan
- 8 billion entries in the column store

#### LINK AND MEMORY CONTROLLER ACTIVITY

- AMD Machine
- Scan: 8B Keys
- Lookup: 1B Keys







### Evaluation: L3 Cache Usage

L3 CACHE USAGE - INDEX LOOKUP

#### L3 C ACHE LINE STATE – INDEX LOOKUP

- Percentage of all hits
- 1B keys





ERIS

2,1%

Forward

20,9%



Shared Index

1,9%

Shared

58,4%

ERIS

400

350

300

250

200

150

100

50

0

TECHNISCHE

UNIVERSITÄT

16M

# Key Component: Load Balancer

### **BALANCING STRATEGIES**

Copy Strategy:

copy data within NUMA systems between different sockets

• Link Strategy:

TECHNISCH

UNIVERSITÄT DRESDEN delegate pointer, delay data movment

Throughput [Million/s] -MA1 --- One-Shot --- MA8 Time [s]







# Summary and Conclusion



### Conclusion









DRESDEN concept





#### MODERN HARDWARE DEMANDS SIGNIFICANT RE-THINKING OF DB ARCHITECTURES

- extreme NUMA Systems: "distributed system" with common address space
- customizable processors ("dark silicon"): revival of the database machine?
- communication: optical / wireless & RMDA: blurring the boundaries between scale-out and scale-up
- Non-Volatile RAM: will be a game-changer!
- SLIM AND EXTENSIBLE DATA MANAGEMENT CORE TO
  - efficiently support today's applications requirements (mixed OLTP/OLAP workloads etc.)
  - embrace and exploit capabilities of modern hardware