We are pleased to announce this year's joint EDBT/ICDT keynote speakers:
Authors: Azza Abouzied, Daniel J. Abadi, Avi Silberschatz
HadoopDB began as a research effort in 2008 to transform Hadoop, a batch-oriented scalable system designed for processing unstructured data, into a full-fledged parallel database system that can achieve real-time (interactive) query responses across both structured and unstructured data. In 2010 it was commercialized by Hadapt, a start-up formed to accelerate the engineering of the HadoopDB ideas and to harden the codebase for deployment in real-world, mission-critical applications.
In this talk I will give an overview of HadoopDB and of how it combines ideas from the Hadoop and database system communities. I will then describe some research challenges that have emerged as HadoopDB is increasingly deployed in the real world. Many of these challenges involve loading data into structured storage. Although loading data can greatly reduce query execution times, its upfront cost is antithetical to the Hadoop premise that data need not be organized, cleaned, and pre-processed before being available for query processing. I will therefore discuss two approaches to reducing these costs: (1) an invisible loading technique, where data is incrementally loaded into structured storage over time based on users’ patterns of data access, and (2) a queue-based locality scheduling technique that, when data has been loaded in a heterogeneous manner across the nodes in a cluster, improves upon Hadoop’s greedy scheduler and more efficiently assigns tasks to nodes that have the data stored locally.
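To make the idea of invisible loading concrete, here is a minimal toy sketch in Python. It is not HadoopDB's implementation: the class `InvisibleLoader`, the `budget` parameter, and the block layout are all hypothetical, and "access pattern" is reduced to "a scan touched this block". The only point it illustrates is that each query pays a small, bounded loading cost, so the structured store fills up gradually instead of through one upfront load.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, ...]


class InvisibleLoader:
    """Toy model (hypothetical names): raw comma-separated blocks are parsed
    on every scan, and each scan also moves at most `budget` not-yet-loaded
    blocks into a structured in-memory store as a side effect."""

    def __init__(self, raw_blocks: List[List[str]], budget: int = 1) -> None:
        self.raw_blocks = raw_blocks
        self.budget = budget
        # block index -> parsed records (stands in for "structured storage")
        self.loaded: Dict[int, List[Record]] = {}

    def scan(self) -> List[Record]:
        """Return all records; piggyback a bounded amount of loading."""
        remaining = self.budget
        out: List[Record] = []
        for i, block in enumerate(self.raw_blocks):
            if i in self.loaded:
                # Fast path: the block was loaded by an earlier query.
                out.extend(self.loaded[i])
                continue
            # Slow path: parse the raw text for this query.
            parsed = [tuple(line.split(",")) for line in block]
            if remaining > 0:
                # Keep the parsed form so later queries skip the parsing.
                self.loaded[i] = parsed
                remaining -= 1
            out.extend(parsed)
        return out

    def fraction_loaded(self) -> float:
        """Share of blocks that already live in the structured store."""
        if not self.raw_blocks:
            return 1.0
        return len(self.loaded) / len(self.raw_blocks)


loader = InvisibleLoader([["1,a", "2,b"], ["3,c"], ["4,d"]], budget=1)
for _ in range(3):
    loader.scan()               # every scan returns the same four records
print(loader.fraction_loaded())  # all three blocks are loaded after three scans
```

With `budget=1`, every scan returns the full result while loading one more block, so no single query bears the whole loading cost.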
Jan van den Bussche
Universiteit Hasselt, Belgium
This talk presents an overview of our work on databases in DNA performed over the past four years, joint with my student Joris Gillis and postdoc Robert Brijder. Our goal is to better understand, at a theoretical level, the database aspects of DNA computing. The talk will be self-contained and will begin with an introduction to DNA computing. We then introduce a graph-based data model of so-called sticker DNA complexes, suitable for the representation and manipulation of structured data in DNA. We also define DNAQL, a restricted programming language over sticker DNA complexes. DNAQL stands to general DNA computing as the standard relational algebra for relational databases stands to general-purpose conventional computing. We show how DNAQL programs can be statically typechecked. Thus, nonterminating reactions, as well as other things that could go wrong during DNA manipulation, can be avoided. We also investigate the expressive power of DNAQL and show how it compares to the relational algebra.
In this talk, I describe some of the recent developments in the database management area, in particular the NoSQL phenomenon and the hoopla associated with it. The goal is not to do an exhaustive survey of NoSQL systems. The aim is to do a broad-brush analysis of what these developments mean: the good and the bad aspects! Based on my more than three decades of database systems work in the research and product arenas, I will outline many of the pitfalls to avoid, since there is currently a mad rush to develop and adopt a plethora of NoSQL systems in a segment of the IT population, including the research community. In rushing to develop these systems to overcome some of the shortcomings of the relational systems, many good principles of the latter, which go beyond the relational model and the SQL language, have been left by the wayside. Now many of the features that were initially discarded as unnecessary in the NoSQL systems are being brought in, but unfortunately in ad hoc ways. Hopefully, the lessons learnt over three decades with relational and other systems will not go to waste, and we will not let history repeat itself with respect to simple-minded approaches leading to enormous pain later on for developers as well as users of the NoSQL systems!
INRIA & ENS-Cachan
We survey recent results about enumerating with constant delay the answers to a query over a database. More precisely, we focus on the case when enumeration can be achieved with a preprocessing phase running in time linear in the size of the database, followed by an enumeration phase outputting the answers one by one with constant time between any two consecutive outputs. We survey classes of databases and classes of queries for which this is possible. We also mention related problems such as computing the number of answers or sampling the set of answers.
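As a small illustration of the two-phase scheme described above, the following Python sketch enumerates the answers of one simple query, Q(x, y, z) :- R(x, y), S(y, z). It is a standard textbook-style example and not taken from the talk. The function names are hypothetical, R and S are assumed to be duplicate-free lists of pairs, and "linear time" holds only in expectation because the sketch relies on hash tables.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Tuple

Pair = Tuple[Hashable, Hashable]


def preprocess(r: List[Pair], s: List[Pair]) -> Tuple[List[Pair], Dict[Hashable, list]]:
    """Preprocessing phase: one pass over S and one pass over R."""
    s_by_y: Dict[Hashable, list] = defaultdict(list)
    for y, z in s:
        s_by_y[y].append(z)
    index = dict(s_by_y)  # plain dict: lookups never create new keys
    # Semijoin reduction: drop R-tuples that have no partner in S.
    # Afterwards every remaining R-tuple yields at least one answer.
    r_reduced = [(x, y) for (x, y) in r if y in index]
    return r_reduced, index


def enumerate_answers(r_reduced: List[Pair], index: Dict[Hashable, list]) -> Iterator[tuple]:
    """Enumeration phase: because no R-tuple is a dead end, only a constant
    amount of work happens between two consecutive outputs."""
    for x, y in r_reduced:
        for z in index[y]:
            yield (x, y, z)


def count_answers(r_reduced: List[Pair], index: Dict[Hashable, list]) -> int:
    """Related problem: the number of answers, without enumerating them."""
    return sum(len(index[y]) for _, y in r_reduced)


r = [(1, "a"), (2, "b"), (3, "c")]
s = [("a", 10), ("a", 11), ("c", 12)]
r_reduced, index = preprocess(r, s)
for answer in enumerate_answers(r_reduced, index):
    print(answer)
```

The semijoin step is what bounds the delay: without it, the enumeration could skip over arbitrarily many R-tuples that have no matching S-tuple before producing the next answer.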