Extreme Large Scale Data Management Problems - Entity Resolution

András Benczúr


Business intelligence, e-science and Web mining are rapidly growing sources of extreme large scale data management problems. Scalability issues arise from two main sources. Certain problems such as Web or transactional log processing are data intensive where reading vast amounts of data itself forms a bottleneck. Others such as machine learning or image processing are computational intensive as they require complex algorithms run on large data that may not fit into the internal memory of a single machine. Finally certain problems including network analysis, data segmentation and entity resolution fall into both categories. 

In the presentation, entity resolution will be the main example to illustrate the challenges and emerging opportunities for managing big data. Entity resolution (ER) is a computationally hard data integration problem where database records have to be merged to represent real-world entities. We show simple reductions to communication complexity and data streaming lower bounds to illustrate the difficulties with a distributed implementation. We demonstrate that ER can be solved using algorithms with three different distributed computing paradigms: distributed key-value stores, MapReduce, and Bulk Synchronous Parallel.

Uniform Evaluation of Nonmonotonic Description Logic Programs

Thomas Eiter


Nonmonotonic description logic programs are a major formalism for a loose coupling of rules and ontologies, formalized in logic programming and description logics, respectively. While this approach is attractive for combining systems, the impedance mismatch between different reasoning engines and the API style interfacing are an obstacle to efficient evaluation of dl-programs in general. Uniform evaluation circumvents this by transforming programs into a single formalism, which can be evaluated on a single reasoning engine. In this talk, we consider recent and ongoing work on this approach which uses relational first-order logic (and thus relational database engines) and datalog as target formalisms. Experimental data show that significant performance gains are possible compared to and suggest the potential of this approach.

Logic and Automata-Based Foundations of XML: A Snapshot

Thomas Schwentick


XML query and schema languages have some obvious connections to Formal Language Theory. For example, Document Type Definitions (DTDs) use regular expressions, DTDs can be viewed as grammars and XML Schemas resemble tree automata. Likewise, there are immediate links to Logic, e.g., through the classical characterization of regular tree languages by Monadic Second Order Logic. 
It is therefore not surprising that concepts from Logic and Formal Language Theory played an important role in the development of the theoretical foundations of XML query and schema languages. For example, they helped to analyze the expressiveness of languages, to understand the restrictions posed by the W3C standards, and to develop algorithms for various purposes, including efficient evaluation and static analysis. 
However, methods from Logic and Formal Languages have not merely been applied to XML Theory, the fertilization worked both ways. The specifics of XML posed a lot of new challenges for Logic and Formal Language Theory and triggered various new research lines, e.g., the study of deterministic regular expressions or the development of automata for trees with data values. 
The aim of the talk is to survey the fundamental connections between XML query and schema languages and Logic and Formal Language Theory, to report on recent developments in the area, and to highlight some current directions of research.