Jeff Dozier, Amr El Abbadi
Dissertation Title & Abstract
Towards the Twilight of File-Centricity
File-centricity is a paradigm in which files are the smallest unit of data. It has two significant advantages: 1) Files package data, allowing data to be stored and distributed independently of their content. 2) Files give data a natural identity and even an identifier (the filename), allowing us to reference and de-reference data. However, file-centricity leaves it to the individual data user to interpret the structure of file contents and to align diverse data during extract, transform, and load (ETL) processes.
My thesis is that the content-structure-agnostic nature of files causes unnecessary bottlenecks in the flow from data to knowledge in the environmental sciences. Removing those bottlenecks requires moving data processing paradigms away from file-centricity and towards data-centricity. In my dissertation, I address the "twilight of file-centricity" and the technologies required to transition from file-centricity to data-centricity.
Moving towards data-centricity requires replacing files with individual observations as the smallest unit of data. In practical terms, this means storing data in a predefined schema in some form of database. However, this requires 1) the ability to identify data (rather than files), and 2) the alignment of data, meaning that attributes and dimensions have to be harmonized across datasets so that data can be compared and associated.
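As a rough illustration of what "observations in a predefined schema" could look like, the following minimal sketch stores one row per observation in an in-memory SQLite database and associates two variables through their shared time and location. The table name, column names, dataset names, values, and the integer `cell` index are all hypothetical placeholders chosen for this example; they do not reproduce the schema or the encoding used in the dissertation.

```python
import sqlite3

# Hypothetical harmonized schema: one row per observation, not per file.
# "cell" stands in for some shared spatial index; it is an illustrative
# integer, not an actual encoding from the dissertation.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observation (
        id       INTEGER PRIMARY KEY,
        dataset  TEXT    NOT NULL,
        variable TEXT    NOT NULL,
        time_utc TEXT    NOT NULL,
        cell     INTEGER NOT NULL,
        value    REAL    NOT NULL
    )
    """
)

# Made-up observations from two made-up datasets, already aligned to the
# same time stamps and spatial cells.
rows = [
    ("snow_product", "snow_cover_fraction", "2020-01-01T00:00:00Z", 101, 0.8),
    ("snow_product", "snow_cover_fraction", "2020-01-01T00:00:00Z", 102, 0.1),
    ("temp_product", "air_temperature_k", "2020-01-01T00:00:00Z", 101, 268.5),
    ("temp_product", "air_temperature_k", "2020-01-01T00:00:00Z", 102, 275.0),
]
conn.executemany(
    "INSERT INTO observation (dataset, variable, time_utc, cell, value) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def associate(connection, var_a, var_b):
    """Pair observations of two variables that share time and cell."""
    query = """
        SELECT a.time_utc, a.cell, a.value, b.value
        FROM observation AS a
        JOIN observation AS b
          ON a.time_utc = b.time_utc AND a.cell = b.cell
        WHERE a.variable = ? AND b.variable = ?
        ORDER BY a.cell
    """
    return connection.execute(query, (var_a, var_b)).fetchall()


pairs = associate(conn, "snow_cover_fraction", "air_temperature_k")
for time_utc, cell, snow, temperature in pairs:
    print(time_utc, cell, snow, temperature)
```

Because both datasets share the same time and location columns, comparing them reduces to a join. No per-file parsing or regridding step appears in the analysis code, which is the point of aligning data before storage.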
My dissertation presents solutions to both challenges, identification and alignment, and applies them in a use case: 1) With the web service "Open-source Project for a Network Data Access Protocol (OPeNDAP) Citation Creator (OCCUR)", I demonstrate how data queried through OPeNDAP servers can be assigned identities that can be referenced and de-referenced. 2) The Spatio-Temporal Adaptive-Resolution Encoding (STARE) software collection enables data-centric science. The collection contains software to spatiotemporally align data using the universal spatiotemporal representation STARE, as well as software for geospatial analysis and various storage backends. 3) In a science use case, I explore how spatiotemporal alignment of data can help simplify and improve environmental data science, and I demonstrate how analysis in a data-centric world can be carried out.
In summary, this thesis provides solutions to central requirements for moving towards data-centricity and into the twilight of files.
Diplom-Ingenieur (BS, MS Mechanical Engineering), Karlsruhe Institute of Technology