Towards the Twilight of File-Centricity
Niklas Griessbaum, PhD Candidate, Bren School
PHD DISSERTATION DEFENSE
Advisor: James Frew
Committee: Jeff Dozier, Amr El Abbadi
This defense will take place in person. Join us in Bren Hall 1424 or watch online using this link and the passcode "data".
File-centricity is a paradigm in which files are the smallest unit of data. File-centricity has two significant advantages: 1) Files package data and thus allow data to be stored and distributed agnostic of their content. 2) Files provide a natural identity and even an identifier (the filename) to data, allowing us to reference and de-reference data. However, file-centricity leaves it to the individual data user to interpret the structure of file contents and align diverse data during extract, transform, and load (ETL) processes.
My thesis is that the content-structure-agnostic nature of files causes unnecessary bottlenecks in the flow from data to knowledge in environmental sciences. Unblocking those bottlenecks requires moving data processing paradigms away from file-centricity and towards data-centricity. In my dissertation, I address the "twilight of file-centricity" and the technologies required to transition from file-centricity to data-centricity.
Moving towards data-centricity requires replacing files with individual observations as the smallest unit of data. In practical terms, this means storing data in a predefined schema in some form of database. However, this requires 1) the ability to identify data (rather than files), and 2) data to be aligned, meaning attributes and dimensions have to be harmonized across datasets, allowing data comparison and association.
My dissertation presents solutions to these two challenges and applies them in a science use case: 1) With the web service "Open-source Project for a Network Data Access Protocol (OPeNDAP) Citation Creator (OCCUR)", I demonstrate how data queried through OPeNDAP servers can be assigned identities that can be referenced and de-referenced. 2) The Spatio-Temporal Adaptive-Resolution Encoding (STARE) software collection enables data-centric science. The collection contains software to spatiotemporally align data by using the universal spatiotemporal representation STARE. It further contains software to perform geospatial analysis, as well as various storage backends. 3) In a science use case, I explore how spatiotemporal alignment of data can help simplify and improve environmental data science, and I demonstrate how analysis in a data-centric world can be carried out.
In summary, this thesis provides solutions to central requirements for moving towards data-centricity and into the twilight of files.
Niklas Griessbaum is a PhD candidate studying the broader topic of environmental informatics: the application of information technology to environmental sciences. He is interested in technological solutions that make it possible to work gracefully with increasingly diverse and increasingly large volumes of data. Niklas Griessbaum holds a master's degree in Mechanical Engineering from the Karlsruhe Institute of Technology (KIT) in Karlsruhe, Germany.
Before enrolling in the Bren PhD program, Niklas worked in research and development for Électricité de France (EDF), the European Institute for Energy Research (EIFER), and the KIT as a geospatial analyst, scientific programmer, and data engineer in the fields of domestic energy demand, distributed co-generation, and market intelligence.
After earning his PhD, Niklas is looking forward to teaching geographic information systems (GIS) at the Bren School, working as a programmer for Bayesics LLC and OPeNDAP, and collaborating with Bren alum Sam Collie at Natural Capital Consulting LLC.