Will Formal Preservation Models Require Relative Identity? An exploration of data identity statements

Abstract

The problem of identifying and re-identifying data puts the notion of “same data” at the very heart of preservation, integration, interoperability, and many other fundamental data curation activities. However, it is also a profoundly challenging notion, because the concept of data itself lacks a precise and univocal definition. When science is conducted in small communicating groups with homogeneous data, these ambiguities seldom create problems, and solutions can be negotiated in casual real-time conversations. However, when the data is heterogeneous in encoding, content, and management practices, these problems can produce costly inefficiencies and lost opportunities. We consider here the relative identity view, which apparently provides the most natural interpretation of common identity statements about digitally-encoded data. We show how this view conflicts with the curatorial and management practice for “data” objects, in terms of both their modeling and common knowledge representation strategies. In what follows we focus on a single class of identity statements about digitally-encoded data: “same data but in a different format”. As a representative example of the use of this kind of statement, consider the dataset “Federal Data Center Consolidation Initiative (FDCCI) Data Center Closings 2010-2013”, available at Data.gov. Anyone can “Download a copy of this dataset in a static format”. The available formats include CSV, RDF, RSS, XLS, and XML. Each of these is presumably an encoding of the “same data”. We explore three approaches to formalization in first-order logic, and for each we identify distinctive tradeoffs for preservation models. Our analysis further motivates the development of a system that will provide a comprehensive treatment of data concepts.

Details

Creators
Sacchi, Simone; Wickett, Karen M.; Renear, Allen H.
Institutions
Date
Keywords
ischool; toronto; canada; data; identity; scientific equivalence; data curation; digital preservation
Publication Type
paper
License
CC BY-NC-SA 3.0 AT
