Digital Preservation Workbench 1.2.0 GitHub️

Format Diversity Estimation

Estimating the total number of digital formats

This analysis uses data from the various registries to build a Species Accumulation Curve, which is an approach used in ecology to estimate the number of species in a given ecosystem[1]. If we treat each format registry as a random sample of the whole ecosystem of data formats, we can combine the series of samples and count how many new formats are added each time a sample is added to the overall set. Overall, the number of new formats being discovered would be expected to decrease as more samples are added, roughly converging towards an estimate for the total global diversity of digital data formats.

The analysis on this page is fairly complicated, so it might take a few moments for it to run and display.

Method

We start by looking at the overall 'uniqueness' of each registry, by looking at how many extensions are unique to each registry.

We then sort the registries in order of descending overall sample size (e.g. how many distinct extensions each holds), and treat each one as a sample of the total set of data formats. We add each in turn, counting the new extensions at each step, with each sample.

Finally, we fit a curve to this data based on the expected form, and extrapolate to determine an estimate for the total number of data formats.

Because we are reducing all the registry data down to file extensions, the assumptions and caveats outlined here should be kept in mind. In particular, given that multiple distinct formats may use the same file extension, this analysis should be considered as establishing a conservative lower-bound on the number of formats.

Uniqueness

To understand how distinct the holdings of each registry are, we can plot the percentage of unique entries versus the total number of entries. If all registries were truly random samples of the totality of data formats, we would expect a broadly linear trend. In other words, we would expect larger registries to contain a larger percentage of unique entries.

This plot shows that different registries have their own character. For example, as one might expect the GitHub Linguist set has a large number of distinct entries, especially given it's relatively small size. This is likely because it specialises in looking at source code files and related formats likely to be found in GitHub repositories.

It is notable that the fdd and pronom have relatively low percentages of unique extensions. This makes sense as it reflects the way those efforts have tended to prioritise coverage of the most widely used formats.

Species Accumulation Curve

The results of the final species accumulation curve are shown below.

This indicates that a conservative lower-bound on the total number of formats we might expect to come across is around 12,000, but given the variation between the data points and the fitted curve, this is only an approximate answer.

Conclusions

Even allowing for a significant degree of error in the estimation process, it is clear that the total number of formats we wish to understands is well in excess of even the many thousands of records in WikiData.

This has significant consequences for the handling of digital material. These include:

Further work is required to understand:

The work on Using Collection Profiles aims to help explore some of these issues.


  1. Ugland, K.I., Gray, J.S. and Ellingsen, K.E. (2003), The species–accumulation curve and estimation of species richness. Journal of Animal Ecology, 72: 888-897. https://doi.org/10.1046/j.1365-2656.2003.00748.x ↩︎