Digital Preservation Workbench 1.2.0

Comparing Format Registries

Ways of exploring what formats are where...

Introduction

While PRONOM is rightly considered the 'gold standard' in format identification for digital preservation, other sources of information can also be useful. On this page, we look at ways of comparing the different format registries, to help understand their contents and differences.

At the simplest level, we can compare them based on the number of records and file extensions they contain.

These counts provide an overview, but they don't show how much the registries overlap. For example, given how many entries WikiData has, does it cover everything in the other registries?
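As a rough illustration of why counts alone cannot settle that question, here is a minimal Python sketch. The extension sets are made up for the example and are not taken from the real registries, and `registry_c`, `extension_counts` and `missing_from` are hypothetical names introduced here.

```python
# Hypothetical sample data: these extension sets are invented for illustration
# and do not reflect the real contents of any registry.
registries = {
    "pronom": {"pdf", "tiff", "wav", "warc"},
    "wikidata": {"pdf", "tiff", "wav", "docx", "png", "svg"},
    "registry_c": {"pdf", "tiff", "jp2"},  # placeholder for a third registry
}


def extension_counts(registries):
    """Return the number of distinct extensions held by each registry."""
    return {name: len(exts) for name, exts in registries.items()}


def missing_from(big, small):
    """Return the extensions in `small` that `big` does not contain."""
    return small - big


counts = extension_counts(registries)

# The largest set in this sample still lacks entries found in the smaller ones.
gaps = {
    name: sorted(missing_from(registries["wikidata"], exts))
    for name, exts in registries.items()
    if name != "wikidata"
}

print(counts)
print(gaps)
```

In this sample the largest set has the highest count but is still missing one extension from each of the smaller sets, which is exactly what a simple count hides.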

Venn Diagrams

One well-known way to compare things like sets of extensions is to use a Venn diagram, where the area in which the circles overlap represents the entries that the sets have in common.

This works well for up to three sets, as we can see here. The diagram makes it clear that, despite its size, the WikiData set of extensions does not totally subsume the other two registries, each of which has entries unique to it.

The only problem is that, in general, Venn diagrams drawn with circles cannot show every possible overlap for more than three sets at once.
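The regions of a three-set Venn diagram can also be computed directly with set operations. The following is a minimal sketch, assuming each registry has been reduced to a plain set of extensions. The function name `venn3_regions` and the sample data are hypothetical and are not part of the workbench.

```python
def venn3_regions(a, b, c):
    """Split three sets into the seven disjoint regions of a Venn diagram."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "a_b_c": a & b & c,
    }


# Hypothetical sample data, not real registry contents.
pronom = {"pdf", "tiff", "wav", "warc"}
wikidata = {"pdf", "tiff", "wav", "docx", "png", "svg"}
registry_c = {"pdf", "tiff", "jp2"}

regions = venn3_regions(pronom, wikidata, registry_c)
for name, members in regions.items():
    print(name, sorted(members))
```

Three sets give seven regions. Each additional set doubles the number of possible combinations, which is part of why the diagram becomes hard to draw and read beyond three.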

UpSet Plots

In recent years, the invention of the UpSet Plot has provided a new way to explore this kind of problem.

This type of diagram enumerates and ranks all the combinations of sets, which makes them easier to explore. It can still be quite overwhelming, so you can use these controls to choose which sets are shown.
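The enumeration and ranking behind such a plot can be sketched in a few lines of Python. This is a minimal illustration, not the code used by this page. It assumes exclusive intersections (each extension is counted only under the exact combination of sets that contain it), which is one common convention for UpSet plots. The names `exclusive_intersections`, `ranked`, `registry_c` and `registry_d`, and all of the sample data, are hypothetical.

```python
from itertools import combinations


def exclusive_intersections(sets):
    """Map each combination of set names to the items found in exactly those sets."""
    names = sorted(sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Items present in every set of the combination...
            inside = set.intersection(*(sets[n] for n in combo))
            # ...minus items that also appear in any set outside it.
            outside = set().union(*(sets[n] for n in names if n not in combo))
            members = inside - outside
            if members:
                result[combo] = members
    return result


def ranked(intersections):
    """Order combinations from largest to smallest, breaking ties by name."""
    return sorted(intersections.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Hypothetical sample data with four sets, one more than a circle Venn diagram handles.
sample = {
    "pronom": {"pdf", "tiff", "wav", "warc"},
    "wikidata": {"pdf", "tiff", "wav", "docx", "png", "svg"},
    "registry_c": {"pdf", "tiff", "jp2"},
    "registry_d": {"pdf", "png", "xyz"},
}

for combo, members in ranked(exclusive_intersections(sample)):
    print(" & ".join(combo), len(members), sorted(members))
```

Because every extension lands in exactly one combination, the sizes add up to the size of the union, and empty combinations can simply be dropped from the display.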

You can also add your own set of file extensions, if you like:

If you select one of the sets or combinations above, the list of extensions (up to the first one thousand) will be shown below.
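For readers who want to reproduce these two steps offline, here is a minimal sketch. How this page actually parses a custom list is not documented here, so the input format (extensions separated by commas or whitespace, with an optional leading dot), the sorted ordering, and the names `parse_extensions` and `extensions_to_show` are all assumptions. Only the limit of one thousand comes from the text above.

```python
MAX_SHOWN = 1000  # the text above says the list is capped at one thousand


def parse_extensions(text):
    """Normalise free-form input into a set of lower-case extensions.

    Assumes extensions are separated by commas or whitespace and may
    carry a leading dot, e.g. ".PDF, tiff .Wav".
    """
    tokens = text.replace(",", " ").split()
    return {t.lstrip(".").lower() for t in tokens if t.strip(".")}


def extensions_to_show(selected, limit=MAX_SHOWN):
    """Return the selected extensions in sorted order, truncated to `limit`."""
    return sorted(selected)[:limit]


my_set = parse_extensions(".PDF, tiff  .Wav")
print(extensions_to_show(my_set))
```

A set parsed this way can be added alongside the registry sets before computing the combinations, so that its overlaps are ranked with the rest.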