Collection Profiles

Using Format Profiles to Compare Collections & Registries

Introduction

Here we look at two different ways of exploring and understanding our digital collections.

We'd like to be able to compare our collections with those of other institutions, so we can find our common ground, but also understand what makes us distinctive.
We'd like to be able to compare our collections with the various format registries and identification tools that are out there. This would help us understand which format information sources have the most potential to illuminate our collections.

This page provides interactive tools allowing you to do both of these things, based on some basic statistical information about your collection in the form of a format profile.

A format profile for a collection simply lists all the formats there are, along with a count of how many distinct files or bitstreams appear to be in that format. For example, this is precisely what the UK National Archives File Profiling Tool (DROID) does. For the formats that PRONOM covers, this works very well, and the resulting profile can be analysed within DROID itself, or by using complementary tools like Freud or Demystify.

However, to compare against a wider range of sources, we need to boil things down to the simplest format signature: file extensions. This lets us combine multiple information sources, with all of the benefits and limitations that implies.

A number of institutions have already made suitable file extension collection profiles available, so you can use those to explore this idea. Note that this analysis discards any extensions that appear to be just numbers or contain spaces, but anything else is OK. If you want to look at the source CSV files, you can find them here.

These profiles have been generously generated and shared on a best-effort basis. They may cover all holdings, or not. They may include the results from peeking inside container/archive formats, or not. It's surprisingly difficult to generate this information, and these profiles should not be considered a complete and accurate reflection of all the different items an institution holds.

Crucially, unlike more formal format registries, collection profiles reflect the endlessly inventive chaos of real people doing real things in the real world. These file extensions cannot be trusted, but there's treasure everywhere.

You can also add your own profile to this page, and analyse it without uploading your data anywhere:

How do I create a suitable Extension Profile in CSV format?

Your file extension collection profile should have at least two columns, one called 'extension' and one called 'count'. Other columns will be ignored. This would look something like:

extension	count
PDF	50
DOCX	202
pdf	12
...	...

Note that extensions that are just numbers, or contain spaces, will be dropped. Note also that the system will attempt to convert the extensions into a canonical lower-case format, i.e. *.pdf, but you can just supply the extension e.g. pdf or .pdf and it should work it out.

Note that the count should be supplied as a plain number, e.g. 1024 rather than formatted like e.g. 1,024.

If you want to include files with no extension, please use the special value (no extension) as the value for the extension. If the extension column contains an empty string for a row, it will also be interpreted as (no extension).

Finally, note that if the same (canonical) extension appears multiple times, the file counts from all occurrences will be added together. For example, with the profile above, you'd end up with:

extension	count
*.pdf	62
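To make these rules concrete, here is a minimal Python sketch of how a profile could be normalised. It is not the code this page runs: the function names canonical_extension and load_profile are hypothetical, and it assumes a comma-separated file with a header row.

```python
import csv
import io
from collections import Counter

NO_EXTENSION = "(no extension)"


def canonical_extension(raw):
    """Return the canonical '*.ext' form, or None if the extension is dropped."""
    if raw == "" or raw == NO_EXTENSION:
        return NO_EXTENSION
    ext = raw
    # Accept 'pdf', '.pdf' or '*.pdf' as equivalent inputs.
    if ext.startswith("*."):
        ext = ext[2:]
    elif ext.startswith("."):
        ext = ext[1:]
    ext = ext.lower()
    if ext == "":
        return NO_EXTENSION
    # Drop extensions that are just numbers or that contain spaces.
    if ext.isdigit() or " " in ext:
        return None
    return "*." + ext


def load_profile(csv_text):
    """Read 'extension' and 'count' columns, ignoring any other columns."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        ext = canonical_extension(row["extension"])
        if ext is not None:
            # Counts must be plain numbers such as 1024, not 1,024.
            totals[ext] += int(row["count"])
    return dict(totals)


sample = "extension,count\nPDF,50\nDOCX,202\npdf,12\n"
profile = load_profile(sample)
```

With the sample above, the two PDF rows are merged into a single *.pdf entry with a count of 62.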

If your browser supports it, you can try generating an extension profile of some of your own files with the DigiPres Workbench File System Scanner.

If you have any problems, please get in touch via the contact details on the homepage.

Configuration Options

It usually makes sense to ignore extremely rare file extensions. Often, these are simply errors, but also some collections are so large that dropping some of the 'long tail' of formats helps make the analysis a bit easier. You can see this by changing the value here, and observing how this affects the frequency plot below.

It may be that whoever is generating the format profile is concerned that some personal data may leak out through the file extension, and so extensions are truncated so that there is a limit to how much information they can contain. Generally, this is not needed, but if you know that one of the collections you are interested in has truncated the file extensions, this configuration should be set to match, so that the comparison can be as accurate as possible.

Limit number of extensions included in the analysis, as otherwise the graphs and charts will be overwhelmed.
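As a rough illustration of how these three options could interact, here is a minimal Python sketch. The parameter names (min_count, truncate_to, max_extensions) and the order in which the options are applied are assumptions made for this example, not a description of how this page is implemented.

```python
from collections import Counter


def apply_options(profile, min_count=1, truncate_to=None, max_extensions=None):
    """Filter a {'*.ext': count} profile using three hypothetical options."""
    totals = Counter()
    for ext, count in profile.items():
        # Truncate the part after '*.' so it matches profiles that were
        # shortened before being shared; truncated duplicates are merged.
        if truncate_to is not None and ext.startswith("*."):
            ext = "*." + ext[2:2 + truncate_to]
        totals[ext] += count
    # Ignore extremely rare extensions.
    kept = {ext: count for ext, count in totals.items() if count >= min_count}
    # Keep only the most common extensions, largest counts first.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    if max_extensions is not None:
        ranked = ranked[:max_extensions]
    return dict(ranked)


example = {"*.pdf": 62, "*.docx": 202, "*.doc": 5, "*.xyzzy": 1}
without_rare = apply_options(example, min_count=2)
truncated = apply_options(example, truncate_to=3)
top_two = apply_options(example, max_extensions=2)
```

Note that truncation can merge distinct extensions (here *.docx and *.doc), which is why both sides of a comparison need the same setting.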

Select A Collection Profile

First, we need to select 'our' profile, the primary collection profile we want to compare with other collections and registries:

Summary of Your Collection Profile

This graph provides a summary of the selected format profile.

This gives a reasonable overview, but also hides all the interesting details of what's going on in that long tail of other formats.

Comparing Collections

The simple format profile above is particularly ill-suited to comparing one collection with another, so here we explore some alternative visualisations.

First, we need to select a different profile to compare against:

Given this, we can now build up a comparison. We start by going through every file extension that is in either collection, and recording the percentage of the overall collection that represents, in terms of numbers of files. We do this because using percentages means we can compare collections of very different sizes.
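The percentage step can be sketched in a few lines of Python. This is an illustrative example only: the function names and the two toy profiles are invented here, and profiles are assumed to be dictionaries mapping a canonical extension to a file count.

```python
def percentages(profile):
    """Convert file counts into percentages of the whole collection."""
    total = sum(profile.values())
    if total == 0:
        return {}
    return {ext: 100.0 * count / total for ext, count in profile.items()}


def compare(primary, secondary):
    """One row per extension that appears in either collection."""
    p = percentages(primary)
    s = percentages(secondary)
    return [
        {
            "extension": ext,
            "primary_pct": p.get(ext, 0.0),
            "secondary_pct": s.get(ext, 0.0),
        }
        for ext in sorted(set(p) | set(s))
    ]


# Toy profiles of very different sizes (invented data).
primary = {"*.pdf": 75, "*.docx": 25}
secondary = {"*.pdf": 100, "*.tif": 300}
rows = compare(primary, secondary)
```

Here *.pdf makes up 75% of the primary collection but only 25% of the secondary one, even though the secondary collection holds more PDF files in absolute terms.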

These percentages can be plotted directly against each other for each file format, with the vertical position representing the percentage in our primary collection, and the horizontal position representing the percentage in the secondary collection. Similar collections should appear as a diagonal line, with outliers representing where collections differ.

We can use a 'beeswarm' plot to really focus on the difference in percentages. Here, we calculate the difference between the percentages for each extension, and plot that difference vertically. This means formats that are distinctive of our primary collection appear near the top, and those distinctive of the secondary collection appear at the bottom. Rarer and similar extensions bunch up in the middle.
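The value plotted vertically can be sketched as follows. This is a minimal Python example with invented data; it assumes the difference is taken as primary minus secondary, which matches the description of primary-distinctive formats appearing at the top.

```python
def percentage_differences(primary, secondary):
    """Return (extension, primary % minus secondary %) pairs, largest first."""
    p_total = sum(primary.values())
    s_total = sum(secondary.values())
    diffs = {}
    for ext in set(primary) | set(secondary):
        p_pct = 100.0 * primary.get(ext, 0) / p_total if p_total else 0.0
        s_pct = 100.0 * secondary.get(ext, 0) / s_total if s_total else 0.0
        diffs[ext] = p_pct - s_pct
    # Positive values are distinctive of the primary collection,
    # negative values are distinctive of the secondary collection.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))


# Toy profiles (invented data).
primary = {"*.pdf": 75, "*.docx": 25}
secondary = {"*.pdf": 100, "*.tif": 300}
differences = percentage_differences(primary, secondary)
```

Extensions with a difference close to zero are either rare in both collections or similarly common in both, which is why they cluster in the middle of the plot.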

Comparison Data

The comparison data used to generate the above plots can be viewed and downloaded here:

Take care to note which collection is the primary and which is the secondary.

Tool & Registry Coverage

We can use similar methods to compare our primary collection profile with the available format registries and identification tools. This should help us understand what tools might be able to help analyse our collections.

We look at answering this in two ways. Firstly, what single additional tool or registry should I consider, in order to identify as many files as possible? Secondly, if I used all the available tools and registries, what kind of format coverage might I get?

Adding One Registry

Here, we take your selected collection profile, and work out how much coverage of that set of extensions each registry or tool offers.

As most digital preservation systems and workflows will already include an identification step using PRONOM data (e.g. via DROID, Siegfried or Fido), we start by comparing everything to PRONOM and consider those extensions to be covered. If that doesn't suit you, you can switch it off here:

Then, we get to choose which of the other supported tools and registries to consider. This defaults to 'all of them', but you might want to switch off ones you don't want to use.

Given this starting point, what tool/registry might help understand the largest number of files? We can start by plotting the number of recognised extensions and the corresponding total file count for each one:

The underlying data is shown here, and you can select one of the rows to see what extensions are being matched by each tool.
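The calculation behind this comparison can be sketched roughly as follows. This Python example is an assumption about the general approach rather than the page's actual code: each registry is modelled as a plain set of extensions, and the registry names and placeholder extensions (*.aaa and so on) are invented, so they say nothing about what any real registry contains.

```python
def additional_coverage(profile, registries, baseline="pronom"):
    """For each registry, count extensions and files not already covered."""
    covered = registries.get(baseline, set()) if baseline else set()
    remaining = {e: c for e, c in profile.items() if e not in covered}
    results = []
    for name, extensions in registries.items():
        if name == baseline:
            continue
        matched = sorted(e for e in remaining if e in extensions)
        results.append({
            "registry": name,
            "extensions": matched,
            "num_extensions": len(matched),
            "file_count": sum(remaining[e] for e in matched),
        })
    # Most helpful registry (by number of files) first.
    results.sort(key=lambda r: (-r["file_count"], r["registry"]))
    return results


# Invented placeholder data.
profile = {"*.aaa": 100, "*.bbb": 40, "*.ccc": 10, "*.ddd": 5}
registries = {
    "pronom": {"*.aaa"},
    "registry_x": {"*.aaa", "*.bbb"},
    "registry_y": {"*.bbb", "*.ccc", "*.ddd"},
}
ranking = additional_coverage(profile, registries)
```

Passing baseline=None corresponds to switching off the PRONOM starting point, so every registry is measured against the whole profile.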

Using All The Registries

Rather than just using one registry, what if we tried them all? Here, we run the analysis above multiple times, and each time around, we take the registry that provides the greatest improvement in overall coverage.

As a table, we can see what happens at each stage, and how the total number of files without any potential matches drops each time.

Plotting that as a graph, we can see the overall benefit each tool brings.
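The repeated selection described in this section is a greedy strategy, which can be sketched in Python as below. The function name, the optional first argument, and the placeholder registries and extensions are all invented for illustration.

```python
def greedy_coverage(profile, registries, first=None):
    """Repeatedly pick the registry that covers the most remaining files.

    Returns a list of (registry, files_gained, files_still_unmatched).
    """
    remaining = dict(profile)
    unused = {name: set(exts) for name, exts in registries.items()}
    steps = []

    def take(name):
        extensions = unused.pop(name)
        matched = [e for e in remaining if e in extensions]
        gained = sum(remaining.pop(e) for e in matched)
        steps.append((name, gained, sum(remaining.values())))

    if first is not None and first in unused:
        take(first)
    while unused:
        gains = {
            name: sum(c for e, c in remaining.items() if e in exts)
            for name, exts in unused.items()
        }
        # Ties are broken alphabetically by registry name.
        best = max(sorted(gains), key=lambda name: gains[name])
        if gains[best] == 0:
            break  # No remaining registry adds any coverage.
        take(best)
    return steps


# Invented placeholder data.
profile = {"*.aaa": 100, "*.bbb": 40, "*.ccc": 10, "*.ddd": 5, "*.eee": 3}
registries = {
    "pronom": {"*.aaa"},
    "registry_x": {"*.aaa", "*.bbb"},
    "registry_y": {"*.bbb", "*.ccc", "*.ddd"},
}
steps = greedy_coverage(profile, registries, first="pronom")
```

In this toy case registry_x never appears in the result, because everything it could match has already been covered by the time it would be considered.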

Unique Extensions

Finally, we can look at the unique extensions: those that are in your collection profile, but do not appear to be in any of the thousands of format records aggregated across all the registries. These don't have a Registry ID or a link to the Format Index, because they do not appear in any of the sources we have.
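In set terms, this is simply the part of your profile that is left after removing every extension known to any registry. A minimal Python sketch, using invented placeholder data and a hypothetical function name:

```python
def unique_extensions(profile, registries):
    """Extensions in the profile that appear in none of the registries."""
    known = set().union(*registries.values())
    unmatched = {e: c for e, c in profile.items() if e not in known}
    # Largest file counts first, as these are the most pressing gaps.
    return sorted(unmatched.items(), key=lambda item: (-item[1], item[0]))


# Invented placeholder data.
profile = {"*.aaa": 100, "*.bbb": 40, "*.eee": 3, "*.fff": 7}
registries = {
    "registry_x": {"*.aaa", "*.bbb"},
    "registry_y": {"*.bbb", "*.ccc"},
}
unmatched = unique_extensions(profile, registries)
```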

As this data comes from real collections, many of these will reflect the myriad ways file extensions are used and abused in the wild. Nevertheless, the findings so far seem to show that every reasonably large collection has a significant number of files with genuine format extensions that are not in any registry!

This distribution of formats is important for the wider community to analyse, in order to understand how best to address the format identification problem. So, please get in touch if you are able to share your collection's format profiles!

Feedback & Futures

This is a first prototype of this kind of analysis tool, and we are keen to hear your feedback on what works, what doesn't, and what a future version could look like!

It will be launched at iPRES 2024, as part of the Digital Preservation Registries: What We Have & What We Need workshop. If you see us at the conference, you are encouraged to ask us to walk you through using this tool and to talk to us about sharing your own format profiles.

You can also get in touch with us directly. See the contact details on the homepage.