Digital Preservation Workbench 1.2.0

Publications

Introduction

As part of the Registries of Good Practice project we are building an index of publications related to digital preservation. That process involves bringing the index data together as an SQLite database.

This page is an initial proof of concept showing how analysis and visualisation tools can be applied to that database to help us understand and explore the data. It runs in your browser, downloads the database, and generates graphs based on SQL queries.

Note that the index currently contains only the first version of the iPRES conference proceedings dataset. There may be some data quality issues due to how the data has been collected, which these visualisations may help to uncover!

iPRES Publication Types Over Time

This graph uses a simple SQL query to group and count all the types of publication over time.

SELECT year, type, COUNT(*) as count FROM publications GROUP BY year, type;

The results can then be presented as a stacked bar chart.
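As a rough illustration of the step from query results to a stacked bar chart, the sketch below runs the same query (with an ORDER BY added for stable output) against a tiny in-memory database and pivots the rows into one series per publication type. The sample rows and the stacked_series helper are hypothetical; only the table name and the year and type columns come from the query above.

```python
import sqlite3
from collections import defaultdict

QUERY = (
    "SELECT year, type, COUNT(*) AS count FROM publications "
    "GROUP BY year, type ORDER BY year, type"
)

def stacked_series(conn):
    """Return (years, {type: [count per year]}) ready for a stacked bar chart."""
    rows = conn.execute(QUERY).fetchall()
    years = sorted({year for year, _, _ in rows})
    by_type = defaultdict(dict)
    for year, pub_type, n in rows:
        by_type[pub_type][year] = n
    # Fill in zero where a type has no publications in a given year.
    series = {t: [c.get(y, 0) for y in years] for t, c in by_type.items()}
    return years, series

# Hypothetical sample data standing in for the real index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (year INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO publications VALUES (?, ?)",
    [(2019, "paper"), (2019, "paper"), (2019, "poster"),
     (2020, "paper"), (2020, "workshop")],
)
years, series = stacked_series(conn)
```

Each list in series lines up with years, so a plotting library can stack the lists on top of one another, one bar per year.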

iPRES Keywords Over Time

Here we use a more complex query to unpack the array of keywords associated with each paper, and then count how often each keyword is used each year. We then limit the result set to keywords that are used at least a few times in a given year, to keep the size of the data set manageable and focussed.
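One possible shape for such a query is sketched below. It assumes that the keywords are stored as a JSON array in a keywords column and that SQLite's JSON functions are available. Both are assumptions, as are the sample rows, the keyword_counts helper and the threshold of two uses per year.

```python
import json
import sqlite3

def keyword_counts(conn, min_count):
    """Count keyword usage per year, keeping only keywords used min_count+ times."""
    query = """
        SELECT p.year, k.value AS keyword, COUNT(*) AS n
        FROM publications AS p, json_each(p.keywords) AS k
        GROUP BY p.year, k.value
        HAVING COUNT(*) >= ?
        ORDER BY p.year, keyword
    """
    return conn.execute(query, (min_count,)).fetchall()

# Hypothetical sample data; the keywords column layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (year INTEGER, keywords TEXT)")
sample = [
    (2019, ["emulation", "metadata"]),
    (2019, ["emulation"]),
    (2020, ["metadata", "web archiving"]),
    (2020, ["web archiving"]),
]
conn.executemany(
    "INSERT INTO publications VALUES (?, ?)",
    [(year, json.dumps(kws)) for year, kws in sample],
)
result = keyword_counts(conn, 2)
```

Here json_each expands each array into one row per keyword, and the HAVING clause applies the per-year threshold after grouping.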

The data can then be visualised as an ordinal scatterplot, where ongoing usage should show up as horizontal sequences of dots, with the dot size indicating the number of papers using that keyword in that year.

The text is quite small on this visualisation so you might find it needs a large screen.

This plot vividly illustrates that publication keywords have been used in different ways over the years.

iPRES Keyword Statistics

First we look at the number of publications for each keyword. The list is very long, so it's not practical to plot all the labels, but you can hover over the plot to see each entry.

We can analyse this a bit more closely by grouping the terms according to how often each one is used. Entries on the left are used one or two times across the whole corpus, and entries on the right are the most commonly used keywords.

This emphasises how the majority of keywords are only used a few times across the corpus.
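The grouping behind this kind of plot can be sketched as a count of counts: first tally how many publications use each keyword, then tally how many keywords share each usage total. The function name and the sample keyword lists below are hypothetical, and counting a keyword once per publication is an assumption about how the data should be treated.

```python
from collections import Counter

def usage_histogram(keyword_lists):
    """Return (uses per keyword, number of keywords per usage count)."""
    # Count each keyword at most once per publication.
    per_keyword = Counter(k for kws in keyword_lists for k in set(kws))
    # Count how many keywords were used once, twice, and so on.
    histogram = Counter(per_keyword.values())
    return per_keyword, histogram

# Hypothetical sample: one keyword list per publication.
sample = [
    ["emulation", "metadata"],
    ["emulation", "oais"],
    ["emulation", "metadata"],
    ["web archiving"],
]
per_keyword, histogram = usage_histogram(sample)
```

In this sample two keywords are used once, one is used twice and one is used three times, which is the same long-tailed shape in miniature.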

If we plot the number of authors that use a keyword against the number of times it is used in total, the plot is less dominated by the keywords that are only used once, as these almost all overlap. This lets us look for broader trends.
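The two quantities being compared can be derived in a single pass over the publications, as in the minimal sketch below. The record layout (dictionaries with authors and keywords lists), the function name and the sample data are all hypothetical.

```python
from collections import defaultdict

def keyword_author_stats(publications):
    """Map each keyword to (number of distinct authors, number of uses)."""
    uses = defaultdict(int)
    authors = defaultdict(set)
    for pub in publications:
        for kw in set(pub["keywords"]):
            uses[kw] += 1
            authors[kw].update(pub["authors"])
    return {kw: (len(authors[kw]), uses[kw]) for kw in uses}

# Hypothetical sample records.
sample = [
    {"authors": ["Ada", "Ben"], "keywords": ["emulation", "metadata"]},
    {"authors": ["Ada"], "keywords": ["emulation"]},
    {"authors": ["Cleo"], "keywords": ["oais"]},
]
stats = keyword_author_stats(sample)
```

Each value is an (authors, uses) pair, so every keyword becomes one point on the scatterplot, and keywords that share the same pair land on the same spot.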

iPRES Author Network

We can also pull out the authors, and make a graph where the size of each node indicates the number of publications on which that author is listed. We represent co-authorship as links between authors, with the width of the line indicating the number of times they have published together.

This requires quite a lot of manipulation of the raw data, but the network can then be displayed using a slightly modified version of a widely-used network visualisation method.
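A simplified version of that manipulation might look like the sketch below, which turns one author list per publication into node and link records. It is not the code used on this page: the function name, the record fields and the sample names are hypothetical, and it assumes that author names are already normalised.

```python
from collections import Counter
from itertools import combinations

def build_network(author_lists):
    """Build nodes sized by publication count and links weighted by co-authorship."""
    node_sizes = Counter()
    link_weights = Counter()
    for authors in author_lists:
        # Sort so that each pair of authors always appears in the same order.
        unique = sorted(set(authors))
        node_sizes.update(unique)
        link_weights.update(combinations(unique, 2))
    nodes = [{"id": a, "size": n} for a, n in sorted(node_sizes.items())]
    links = [
        {"source": s, "target": t, "weight": w}
        for (s, t), w in sorted(link_weights.items())
    ]
    return nodes, links

# Hypothetical sample: one author list per publication.
sample = [["Ada", "Ben"], ["Ada", "Ben", "Cleo"], ["Cleo"]]
nodes, links = build_network(sample)
```

The size and weight fields correspond to the node size and line width described above, and a network layout can consume the two lists directly.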

There's also a three-dimensional version of this visualisation you can explore, via this blog post.

This isn't easy to interact with if you are on a mobile device. Try using a laptop or desktop computer.
Scroll down and press the Go! button to start the visualisation. When the network is visible, you can hover your mouse pointer over any node inside the dashed border and it will show the author's name and publication count.