Publications
Analysing data on publications related to digital preservation
Introduction
As part of the Registries of Good Practice project we are building an index of publications related to digital preservation. That process involves bringing the index data together as an SQLite database.
This page provides an initial proof-of-concept showing how analysis and visualisation tools can be applied to that database, and used to generate visualisations that help us understand and explore the data. It runs in your browser, downloads the database, and generates graphs based on SQL queries.
iPRES Publication Types Over Time
This graph uses a simple SQL query to group and count all the types of publication over time.
SELECT year, type, COUNT(*) as count FROM publications GROUP BY year, type;
The results can then be presented as a stacked bar chart.
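As a rough illustration of the step from query results to a stacked bar chart, the sketch below runs the same query against a small in-memory SQLite database and pivots the rows into one series per publication type. The sample rows are invented for the example, and the choice of Python is ours; the page itself runs its queries in the browser.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows; the real index holds many more records and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (year INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO publications (year, type) VALUES (?, ?)",
    [
        (2019, "paper"),
        (2019, "paper"),
        (2019, "poster"),
        (2020, "paper"),
        (2020, "workshop"),
    ],
)

# The same grouping query as above.
rows = conn.execute(
    "SELECT year, type, COUNT(*) AS count FROM publications GROUP BY year, type"
).fetchall()

# Pivot to {type: {year: count}} so each type becomes one layer of the stack.
series = defaultdict(dict)
for year, pub_type, count in rows:
    series[pub_type][year] = count

years = sorted({year for year, _, _ in rows})
for pub_type in sorted(series):
    # Year/type combinations with no publications are filled with zero.
    print(pub_type, [series[pub_type].get(y, 0) for y in years])
```

Each printed list is one layer of the stack, with one value per year, which is the shape most charting libraries expect for stacked bars.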
iPRES Keywords Over Time
Here we use a more complex query to unpack the array of keywords associated with each paper, and then count how often each keyword is used each year. We then limit the result set to keywords that are used at least a few times in a given year, to keep the size of the data set manageable and focussed.
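A minimal sketch of what such a query might look like is shown here. It assumes the keywords are stored as a JSON array in a text column called keywords, and that the SQLite build provides the json_each table-valued function; the sample rows and the MIN_USES threshold are invented for illustration and are not taken from the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: keywords held as a JSON array in a text column.
conn.execute("CREATE TABLE publications (year INTEGER, keywords TEXT)")
conn.executemany(
    "INSERT INTO publications (year, keywords) VALUES (?, ?)",
    [
        (2019, '["emulation", "metadata"]'),
        (2019, '["emulation"]'),
        (2019, '["emulation", "web archiving"]'),
        (2020, '["metadata"]'),
    ],
)

MIN_USES = 2  # hypothetical threshold for "at least a few times"

# json_each() yields one row per array element, so each paper is
# repeated once for every keyword it carries before grouping.
query = """
SELECT p.year, k.value AS keyword, COUNT(*) AS count
FROM publications AS p, json_each(p.keywords) AS k
GROUP BY p.year, k.value
HAVING COUNT(*) >= ?
ORDER BY p.year, k.value
"""
keyword_counts = conn.execute(query, (MIN_USES,)).fetchall()
print(keyword_counts)
```

With this sample data only "emulation" in 2019 passes the threshold; the keywords used once in a given year are filtered out by the HAVING clause.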
The data can then be visualised as an ordinal scatterplot, where ongoing usage should show up as horizontal sequences of dots, with the dot size indicating the number of papers using that keyword in that year.
This plot vividly illustrates that publication keywords have been used in different ways over the years.
iPRES Keyword Statistics
First we look at the number of publications for each keyword. The list is very long, so it's not practical to plot all the labels, but you can hover over the chart to see each entry.
We can analyse this a bit more closely by grouping the keywords by how often each one is used. Entries on the left are used only once or twice across the whole corpus, and entries on the right are the most commonly used keywords.
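The grouping described here amounts to counting twice: first how often each keyword is used, then how many keywords share each usage count. The sketch below shows the idea on a handful of invented keyword lists; the variable names are ours.

```python
from collections import Counter

# Hypothetical keyword lists, one per publication.
keyword_lists = [
    ["emulation", "metadata"],
    ["emulation", "OAIS"],
    ["emulation", "metadata", "web archiving"],
    ["file formats"],
]

# First pass: total number of uses of each keyword across the corpus.
uses_per_keyword = Counter(kw for kws in keyword_lists for kw in kws)

# Second pass: number of distinct keywords that share each usage count.
keywords_per_usage = Counter(uses_per_keyword.values())

for uses in sorted(keywords_per_usage):
    print(f"used {uses} time(s): {keywords_per_usage[uses]} keyword(s)")
```

Even in this tiny sample, most keywords fall into the "used once" group, which is the pattern the chart brings out.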
This shows that the majority of keywords are used only a few times across the corpus.
If we plot the number of authors that use a keyword against the number of times it is used in total, the plot is less dominated by the keywords that are only used once, as these almost all overlap. This lets us look for broader trends.
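One way to derive the two values for each keyword is sketched below, assuming each record gives a list of authors and a list of keywords. The records and names are invented for the example.

```python
from collections import defaultdict

# Hypothetical records: (authors, keywords) for each publication.
papers = [
    (["A. Author", "B. Writer"], ["emulation", "metadata"]),
    (["A. Author"], ["emulation"]),
    (["C. Scholar"], ["emulation", "OAIS"]),
]

total_uses = defaultdict(int)     # keyword -> number of papers using it
authors_using = defaultdict(set)  # keyword -> distinct authors using it
for authors, keywords in papers:
    for kw in keywords:
        total_uses[kw] += 1
        authors_using[kw].update(authors)

# One (keyword, author count, use count) point per keyword.
points = sorted(
    (kw, len(authors_using[kw]), total_uses[kw]) for kw in total_uses
)
for kw, n_authors, n_uses in points:
    print(f"{kw}: {n_authors} author(s), {n_uses} use(s)")
```

Using a set for the authors means a person who uses the same keyword in several papers is counted only once on that axis.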
iPRES Author Network
We can also pull out the authors, and make a graph where the size of each node indicates the number of publications on which that author is listed. We represent co-authorship as links between authors, with the width of the line indicating the number of times they have published together.
This requires quite a lot of manipulation of the raw data, but the network can then be displayed using a slightly modified version of a widely-used network visualisation method.
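The core of that manipulation can be sketched as follows: count publications per author for the node sizes, and count each pair of co-authors for the link widths. The author lists are invented, and the nodes/links layout at the end is only an assumption about what a network visualisation might expect as input.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per publication.
author_lists = [
    ["A. Author", "B. Writer"],
    ["A. Author", "B. Writer", "C. Scholar"],
    ["C. Scholar"],
]

node_sizes = Counter()   # author -> number of publications
link_widths = Counter()  # (author, author) -> number of joint publications
for authors in author_lists:
    # Sort so each pair is always keyed in the same order,
    # and de-duplicate in case a name is listed twice on one paper.
    unique = sorted(set(authors))
    node_sizes.update(unique)
    link_widths.update(combinations(unique, 2))

# Assumed output shape: a list of nodes and a list of links.
nodes = [{"id": a, "size": n} for a, n in sorted(node_sizes.items())]
links = [
    {"source": s, "target": t, "width": w}
    for (s, t), w in sorted(link_widths.items())
]
print(nodes)
print(links)
```

In practice the harder part is likely to be reconciling variant spellings of the same author's name, which this sketch does not attempt.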
There's also a three-dimensional version of this visualisation you can explore, via this blog post.