In July I finished my Bachelor’s Degree in IT Security at the University of Applied Sciences in St. Poelten. During the studies I did some elective courses, one of which was about Data Analysis using Python, Pandas and Jupyter Notebooks. I found it very interesting to do calculations on different data sets and to visualize them. Towards the end of the Bachelor I had to find a topic for my Bachelor Thesis and as a long time user of OpenPGP I thought it would be interesting to do an analysis of the collection of OpenPGP keys that are available on the keyservers of the SKS keyserver network.
So in June 2019 I fetched a copy of one of the key dumps of the one of the
keyservers (some keyserver publish these copies of their key database so people
who want to join the SKS keyserver network can do an initial import). At that
time the copy of the key database contained 5,499,675 keys and was around 12GB.
Using the hockeypuck keyserver
software I imported the
keys into an PostgreSQL database.
Hockeypuck uses a table called keys
to store the keys and in there the column
doc
stores the OpenPGP keys in JSON format (always with a data field
containing the original unparsed data).
For the thesis I split the analysis in three parts, first looking at the Public Key packets, then analysing the User ID packets and finally studying the Signature Packets. To analyse the respective packets I used SQL to export the data to CSV files and then used the pandas read_csv method to create a dataframe of the values. In a couple of cases I did some parsing before converting to a DataFrame to make the analysis step faster. The parsing was done using the pgpdump python library.
Together with my advisor I decided to submit the thesis for a journal, so we revised and compressed the whole paper and the outcome was now
in the Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications (JoWUA).
I think the work gives some valuable insight in the development of the use of OpenPGP in the last 30 years. Looking at the public key packets we were able to compare the different public key algorithms and for example visualize how DSA was the most used algorithm until around 2010 when it was replaced by RSA. When looking at the less used algorithms a trend towards ECC based crytography is visible.
What we also noticed was an increase of RSA keys with algorithm ID 3 (RSA Sign-Only), which are deprecated. When we took a deeper look at those keys we realized that most of those keys used a specific User ID string in the User ID packets which allowed us to attribute those keys to two software projects both using the Bouncy Castle Java Cryptographic API (resp. the Spongy Castle version for Android). We also stumbled over a tutorial on how to create RSA keys with Bouncycastle which also describes how to create RSA keys with code that produces RSA Sign-Only keys. In one of those projects, this was then fixed.
By looking at the User ID packets we did some statistics about the most used
email providers used by OpenPGP users. One domain stood out, because it is not
the domain of an email provider: tellfinder.com
is a domain used in around
45,000 keys. Tellfinder is a
Big Data analysis software and the UID of all but two of those keys is
TellFinder Page Archiver- Signing Key <support@tellfinder.com>
.
We also looked at the comments used in OpenPGP User ID fields. In 2013 Daniel Kahn Gillmor published a blog post titled OpenPGP User ID Comments considered harmful in which he pointed out that most of the comments in the User ID field of OpenPGP keys are duplicating information that is already present somewhere in the User ID or the key itself. In our dataset 3,133 comments were exactly the same as the name, 3,346 were the same as the domain and 18,246 comments were similar to the local part of the email address
Last but not least we looked at the signature subpackets and the development of some of the preferences (Preferred Symmetric Algorithm, Preferred Hash Algorithm) that are being published using signature packets.
Analysing this huge dataset of cryptographic keys of the last 20 to 30 years was very interesting and I learned a lot about the history of PGP resp. OpenPGP and the evolution of cryptography overall. I think it would be interesting to look at even more properties of OpenPGP keys and I also think it would be valuable for the OpenPGP ecosystem if these kinds analysis could be done regularly. An approach like Tor Metrics could lead to interesting findings and could also help to back decisions regarding future developments of the OpenPGP standard.
debian openpgp