Archives Hub and VIAF Name Matching

We have recently been reprocessing the Archives Hub data, transforming it into RDF-based Linked Data, and as part of this we have been working on names matching. For Linked Data, creating links to external data sources is key – it is what defines Linked Data and potentially gives researchers the opportunity to explore topics across data sources.

This names matching work has big implications for archives. I have already talked extensively in the Hub Blog about the importance of structured data, which is more effectively machine processable. For archival descriptions, we have a huge opportunity to link to all sorts of useful data sources, and one of the key means to link our data is through personal names. To do this effectively, we need names to be structured, and this is one of the reasons why the Hub practice of structuring names by separating out surname, forename, dates, titles and descriptive information (epithets) is so useful. We do this structuring even though EAD (the recognised XML standard for archives) doesn’t actually allow for it. We took the decision that the advantages would outweigh the disadvantages of a non-standard approach (and we can export the data without this additional markup, so really there is no disadvantage).

We have been working on the matching, using the freely available Open Refine data processing tool with the VIAF reconciliation service developed by Roderick Page. Freely available tools like this are so important for projects like ours, and we’re really grateful that we were able to take advantage of this service.

The matching has generally been very successful. Out of 5,076 names, just over 2,000 were linked from the Hub entry to the VIAF entry, which is roughly 40% of the total.

This post provides some perspectives on the nature of the data and the results of the matching work.

Full names and epithets

With a name like ‘Bell, Sir Charles, 1774-1842, knight surgeon’, (you can see his entry in our current Linked Data views at http://data.archiveshub.ac.uk/id/person/ncarules/bellsircharles1774-1842knightsurgeon) there is plenty of information – surname, forename, dates and an epithet to help uniquely identify the individual. However, with this name, a match was not found, despite an entry on VIAF: http://viaf.org/viaf/2619993 (which is why you may not yet see the VIAF link on our Linked Data view). Normally, this type of name would yield a match. The reason it didn’t is that the epithet came through in the data we used for matching.

Screenshot of names matching using Open Refine

This highlights an issue with the use of epithets within names. It is encouraged in the NCA Rules, and it does help to uniquely identify an individual, but it introduces an additional element in the string that makes it harder to match the data.

Where our process did not manage to get the family name, forename and dates to match with VIAF, we used the ‘label’ information that we have in our Linked Data. This label information includes the epithet. For example: Nosek, Václav, 1892-1955, Czechoslovak politician. This doesn’t tend to find a match, because of the epithet. With examples like this we can manually check, and in this case there is a VIAF match (http://viaf.org/viaf/23683886). But manual checking is problematic where you have thousands of names.

In 95% of cases we did manage to omit the epithet. But sometimes the epithet was included, either because we used the label, as stated, or because the markup on the Archives Hub is not always consistent: the structured names I referred to above are sometimes absent from Hub data that has come from other systems. (We might have found a way to remove these stray epithets, but it would have taken a good deal more time and effort to achieve.)
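As a rough illustration of the kind of filtering described above, here is a minimal sketch that cuts a label off after the life dates. It assumes labels follow the ‘surname, forename, dates, epithet’ pattern; the function name and the regular expression are hypothetical and are not the script we actually used.

```python
import re

# Hypothetical pattern for life dates such as "1892-1955".
DATES = re.compile(r"\b\d{4}-\d{4}\b")


def strip_epithet(label: str) -> str:
    """Return the label truncated after the life dates, dropping any epithet."""
    match = DATES.search(label)
    if match is None:
        # No life dates found: we cannot tell where the name ends, so leave it alone.
        return label
    return label[: match.end()]


print(strip_epithet("Nosek, Václav, 1892-1955, Czechoslovak politician"))
# Nosek, Václav, 1892-1955
```

A label without life dates is returned unchanged, which is exactly the case where stray epithets would still get through.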

Bringing together information on an individual

The reference to Sir Charles Bell came from a collection of “Papers of Sir Charles Bell” (http://archiveshub.ac.uk/data/gb96-ms386). In this description his occupation is “surgeon”. In the VIAF description (http://viaf.org/viaf/2619993) he is described as “Scottish painter, draftsman, and engraver”. Ostensibly this doesn’t look like the same person, but looking down the VIAF description, you can see titles such as “The nervous system of the human body” and other works that are clearly written by a scientist. The linking of our description with the VIAF description brings together Sir Charles Bell scientist and Sir Charles Bell painter, a good illustration of how linking provides a better perspective, as the different data sources effectively become joined up.

Pulling sparse sources together

For Francis Campbell Ross Douglas VIAF only has the surname and forename (http://viaf.org/viaf/211588539/), although if you look at the source records you also find “Douglas Of Barloch” to help with identification. This is an example where the Hub record has much more information (http://archiveshub.ac.uk/data/gb097-douglasofbarloch), and therefore creating the link is particularly useful. It shows how archives can help contribute to our knowledge of individuals within the Linked Data space, as they often have little known information, gleaned from the archives themselves.

Hyphenated names

From the Hub description http://archiveshub.ac.uk/data/gb1538-s97 comes the name William Blair-Bell. The name with encoding (slightly simplified) is:

<persname>
<surname>Bell</surname>,
<forename>William Blair-</forename>
(<dates>1871-1936</dates>)
<epithet>British gynaecologist and obstetrician</epithet>
</persname>

This is an example of the application of the NCA Rules, which insist on the last entry element as the main element, so it means the element ‘Bell’ is marked up as the surname. In fact, the matching still works because, with all the elements there, the reconciliation service can still find the right person (http://viaf.org/viaf/14336292/). However, it still concerns me that within the archive sector we have a rule that separates out the surname in this way, as it makes the name non-standard compared to other data sources. It is interesting to note that the name is generally given as Blair-Bell, but the Library of Congress enters the name as Bell, W. Blair (William Blair), 1871-1936 (http://id.loc.gov/authorities/names/no92003069.html), so there is an inconsistency in how different services deal with hyphenated and compound surnames. It could be argued that once we have a match, the different formats matter less, as they are simply alternatives that can be used to identify the individual.
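To show why the structured markup helps, here is a small sketch that builds a matching string from the encoded name above, using only the surname, forename and dates and leaving the epithet out. The function name and the output format are assumptions made for illustration; this is not the Hub's own processing code.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<persname>
<surname>Bell</surname>,
<forename>William Blair-</forename>
(<dates>1871-1936</dates>)
<epithet>British gynaecologist and obstetrician</epithet>
</persname>"""


def match_string(persname_xml: str) -> str:
    """Join surname, forename and dates, ignoring the epithet."""
    root = ET.fromstring(persname_xml)
    parts = []
    for tag in ("surname", "forename", "dates"):
        text = root.findtext(tag)
        if text and text.strip():
            parts.append(text.strip())
    return ", ".join(parts)


print(match_string(SAMPLE))
# Bell, William Blair-, 1871-1936
```

Note that the sketch simply passes on whatever the markup says, so ‘Bell’ is still sent as the surname, as the NCA Rules require.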

Hub names without structured markup

As stated, in the Hub names are marked up by surname, forename, dates, epithet, titles. However, there are still some entries that are not marked up like this, usually because they were created in proprietary software and exported. An example is Carlyon Bellairs (referenced in http://archiveshub.ac.uk/data/gb097-assoc17). The name is marked up as:

<persname>Bellairs, Carlyon, 1871-1955, RN Commander, politician</persname>

You can see the XML markup at http://archiveshub.ac.uk/data/gb097-assoc17.xml?hub. We have been working on a script to mark up the component parts of these names in the Hub, and we have been able to implement it successfully for several institutions. But it is not easy to do this with non-standard names (i.e. not in the surname, forename, dates, epithet format). We do have some instances of names such as the British Prime Minister, James Callaghan, or the author Rudyard Kipling, that are not yet marked up in this way. These individuals should be easy to match, but without the structure within the index term, it is harder for us to ensure that we can get just the name and dates from an unstructured name to match with VIAF.

It is also impossible to implement structured markup on a name where there is a compound surname entered according to NCA Rules – we simply cannot mark these names up correctly because we have no way of knowing whether part of the forename is actually part of the surname. For example, if we have the name “George, David Lloyd” we can’t write a script that can transform this into “Lloyd George, David” because most of the time a name like this will be two forenames and one surname.
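The following is a minimal sketch of how an unstructured name might be split into its parts, assuming it follows the ‘surname, forename, dates, epithet’ order exactly. The function and its return format are hypothetical, and it is not the script referred to above.

```python
import re
from typing import Optional

# Hypothetical pattern: a component consisting only of life dates.
DATES = re.compile(r"^\d{4}-\d{4}$")


def split_name(text: str) -> Optional[dict]:
    """Split a 'surname, forename, dates, epithet' string into components.

    Returns None when the string does not fit that pattern.
    """
    parts = [p.strip() for p in text.split(",")]
    # Find the position of the life dates, if any.
    idx = next((i for i, p in enumerate(parts) if DATES.match(p)), None)
    if idx != 2:
        # Dates missing, or not preceded by exactly surname and forename.
        return None
    return {
        "surname": parts[0],
        "forename": parts[1],
        "dates": parts[2],
        "epithet": ", ".join(parts[3:]),
    }


parsed = split_name("Bellairs, Carlyon, 1871-1955, RN Commander, politician")
print(parsed["epithet"])
# RN Commander, politician
```

The sketch takes the first element as the surname without question, so it has the limitation described above: it has no way of telling that ‘Lloyd’ in a name like “George, David Lloyd” belongs with the surname.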

The importance of life dates and the use of ‘Is Like’

If we don’t have life dates, it makes matching with certainty almost impossible. Of course, cataloguers can’t always find life dates for a person, but it is worth stressing that the need for life dates has become even more important in recent years, now that we have the potential to process data in so many ways. An example is at http://archiveshub.ac.uk/data/gb532-bel – Joyce Margaret Bellamy, a Senior Research Officer at the University of Hull. As we don’t have a birth date, we did not get a match with her VIAF entry at http://viaf.org/viaf/94773174. If we have this kind of entry, without life dates, we could potentially decide to use a different status from an exact match (which usually uses the owl:sameAs property), and for example, we could use the ‘isLike’ property from the Umbel vocabulary instead. This would be useful where we believe the two names to be referring to the same person, but this type of matching has to be done manually (although potentially we could run something where a name match without a date match was always an ‘isLike’). In the process of checking the 2,000 matches for our data we did enter a number of matches manually, and the whole process of checking took around 5 hours. Not too bad for 2,000 names, and with some time also given to thinking about the results (and making notes for this post!). But if we were to work on the entire Archives Hub data, we couldn’t undertake to do this kind of manual work unless we just had a few thousand ‘not sure’ names that we might be prepared to work through.
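The rule suggested in parentheses above (a name match without a date match always becomes an ‘isLike’) could be sketched as follows. This is only an illustration: the function names are hypothetical, and the Umbel namespace URI is written from memory and should be checked against the vocabulary itself.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Assumed URI for the Umbel 'isLike' property; verify before use.
UMBEL_IS_LIKE = "http://umbel.org/umbel#isLike"


def link_predicate(name_matches: bool, dates_match: bool):
    """Choose the linking property, or None when no link should be made."""
    if not name_matches:
        return None
    return OWL_SAME_AS if dates_match else UMBEL_IS_LIKE


def link_triple(hub_uri: str, viaf_uri: str, name_matches: bool, dates_match: bool):
    """Return an N-Triples statement for the link, or None if there is no link."""
    predicate = link_predicate(name_matches, dates_match)
    if predicate is None:
        return None
    return f"<{hub_uri}> <{predicate}> <{viaf_uri}> ."


print(link_triple(
    "http://data.archiveshub.ac.uk/id/person/ncarules/bellsircharles1774-1842knightsurgeon",
    "http://viaf.org/viaf/2619993",
    name_matches=True,
    dates_match=True,
))
```

A rule like this would still need manual checking for the ‘isLike’ cases, but it would at least separate the confident matches from the ‘not sure’ ones.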

Matches without life dates

We do get matches to VIAF where we don’t have dates. We got a match for ‘Hilda Chamberlain’ with VIAF entry http://viaf.org/viaf/286538995/. This seems to be correct, as she is the daughter of Joseph Chamberlain, so we kept the match. But we had to check it manually. Another example is Hercules Ross – http://viaf.org/viaf/21209582/ – matched to the name in description http://archiveshub.ac.uk/data/gb254-ms17. But in this case we don’t really have enough evidence to identify the individual, even though the surname and forename match. The source of the name on VIAF is “Guild, J. Proceedings before the sheriff depute of Forfarshire … against Hercules Ross and David Scott, Esquires, 1809”, but the title deeds described in the Archives Hub cover the sixteenth to the nineteenth century!

With a name like Gustav Wilhelm Wolff (http://archiveshub.ac.uk/data/gb738-ms174), again we only have the name and not the life dates. The match given is for someone born in 1811 (http://viaf.org/viaf/8221966/), and the papers relate to Victorian Jews in Britain. This makes the match likely, but we can’t be sure without dates, so we could potentially enter an ‘is like’, to imply that they are the same person, but that we cannot be certain.

Floruit!

We had a number of individuals without known life dates where the cataloguer used a ‘floruit’, e.g. Sharman W. fl 1884 (Secretary of National Association for the Repeal of the Blasphemy Laws). This sort of entry, whilst it may be the total of the information the archivist has, is difficult to use to identify someone in order to match them. However, the majority of individuals with this kind of entry are not likely to be on VIAF simply because a floruit normally indicates someone for whom life dates cannot be found. It would be interesting to consider a tool that matches floruit dates to possible life dates (e.g. fl 1900-1910 would match to life dates of 1880-1945) but I’m not sure how much it would add to the accuracy of a match.
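A tool of that kind might start from a check like the one sketched below. The age limits are pure assumptions chosen for illustration (active no younger than 20 and no older than 90), and the function name is hypothetical.

```python
def floruit_compatible(fl_start: int, fl_end: int, birth: int, death: int,
                       min_age: int = 20, max_age: int = 90) -> bool:
    """True if the life dates could plausibly belong to someone active in the floruit period."""
    return (
        birth + min_age <= fl_start        # old enough when the activity starts
        and fl_end <= death                # still alive when the activity ends
        and fl_end - birth <= max_age      # not implausibly old by the end
    )


print(floruit_compatible(1900, 1910, birth=1880, death=1945))
# True
print(floruit_compatible(1900, 1910, birth=1895, death=1960))
# False
```

A check like this can only rule candidates out; many different people will have life dates compatible with the same floruit, which is why it may not add much to the accuracy of a match.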

Alternative names

The reconciliation service often works where VIAF provides names that are not ‘the same’ as our name. So, for example, the Hub data may have the name ‘Orton, John Kingsley, 1933-1967’. This was linked to Joe Orton (http://viaf.org/viaf/22163951), and within the VIAF data you can see that Joe Orton is also known as John Kingsley Orton.

Fame does not always give identity

Sometimes very famous people prove problematic, and an example is someone like Queen Victoria, because the name doesn’t include a surname and people tend to enter it in various ways. There were a few examples of this type of thing in our data, although most royal names matched with no problem. It always helps if it is easier to structure a name, but kings, queens, popes, etc. are non-standard.

Some Hub names are quite fulsome, such as “Edward Albert Christian George Andrew Patrick David, 1894-1972, Duke of Windsor, formerly Edward VIII, King of Great Britain and Ireland”. This should link to VIAF http://viaf.org/viaf/47553571 (Windsor, Edward, Duke of, 1894-1972), but the match was not given due to the lack of similarity.

Accented characters may cause problems

We didn’t get a match on Jeremy Bentham, despite having the full structured name, but this may be because the VIAF match has an accent: http://viaf.org/viaf/59078842/. We could possibly have stripped out accents in our data, but in this case the accent was in the VIAF data. I only found one example where this was a problem, but clearly many names do contain accented characters.
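For reference, stripping accents is straightforward with Python's standard library. The sketch below is an illustration rather than part of our process, and it would only help if the same folding were applied to both sides of the comparison, which we could not do for the VIAF data here.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Remove combining accent marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Nosek, Václav, 1892-1955"))
# Nosek, Vaclav, 1892-1955
```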

Matches sometimes surprise…

A particularly nice match came up for “Mary-Teresa Craigie Pearl 1867-1906 novelist, dramatist and journalist as John Oliver Hobbes nee Richards”. A complex string, but the algorithm matched the basic elements that we provided (Craigie Pearl, Mary-Teresa, 1867-1906) to the name ‘John Oliver Hobbes’ on VIAF.

Mismatches

Leonard Wright, a Lieutenant (http://archiveshub.ac.uk/data/gb99-kclmawrightlw) matched to Clara Colby (http://viaf.org/viaf/63445035/), also known as Mrs Leonard Wright Colby. Here is an example of an incorrect match due to the same name, but in VIAF the person is a ‘Mrs’ (due to the old-fashioned practice of using the husband’s name). The reason for the match seems to be that the name on the Hub includes a floruit (Leonard Wright, fl 1916) which matches the death date of Mrs Leonard Wright (Leonard Wright, Mrs, d 1916).

On the Hub we have an example of an archive that includes “a letter from Charlotte Bronte to Elizabeth Firth”, and the name is simply given as Elizabeth Firth in the index. The match to VIAF was for Mrs J.F.B Firth (http://viaf.org/viaf/71217693/). In this case the match is wrong, as we can see from the Hub description that Elizabeth Firth is actually “Mrs. James Clarke Franks”, and the dates within the additional information don’t seem to match.

There were very few examples of this type of mismatch, but it shows why well-structured data, with life dates, helps to minimize any incorrect matches.

Incorrect Suggestions

In the names that did not find definite matches (i.e. outside of the 2,000 matches), there were a few examples of suggested names that did not bear much resemblance to the text provided. One example of this was for “Bell, Vanessa, 1879-1961”. The suggestions for ‘sameAs’ names to link to this individual were Stephen, Julia Prinsep British model, 1846-1895; Woolf, Virginia, 1882-1941; Stephen, Leslie, 1832-1904. In fact, VIAF does have Vanessa Bell (http://viaf.org/viaf/7399364), and the link appears to be that the names are related within VIAF (i.e. VIAF establishes that there is an association between these people). However, these were only suggestions; they were not given as matches.

Conclusions

If there was no match given, but we can see that the name and dates have gone to VIAF, then we would assume there simply is no match and VIAF does not have anyone with our surname, forename and dates. But if we can see an epithet has also been included in the data we have provided, then there may well be a match because the epithet can be problematic for finding a match. Our intention would be to continue to improve our filtering to try to remove all epithets, but if the names are not properly structured this can be difficult.

When actually checking data like this, one thing that really comes to the fore is the risk of a ‘sameAs’ where the individual is not the same, and this is a particular risk where you are dealing with a notorious character – maybe a criminal. A number of war criminals are referred to in the Hub data, and it would be very unwise to link these to the wrong person – this is why it is best to only provide matches where the life dates match, but it is not impossible to have the same name with the same life dates of course.

In conclusion I would say that wherever our names have life dates, and these can be successfully carried over to the matching process, the likelihood of a correct match is 99%, but there is always a risk of a mismatch. Clearly the main problem would lie with two people sharing a name and life dates, and the chances of this happening will increase if we only have a birth date or a death date.

Jisc Linking Lives project at Mimas: Jane Stevenson, Adrian Stevenson, Lee Baylis

Whose Data Is It?: a Linked Data perspective

A comment on the blog post announcing the release of the Hub Linked Data maybe sums up what many archivists will think: “the main thing that struck me is that the data is very much for someone else (like a developer) rather than for an archivist. It is both ‘our data’ and not our data at the same time.”

Interfaces to the data

Archives Hub search interface

In many ways, Linked Data provides the same advantages as other machine-based ways into the data. It gives you the ability to access data in a more unfiltered way. If you think about a standard Web interface search, what it does is to provide controlled ways into the data, and we present the data in a certain way. A user comes to a site, sees a keyword search box and enters a term, such as ‘antarctic exploration’. They have certain expectations of what they will get – some kind of list of results that are relevant to Antarctica and famous explorers and expeditions – and yet they may not think much about the process – will all records that have any/either/both of these terms be returned, for example? Will the results be comprehensive? Might there be more effective ways to search for what they want? As those who run systems, we have to decide what a search is going to give the user. Will we look for these terms as adjacent terms and single terms? Will we return results from any field? How will we rank the results? We recently revised the relevance ranking on the Hub because although it was ‘pragmatically’ correct, it did not reflect what users expect to see. If a user enters ‘sir john franklin’ (with or without quotation marks) they would expect the Sir John Franklin Papers to come up first. This was not happening with the previous relevance ranking. The point here is that we (the service providers) decide – we have control over what the search returns and how it is displayed, and we do our best to provide something that will work for users.

Similarly, we decide how to display the results. We provide as a basis collection descriptions, maybe with lower-level entries, but the user cannot display information in different ways. The collection remains the indivisible unit.

With a Web interface we are providing (we hope) a user-friendly way to search for descriptions of archives – one that does not require prior knowledge. We know that users like a straightforward keyword search, as well as options for more advanced searching. We hide all of the mechanics of running the search and don’t really inform the user exactly what their search is doing in any kind of technical sense. When a user searches for a subject in the advanced subject search, they will expect to get all descriptions relating to that subject, but that is not necessarily what they will get. The reason is that the subject search looks for terms within the subject field. The creator of the description must put the subject in as an index term. In addition, the creator of the description may have entered a different term for the subject – say ‘drugs’ instead of ‘medicines’. The Archives Hub has a ‘subject finder’ that returns results for similar terms, so it would find both of these entries. However, maybe the case of the subject finder makes a good point about searching: it provides a really useful way to find results but it is quite hard to convey what it does quickly and obviously. It has never been widely used, even though evidence shows that users often want to search by subject, and by entering the subject as a keyword, they are more likely to get less relevant results.

These are all examples of how we, as service providers, look to find ways to make the data searchable in ways that we think users want and try to convey the search options effectively. But it does give a sense that they are coming into our world, searching ‘our data’, because we control how they can search and what they see.

Linked Data is a different way of formatting data that is based upon a model of the entities in the data and relationships between them. To read more about the basics of Linked Data take a look at some of the earlier posts on the Locah blog (http://blogs.ukoln.ac.uk/locah/2010/08/).

Providing machine interfaces gives a number of benefits. However, I want to refer to two types of ‘user’ here: the ‘intermediate user’ and the ‘end user’. The intermediate user is the one that gets the data and creates the new ways of searching and accessing the data. Typically, this may be a developer working with the archivist. But as tools are developed to facilitate this kind of work, it should become easier to work with the data in this way. The end user is the person who actually wants to use the data.

1) Data is made available to be selected and used in different ways

We want to provide the ability for the data to be queried in different ways and for users to get results that are not necessarily based upon the collection description. For example, the intermediate user could select only data that relates to a particular theme, because they are representing end users who are interested in combining that data with other sources on the same theme. The combined data can be displayed to end users in ways that work for a particular community or particular scenario.

The display within a service like the Hub is for the most part unchanging, providing consistency, and it generally does the job. We, of course, make changes and enhancements to improve the service based on user needs from time to time, but we’re still essentially catering for one generic user as best we can. However, we want to provide the potential to allow users to display data in their own way for their own purposes. Linked Data encourages this. There are other ways to make this possible of course, and we have an SRU interface that is being used by the Genesis portal for Women’s Studies. The important point is that we provide the potential for these kinds of innovations.

2) External links begin the process of interconnecting data

Machine interfaces provide flexible ways into the data, but I think that one of the main selling points of Linked Data is, well, linking data. To do this with the Hub data, we have put some links in to external datasets. I will be blogging about the process of linking to VIAF names (Virtual International Authority File), but suffice to say that if we can make the statement within our data that ‘Sir Ernest Shackleton’ on the Hub is the same as ‘Sir Ernest Shackleton’ on VIAF then we can benefit from anything that VIAF links to, for example DBpedia (Wikipedia output as Linked Data). A user (or intermediate user) can potentially bring together information on Sir Ernest Shackleton from a wide range of sources. This provides a means to make data interconnected and bring people through to archives via a myriad of starting points.

3) Shared vocabularies provide common semantics

If we identify the title of a collection by using Dublin Core, then it shows that we mean the same thing by ‘title’ as others who use the Dublin Core title element. If we identify ‘English’ by using a commonly recognised URI (identifier) for English, from a common vocabulary (lexvo), then it shows that we mean the same thing as all the other datasets that use this vocabulary. The use of common vocabularies provides impetus towards more interoperability – again, connecting data more effectively. This brings the data out of the archival domain (where we share standards and terminology amongst our own community) and into a more global space. It provides the potential for intermediate users to understand more about what our data is saying in order to provide services for end users. For example, they can create a cross-search of other data that includes titles, dates, extent, creator, etc. and have reasonable confidence that the cross-search will work because they are identifying the same type of content.
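To make this concrete, the sketch below writes two statements about a collection using shared vocabularies. It is illustrative only: the subject URI is the Hub description page rather than a real Linked Data identifier, and the choice of the Dublin Core terms namespace for title and language is an assumption rather than a statement of what the Hub data actually uses.

```python
# Assumed vocabulary URIs, used here for illustration.
DCTERMS = "http://purl.org/dc/terms/"
LEXVO_ENGLISH = "http://lexvo.org/id/iso639-3/eng"


def describe(subject: str, title: str, language_uri: str) -> list:
    """Return N-Triples statements giving a title and a language for the subject.

    The title is inserted as-is, so it must not contain quotes or backslashes.
    """
    return [
        f'<{subject}> <{DCTERMS}title> "{title}" .',
        f"<{subject}> <{DCTERMS}language> <{language_uri}> .",
    ]


for line in describe(
    "http://archiveshub.ac.uk/data/gb96-ms386",  # stand-in subject URI
    "Papers of Sir Charles Bell",
    LEXVO_ENGLISH,
):
    print(line)
```

Any other dataset that uses the same property and language URIs can be cross-searched with these statements without needing to know anything about archival standards.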

For the Hub there are certain entities where we have had to create our own vocabulary, because those in existence do not define what we need, but then there is the potential for other datasets to use the same terms that we use.

4) URIs are provided for all entities

For Linked Data one of the key rules is that entities are identified with HTTP URIs. This means that names, places, subjects, repositories, etc. within the Hub data are now brought to the fore through having their own identifier – all the individuals, for example, within the index terms, have their own URI. This allows the potential to link from the person identified on the Hub to the same person identified in other datasets.

Who is the user?

So far so good. But I think that whilst in theory Linked Data does bring significant benefits, maybe there is a need to explain the limitations of where we are currently at.

Hub Sparql endpoint

Our Linked Data cannot currently be accessed via a human-friendly Web-based search interface; it can however be accessed via a Sparql endpoint. Sparql is the language for querying RDF, the format used for Linked Data. It has many similarities with SQL, a language typically used for querying conventional relational databases that are the basis of many online services. (Our Sparql endpoint is at http://data.archiveshub.ac.uk/sparql ). What this means is that if you can write Sparql queries then you’re up and running. Most end users can’t, so they will not be able to pull out the data in this way. Even once you’ve got the data, then what? Most people wouldn’t know what to do with RDF output. In the main, therefore, fully utilising the data requires technical ability – it requires intermediate users to work with the data and create tools and services for end users.
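For anyone who has not seen one, a Sparql query and the request that carries it look something like the sketch below. It assumes the endpoint accepts the standard ‘query’ parameter on a GET request, and it only builds the URL rather than sending anything; the query itself is a generic one that asks for any ten statements.

```python
from urllib.parse import urlencode

ENDPOINT = "http://data.archiveshub.ac.uk/sparql"

# A generic query: return any ten subject/predicate/object statements.
QUERY = """SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10"""


def sparql_url(endpoint: str, query: str) -> str:
    """Build a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


print(sparql_url(ENDPOINT, QUERY))
```

Fetching that URL (for example with urllib.request.urlopen) should return the results, but writing a useful query requires knowing how the data is modelled, which is exactly the barrier for most end users.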

For the Hub we have provided Linked Data views, but it is important not to misunderstand the role of these views – they are not any kind of definitive presentation, they are simply a means to show what the data consists of, and the user can then access that data as RDF/XML, JSON or Turtle (i.e. in a number of formats). It’s a human friendly view on the Linked Data if you access a Hub entity web address via a web browser. If however, you are a machine wanting machine readable RDF visiting the very same URI, you would get the RDF view straight off. This is not to say that it wouldn’t be possible to provide all sorts of search interfaces onto the data – but this is not really the point of it for us at the moment – the point is to allow other people to have the potential to do what they want to do.
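The way a machine asks for RDF at the same URI is through the HTTP Accept header. The sketch below prepares such a request without sending it; the function name is hypothetical, and the exact media types the server responds to are an assumption.

```python
from urllib.request import Request


def rdf_request(uri: str, mime: str = "text/turtle") -> Request:
    """Prepare a request that asks for an RDF serialisation instead of HTML."""
    return Request(uri, headers={"Accept": mime})


req = rdf_request(
    "http://data.archiveshub.ac.uk/id/person/ncarules/bellsircharles1774-1842knightsurgeon"
)
print(req.get_header("Accept"))
# text/turtle
```

Passing the prepared request to urllib.request.urlopen would then fetch the machine readable view, whereas a browser visiting the same address would be shown the human friendly page.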

The realisation of the user benefit has always been the biggest question mark for me over Linked Data – not so much the potential benefits, as the way people perceive the benefits and the confidence that they can be realised. We cannot all go off and create cool visualisations (e.g. http://www.simile-widgets.org/timeline/). However, it is important to put this into perspective. The Hub data at Mimas sits in directories as EAD XML. Most users wouldn’t find that very useful. We provide an interface that enables users with no technical knowledge to access the data, but we control this and it only provides access to our dataset and to a collection-based view. In order to step beyond this and allow users to access the data in different ways, we necessarily need to output it in a way that provides this potential, but there is likely to be a lag before tools and services come along that take advantage of this. In other words, what we are essentially doing is unlocking more potential, but we are not necessarily working with that potential ourselves – we are simply putting it out there for others.

Having said that, I do think that it is really important for us to now look to demonstrate the benefits of Linked Data for our service more clearly by providing some ways into the Linked Data that take advantage of the flexible nature of the data and the external links – something that ‘ordinary’ users can benefit from. We are looking to work on some visualisations that do demonstrate some of the potential. There does seem to be an increasing consensus within cultural heritage that primary resources are too severed from the process of research – we have a universe of unrelated bits that hint at what is possible but do not allow it to be realised. Linked Data is attempting to resolve this, so it’s worth putting some time and effort into exploring what it can do.

We want our data to be available so that anyone can use it as they want. It may be true that the best thing done with the data will be thought of by someone else (see Paul Walk’s blog post for a view on this).

However, this is problematic when trying to measure impact, and if we want to understand the benefits of Linked Data we could do with a way to measure them. Certainly, we can continue to work to realise benefits by actively working with the Linked Data community and encouraging a more constructive and effective relationship between developers and managers. It seems to me that things like Linked Data require us to encourage developers to innovate and experiment with the data, enabling users to realise its benefits by taking full advantage of the global interconnectivity that is the vision of the Linked Data Web. This is the aim of UKOLN’s Dev CSI project – something I think we should be encouraging within our domain.

So, coming back to the starting point of this blog: The data maybe starts off as ‘our data’ but really we do indeed want it to be everyone’s data. A pick ‘n pix environment to suit every information need.

Flickr: davidlocke's photostream