28 Mar 2017

Formulating Ambiguity in a Database

CORE Admin

One of the most obvious questions to start with when working with structured data in the humanities is: what is data? Miriam Posner has captured this challenge in the title of her talk on this topic: 'Humanities Data: A Necessary Contradiction'. Oftentimes, scholars think about their research material in terms of nuances, vagueness, uniqueness, whereas data is perceived as binary, strict and repetitive. The realisation that nuances, vagueness, and uniqueness can also be captured by data in a database is something that has to grow over time.

As soon as we start to talk about 'data', it is important to keep two things in mind. First, we should be ready to reflect on the fact that data-oriented processes can dehumanise data. This process has been described by Scott Weingart in his essay on digitising and storing Holocaust survivor stories. Even though we can efficiently organise large collections of data, the implications of this process have to be taken into account.

Second, working with a digital tool does not mean that you can only work with binary oppositions or uncontested timestamps. On the contrary: by creating a good data model, you are able to include all the nuances, irregularities, contradictions and vagueness in your database. A good data model is capable of making these insights and observations explicit. Instead of smoothing out irregularities in the data by simplifying the data model, the model should be adjusted so that it reflects the existing complexities, vagueness, and uncertainties. As Katie Rawson and Trevor Muñoz have stated in their essay 'Against Data Cleaning', scholars should "see the messiness of data not as a block to scalability but as a vital feature of the world which our data represents and from which it emerges."

In this blog post we aim to share a couple of insights we gained over the past years that may be of help when you start using a database:

How to determine the scope of your research?
How to reference entries in a dataset and how to deal with conflicting sources?
How to deal with unknown/uncertain primary source material?
How to deal with unique/specific objects in a table/type?
How to use/import 'structured' data?

How to determine the scope of your research?

Determining the scope of your research is a part of every research process, be it digital or not. Besides the obvious question, 'What is my research trying to answer?', the digital perspective adds a new challenge: the trap of completeness. This is not about the appreciation that a research process is never finished because new source material keeps popping up, but about the more banal practice of trying to make your lists complete.

This process can be illustrated by the example of the project on epistolary networks in our previous blog post 'What is a Relational Database?'. We showed that it is helpful to create a separate table for the people who sent and received letters. A natural thing to do here is to add some biographical information, e.g. the date of birth, date of death and possibly also the place of birth and place of death. While this information will be easy to find for the most important figures in your network, it might be harder to find for the people in the periphery of the network. Before you know it, you spend most of your time reading biographical works in order to find information on people who are of little value to the central questions of your research. Even though it's a good thing that you now have this information at your disposal in a structured format, sometimes you have to take a step back and ask yourself if and how you will be using the information that you are gathering.

Some of these issues can be taken into account while you are developing your data model. Of course your data model can be very complex, so that it answers your research questions in the best way possible. However, there is a practical side to this: is the source material that is needed to populate your database available? And do you have the time to find and enter all the material that the data model asks for?

The question of intended audiences is also important in this respect. Do you create this database for the sake of your research project only, or would it be beneficial for the research community if this collection of data were to be shared? In the latter case, the question of completeness gets a new dimension and the usage of external identifiers for your data becomes essential.

On a more fundamental level, you could also ask yourself whether your research process is question-driven or data-driven. In the early stages of your research process, a data-driven approach might help you to get a good understanding of the availability of the data. A data-driven approach also allows you to prototype your research question. However, as soon as time or resources are limited, you probably need to make decisions on which data to include and which pieces of information to leave out.

How to reference entries in a dataset and how to deal with conflicting sources?

Just like a textual publication, any scholarly database needs to include source references. Not only does this help you manage your references and sources, it also makes the statements in your database traceable (referencing archival records, other primary sources, or secondary literature). It is best to reference every statement you make in your database. You can do this by creating an additional column where you store source references:
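As a minimal sketch of such a source reference column, assuming a SQLite database created with Python's standard library: the table name `person_statement`, its columns, and all of the values below are hypothetical placeholders chosen for illustration, not data from an actual project.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Every statement is stored together with the source it is based on.
# NOT NULL makes it impossible to enter a statement without a reference.
conn.execute("""
    CREATE TABLE person_statement (
        id               INTEGER PRIMARY KEY,
        person           TEXT NOT NULL,
        attribute        TEXT NOT NULL,
        value            TEXT NOT NULL,
        source_reference TEXT NOT NULL
    )
""")

# Placeholder rows: names, values and references are invented.
conn.executemany(
    "INSERT INTO person_statement (person, attribute, value, source_reference) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Person A", "date_of_birth", "1601-02-03", "Archive X, register 12, fol. 3r"),
        ("Person A", "place_of_birth", "Town Y", "Biography Z (1999), p. 45"),
    ],
)


def statements_for(person):
    """Return (attribute, value, source_reference) tuples for one person."""
    cursor = conn.execute(
        "SELECT attribute, value, source_reference "
        "FROM person_statement WHERE person = ? ORDER BY id",
        (person,),
    )
    return cursor.fetchall()


for attribute, value, source in statements_for("Person A"):
    print(f"{attribute}: {value} [{source}]")
```

Making the reference column mandatory is a design choice rather than a requirement: it simply enforces the rule that every statement should be traceable.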

If you have multiple sources on the same statement, you can decide to include these references as well. You can either enter multiple source references per record, or you can create a record for each source reference. This last option gives you the opportunity to filter on statements made by specific sources.

Moreover, this also allows you to store contradicting statements with references to the conflicting source material:
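A rough sketch of the 'one record per source reference' option could look as follows. It assumes SQLite via Python's standard library; the table names, the source labels and the two dates given for the eruption of Vesuvius are illustrative placeholders, not a reproduction of an actual dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (
        id    INTEGER PRIMARY KEY,
        label TEXT NOT NULL
    );
    -- One row per statement per source: contradicting dates simply
    -- become two rows that point to different sources.
    CREATE TABLE event_date_statement (
        id        INTEGER PRIMARY KEY,
        event     TEXT NOT NULL,
        year      INTEGER,
        month     INTEGER,
        day       INTEGER,
        source_id INTEGER NOT NULL REFERENCES source(id)
    );
""")

# Placeholder sources and illustrative dates.
conn.executemany(
    "INSERT INTO source (id, label) VALUES (?, ?)",
    [(1, "Source A (placeholder)"), (2, "Source B (placeholder)")],
)
conn.executemany(
    "INSERT INTO event_date_statement (event, year, month, day, source_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Eruption of Vesuvius", 79, 8, 24, 1),
        ("Eruption of Vesuvius", 79, 10, 24, 2),
    ],
)


def dates_by_source(event):
    """All recorded dates for an event, each with the source it rests on."""
    return conn.execute(
        "SELECT s.label, d.year, d.month, d.day "
        "FROM event_date_statement AS d "
        "JOIN source AS s ON s.id = d.source_id "
        "WHERE d.event = ? ORDER BY d.id",
        (event,),
    ).fetchall()


def statements_from(source_label):
    """Filter on the statements made by one specific source."""
    return conn.execute(
        "SELECT d.event, d.year, d.month, d.day "
        "FROM event_date_statement AS d "
        "JOIN source AS s ON s.id = d.source_id "
        "WHERE s.label = ? ORDER BY d.id",
        (source_label,),
    ).fetchall()


for label, year, month, day in dates_by_source("Eruption of Vesuvius"):
    print(f"{year}-{month:02d}-{day:02d} according to {label}")
```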

You do not have to decide which of these dates you enter into your database as you can include all of them and reference them to the sources on which these are based. This can be done for dates of events but also for any other date or any other kind of statement. By doing so, you can recreate complete historiographical debates in your database.

Wikidata has also implemented this approach: e.g. a couple of statements on 'George Washington' have one or more source references.

How to deal with unknown/uncertain primary source material?

In discussions on visualising humanities data, a lot of attention is directed towards questions regarding 'visualising uncertainty'. This is a very valid discussion since humanities scholars rarely deal with uncontested sources. We need methods that visualise data based on the different levels of their certainty. Nevertheless, these methods are only of use when we have data that has statements on levels of certainty which can be processed by database/analysis/visualisation software. In other words: it's not so difficult to draw a lighter/dashed line that represents an uncertain relationship; it's much harder to find/create a dataset that has structured statements on the certainty of their relationships.

Sticking to the example of dates: there are several questionable shortcuts for dealing with uncertainty that appear regularly and which are easy to avoid. Oftentimes, a column titled 'date' is filled with statements like 'around 1650', 'may-june 1911', 'a Wednesday in April 1865', 'end of 1954', '1631?', and even '1631????'. Even though these statements convey information and meaning, it is hard for someone not familiar with the sources to assess them, and they are even harder to process for any database application, let alone analysis or visualisation software.

Four strategies that can be helpful here:

Store the separate elements of the date in separate columns. See the examples above on the eruption of Vesuvius. If you lack one element of the date, you just leave this column blank. This approach is intuitive, since it gives a clear overview of missing dates, and it also ensures interoperability. Some software packages require you to use a date format like 2017-03-23, while you might also want to be able to print your dates like 03-23-2017. Both options remain available with this approach. Moreover, you avoid the date formatting issues that arise when your data moves from one platform to another or from one language to another.
Decide upon rules for missing dates or estimated dates. If you know something happened in the middle of 1651, you could enter a date like 15-06-1651. You can decide upon rules like this for the middle of months, start of the year, end of the year, etc. The benefit of this approach is that you always have completed dates at your disposal, making it easy to use them in software packages that have a hard time processing incomplete dates.
Extend the previous point by adding a column that contains statements on the certainty of the entered dates. This column can contain a value like 'estimate', 'uncertain' or 'inferred', so you know that a date like '15-06-1651' was inferred by you, whereas a date followed by a statement like 'as in source' was found like that in the source.
You can also introduce columns like 'before date' and 'after date' to record a period in which something has taken place. When you encounter an undated letter, you can usually determine that it was sent in the period between two other letters. You can then date this letter by stating it was sent after the date of the first letter and before the date of the second letter. This method can be applied in other situations as well to approximately date paintings, dates of birth, protests, etc.
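The four strategies above can be combined in a single record. The following Python sketch is one possible way to do this, not a prescribed implementation: the class `DateStatement`, its field names, and the mid-year and mid-month defaults are assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical project rules for missing elements (strategy 2):
# unknown month -> middle of the year, unknown day -> middle of the month.
DEFAULT_MONTH = 6
DEFAULT_DAY = 15


@dataclass
class DateStatement:
    # Strategy 1: separate elements; None means "not known".
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    # Strategy 3: explicit statement on the certainty of the date.
    certainty: str = "as in source"
    # Strategy 4: the period in which something must have taken place.
    not_before: Optional[date] = None
    not_after: Optional[date] = None

    def is_complete(self) -> bool:
        return None not in (self.year, self.month, self.day)

    def resolved(self) -> "DateStatement":
        """Apply the project rules and mark the result as 'inferred'."""
        if self.year is None:
            raise ValueError("cannot infer a date without a year")
        if self.is_complete():
            return self
        if self.month is None:
            # Without a month, a day is not meaningful: use both defaults.
            month, day = DEFAULT_MONTH, DEFAULT_DAY
        else:
            month, day = self.month, DEFAULT_DAY
        return DateStatement(self.year, month, day, "inferred",
                             self.not_before, self.not_after)

    def iso(self) -> str:
        """Machine-readable form; only valid for complete dates."""
        return date(self.year, self.month, self.day).isoformat()

    def within_window(self, candidate: date) -> bool:
        """Does a candidate date fall strictly inside the before/after window?"""
        after_start = self.not_before is None or candidate > self.not_before
        before_end = self.not_after is None or candidate < self.not_after
        return after_start and before_end


# "Middle of 1651" becomes a complete, explicitly inferred date.
mid_1651 = DateStatement(year=1651).resolved()
print(mid_1651.iso(), mid_1651.certainty)

# An undated letter, sent between two dated letters (placeholder dates).
undated_letter = DateStatement(certainty="inferred",
                               not_before=date(1651, 3, 2),
                               not_after=date(1651, 5, 20))
print(undated_letter.within_window(date(1651, 4, 1)))
```

Because the original elements stay in their own fields, the inferred date never overwrites what the source actually says.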

These strategies for dealing with uncertain/vague dates can be applied to other types of information. Locations can be inferred as well. Statements on relationships can be followed by a column in which you classify these statements by levels of certainty (e.g. 'certain', 'probably', 'uncertain'). The most important point here is that you make every vagueness/uncertainty explicit by following a certain logic since logic can be interpreted by both man and machine.

A good example of this approach can be found in the Ecartico database project at the University of Amsterdam. Each relationship has a modifier which states how certain a relation is: http://www.vondel.humanities.uva.nl/ecartico/persons/6292.

How to deal with unique/specific objects in a table/type?

Scholars sometimes go too far in structuring their data. In art historical projects, some scholars tend to make different tables for the different kinds of people they encounter in their project. For instance: one table for all the artists, one table for all the collectors and traders, and one table for the muses of artists. This approach is problematic since there could be people who fall into all three categories and would therefore have to be entered three times. It's easy to avoid this problem: just create one table for all the people and classify these people based on the roles/relationships they have within your project.
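To sketch what 'one table for all the people' might look like, the example below uses SQLite via Python's standard library. The table names `person` and `person_role`, the role labels and the people themselves are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Every person is entered exactly once ...
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- ... and can be classified with as many roles as needed.
    CREATE TABLE person_role (
        person_id INTEGER NOT NULL REFERENCES person(id),
        role      TEXT NOT NULL,
        PRIMARY KEY (person_id, role)
    );
""")

# Placeholder people and roles.
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(1, "Person A"), (2, "Person B")],
)
conn.executemany(
    "INSERT INTO person_role (person_id, role) VALUES (?, ?)",
    [(1, "artist"), (1, "collector"), (1, "muse"), (2, "trader")],
)


def roles_of(person_id):
    """All roles recorded for one person, in alphabetical order."""
    rows = conn.execute(
        "SELECT role FROM person_role WHERE person_id = ? ORDER BY role",
        (person_id,),
    ).fetchall()
    return [role for (role,) in rows]


def people_with_role(role):
    """Names of everyone classified with a given role."""
    rows = conn.execute(
        "SELECT p.name FROM person AS p "
        "JOIN person_role AS r ON r.person_id = p.id "
        "WHERE r.role = ? ORDER BY p.name",
        (role,),
    ).fetchall()
    return [name for (name,) in rows]


print(roles_of(1))
print(people_with_role("artist"))
```

Person A appears once in `person` but can still be found as an artist, a collector and a muse.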

Another issue that occurs: never-ending spreadsheets. The longer you work on a project, the more attributes you encounter that might be relevant for your project. Technically, you could keep adding columns to keep track of these attributes, but this will probably result in numerous columns that are only used on a few occasions. If this happens, it's good to ask yourself whether it will ever be relevant to use these attributes to query your database. If data is highly specific, don't store it structurally. In these cases, a text/notes field could be most helpful to store and document very specific information.

How to use/import 'structured' data?

You might have (semi-)structured sources that can be used to populate your database. These can be in a non-digital form, like registers, catalogues, or member lists of societies. You might also have these lists already in a digital form: spreadsheets you or other scholars compiled before. Even though these lists can be very helpful, they rarely come with unique identifiers per record or a level of completeness that allows you to use them properly in a database or visualisation application. The following checks will help to identify some common challenges:

Are all the dates formatted consistently in a machine-readable format? See the strategies mentioned above that help to deal with vague/uncertain dates.
Have all the locations been disambiguated? In case locations have been identified by a string like 'Lemberg' or 'Springfield', an effort has to be made to identify which Lemberg and which Springfield is meant. This can be done by relating all used locations to a new table and supplementing them with external unique identifiers like geonames.org/702550 or geonames.org/2637194.
Just like locations, all other kinds of information may also need to be disambiguated. As soon as people are mentioned multiple times within a dataset, it's better to work with unique identifiers. This helps to avoid typos, and also helps you to deal with homonyms.
If information has been stored multiple times throughout a dataset, you will probably have to reconsider the structure of the data. Capacities of people can best be stored in a table that specifically deals with people instead of including the capacity of a person in every letter they have sent. More on this can be found in the previous blog post on a relational approach to a research process on epistolary networks.
To identify any typos or spelling variations in a dataset, a tool like OpenRefine can be of great help. Instead of smoothing out irregularities (and thereby discarding possibly valuable data), you can use OpenRefine to add unique identifiers, so that you can identify objects that have been spelled in different ways throughout the dataset.
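The idea of adding identifiers instead of overwriting spellings can be sketched in a few lines of Python. This is only an illustration, loosely modelled on the key-collision ('fingerprint') style of clustering that tools such as OpenRefine offer; it does not use OpenRefine itself, and the function names, the identifier format and the example names are all assumptions.

```python
import re
import unicodedata


def fingerprint(name: str) -> str:
    """Reduce a spelling to a normalised key: no accents, no punctuation,
    lower case, unique tokens in alphabetical order."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    tokens = re.split(r"[^a-z0-9]+", without_accents.lower())
    return " ".join(sorted({token for token in tokens if token}))


def assign_identifiers(spellings):
    """Map every original spelling to an identifier. Spellings that share
    a fingerprint get the same identifier; the spellings stay untouched."""
    id_by_key = {}
    id_by_spelling = {}
    for spelling in spellings:
        key = fingerprint(spelling)
        if key not in id_by_key:
            # Hypothetical identifier format: P0001, P0002, ...
            id_by_key[key] = f"P{len(id_by_key) + 1:04d}"
        id_by_spelling[spelling] = id_by_key[key]
    return id_by_spelling


# Invented example names.
ids = assign_identifiers([
    "Jansen, Pieter",
    "Pieter Jansen",
    "pieter  jansén.",
    "Pieter Janssen",
])
for spelling, identifier in ids.items():
    print(f"{identifier}  {spelling}")
```

A key like this only catches variation in order, case, accents and punctuation. A variant such as 'Janssen' ends up with its own identifier, and two different people with the same name would share one, so the proposed matches still need to be checked by hand.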

This list is not exhaustive and depends on the data you are working with. You might also need to look into source referencing of the statements made within the dataset. The basic idea here is that even though your data may be structured, it might not be actionable yet. The points brought forward in this blog post will hopefully help you get there.

data modelling database