Reconcile Textual Data
Reconciliation is the process of matching textual data with Objects or Categories stored in your nodegoat environment. Reconciliation Processes that you run in nodegoat are based on pattern matching. Patterns are generated based on data selected from your environment. These patterns are reconciled with textual data that is also present in your nodegoat environment. Some examples of Reconciliation Processes that you can run:
- Reconcile place names stored as strings against a gazetteer.
- Reconcile personal names mentioned in a text against a dataset of people.
- Reconcile books, professions, ideas, etc. mentioned in a text against one or more datasets of various kinds of entities.
This guide will show you how to reconcile place names mentioned in a text against a gazetteer. We will use the book Gedenkboek van den oorlog in 1870 en 1871 written by August Snieders and published in 1872 for this purpose. You can download an XML and .txt version of this book via the Digital Library for Dutch Literature. This book of remembrance of the Franco-Prussian War (1870-1871) aimed to offer the Dutch public a comprehensible account of the war.
Reconciliation Processes can be configured to run semi-automatically in order to check if a pattern has been matched correctly. This configuration is ideal for going through large amounts of textual data, while being able to accept or discard identified matches. Identified matches can be stored as 'Pattern Pairs' that can be used to automatically match new occurrences of the pattern.
Reconciliation Processes can also be configured to run fully automated. This configuration is ideal when you have little ambiguity in your Reconciliation Process and you want to efficiently reconcile textual data against another dataset.
Select the data
The first thing you need to do when you plan to run any Reconciliation Process is to decide what your source data will be and what data you want to reconcile against. One Object Type will provide the source data (e.g. texts) while another Object Type will provide the reconcile data (e.g. a list of locations). For both of these Object Types you need to select which Descriptions contain the data that will be the source data (e.g. value of the text) or the reconcile data (e.g. name of the location).
To illustrate this, we can define the source and reconcile data for the examples mentioned above:
- Reconcile place names stored as strings against a gazetteer
- Source Data: Objects of the Type 'Company' that have an Object Description 'Place' configured as 'String'. While these Objects have a name of a location listed for each company, these values cannot be used for visualisation or analysis functionalities at the moment.
- Reconcile Data: Objects of the Type 'City' that have an Object Description 'Name' configured as 'String'. These Objects have both a name of a location plus a latitude and longitude value stored for each place.
The Reconciliation Process aims to link a 'Company' Object to an Object of the Type 'City' based on a match of the company's 'Place' value and the 'Name' of a city. After this process, a geographical visualisation can be made of the companies.
- Reconcile personal names mentioned in a text against a dataset of people
- Source Data: Objects of the Type 'Publication' that have an Object Description 'Text' configured as 'Text (tags & layout)'. While you can search and filter for mentioned people, no references have been made from the people mentioned in the text to the Objects of people in the Object Type 'Person'.
- Reconcile Data: Objects of the Type 'Person' that have the Object Descriptions 'Family Name' and 'Given Name' configured as 'String'.
The Reconciliation Process aims to link a personal name mentioned in the text of a publication to an Object of the Type 'Person' based on a match of the personal name mentioned in the text and the 'Family Name' and 'Given Name' of a person. After this process, the texts will contain in-text references to Objects of the Type 'Person' that can be used to generate social networks and to easily browse the data.
- Reconcile books, professions, ideas, etc. mentioned in a text against one or more datasets of various kinds of entities
- Source Data: Objects of the Type 'Letter' that have an Object Description 'Transcription' configured as 'Text (tags & layout)'. While you can search and filter for mentioned entities, no references have been made from the entities mentioned in the text to the Objects of the Type 'Entity'.
- Reconcile Data: Objects of the Type 'Entity' that have an Object Description 'Name' configured as 'String'.
The Reconciliation Process aims to link the name of an entity mentioned in the text of a letter to an Object of the Type 'Entity' based on a match of the entity name mentioned in the text and the 'Name' of an entity. Alternatively, you can run this Reconciliation Process multiple times and change the Object Type to reconcile against each time. This allows you to store your data in separate Object Types like 'Book', 'Profession', 'Concept', etc. and target any of these Object Types in different processes. The resulting references allow you to generate a multi-modal network with multiple kinds of nodes.
For this Guide we will use the texts of each page of the book Gedenkboek van den oorlog in 1870 en 1871 as source data. This data is stored in Objects of the Type 'Page'. These Objects have an Object Description 'Text' that is configured as 'Text (tags & layout)'. The data to reconcile against are names of Objects of French municipalities. This data is stored as Objects of the Type 'Location'. These Objects have an Object Description 'Name' that is configured as 'String'.
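The split between source data and reconcile data can be pictured with a small sketch. This is only an illustration of the idea, not how nodegoat stores or matches data: the class names, field names, Object IDs, and the plain substring test are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Reconcile data: one Object of the Type 'Location' (hypothetical shape)."""
    object_id: int
    name: str  # stands in for the Object Description 'Name' (String)


@dataclass
class Page:
    """Source data: one Object of the Type 'Page' (hypothetical shape)."""
    object_id: int
    text: str  # stands in for the Object Description 'Text'
    place_refs: list = field(default_factory=list)  # IDs of linked Locations


def reconcile_exact(page, locations):
    """Link every Location whose name occurs literally in the page text.

    Assumption: a plain substring test replaces nodegoat's scored
    pattern matching, which is not reproduced here.
    """
    for location in locations:
        if location.name in page.text and location.object_id not in page.place_refs:
            page.place_refs.append(location.object_id)
    return page.place_refs


locations = [Location(1, "Metz"), Location(2, "Sedan")]
page = Page(10, "eene poging tot verdediging van het nog ongeschonden Metz te wagen")
linked = reconcile_exact(page, locations)
```

After running the sketch, `page.place_refs` holds the ID of the 'Metz' Location only, which is the kind of reference the Reconciliation Process stores for you.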
Create a Reconciliation Process
Once you know what data you plan to use, you can configure the Reconciliation Process. Enable these processes by going to Management and selecting 'Projects'. Edit your project and enable the System Process 'Reconciliation'. Go to the Data section of your environment. Go to the tab 'Processes', click 'Reconciliation', and click 'Add Reconciliation'.
Give the Reconciliation Process a name like 'Match French Municipalities in Pages'.
Source Value
Use the dropdown menu with the label 'Type' to select the Object Type 'Page'. You can use the blue 'filter' button to set a filter in order to use a subset of the Objects of this Type (e.g. to only use a limited amount of pages). Use the dropdown menu with the label 'Source' to select the Object Description 'Text'. You can use the green 'add' button to add additional Object Descriptions as Source data.
Reconcile / Test
Use the dropdown menu with the label 'Type' to select the Object Type 'Location'. You can use the blue 'filter' button to set a filter in order to use a subset of the Objects of this Type (e.g. to use locations from a specific geographical region). Use the dropdown menu with the label 'Value' to select the Object Description 'Name'. You can use the green 'add' button to add additional Descriptions as Reconcile Value. Use the sort icon to reorder the selected Descriptions and give them a higher or lower preference in the reconciliation pattern.
When you select multiple Descriptions as 'Reconcile / Test' data, the order of the selected Descriptions determines the weight of each Description. The higher the order of a Description, the more weight is given to its value.
Store Result
The next step is to configure how the matches should be stored in the Objects of the Type 'Page'.
Leave the 'Save' option set to 'Append'. Change this to 'Overwrite' if you want to overwrite existing data.
Since the Source data is configured as 'Text (tags & layout)' the 'Tags' option can be configured. Set this to 'First Match Only' if you only want to tag the first match. Set this option to 'All Identical Matches' in order to apply a match to all identical matches in the text. Set this to 'No' when you do not want to create a tag for a found match.
To store the found matches as references in an Object Description, Sub-Object Description, or Location Reference, you need to configure your model in such a way that this reference can be stored. This means that to store a reference to an Object of the reconcile data (in this case 'Location'), a reference to this Object Type needs to be present in the Object Type of the source data (in this case 'Page'). You can achieve this by creating an Object Description in the Object Type 'Page' that references the Object Type 'Location'. You can also achieve this by creating a Sub-Object that can have a Location Reference of the Type 'Location'. In this guide we will make use of this last option and use a Sub-Object with the name 'Place'.
Use the dropdown menu with the label 'Object Description' to select '[Place] Location Reference'.
While the option 'All Identical Matches' might improve user experience when reading longer texts with tags, it can lead to unintended matches. When the Object 'Metz' is matched in the sentence 'eene poging tot verdediging van het nog ongeschonden Metz te wagen', the option 'All Identical Matches' would then also tag 'Metz' in a following paragraph 'ultra-democratisch blad Journal de Metz', where another tag might be more applicable.
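The difference between the three 'Tags' settings can be sketched as follows. This is a rough illustration under assumptions: the function name `spans_to_tag` is hypothetical, and a literal search stands in for nodegoat's own matching.

```python
import re

TAG_MODES = ("No", "First Match Only", "All Identical Matches")


def spans_to_tag(text, matched_value, tags_option):
    """Return the (start, end) spans that would receive a tag.

    Assumption: identical matches are found with a literal search.
    """
    if tags_option not in TAG_MODES:
        raise ValueError(f"unknown Tags option: {tags_option!r}")
    if tags_option == "No":
        return []
    spans = [m.span() for m in re.finditer(re.escape(matched_value), text)]
    if tags_option == "First Match Only":
        return spans[:1]
    return spans


page_text = (
    "eene poging tot verdediging van het nog ongeschonden Metz te wagen\n"
    "ultra-democratisch blad Journal de Metz"
)
first_only = spans_to_tag(page_text, "Metz", "First Match Only")
all_identical = spans_to_tag(page_text, "Metz", "All Identical Matches")
```

With 'All Identical Matches' the second span covers the 'Metz' in 'Journal de Metz', which is the unintended tag described above.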
Click 'Save Reconciliation'.
Run a Reconciliation Process
You now see the newly created Reconciliation Process listed in your overview of Reconciliation Processes. Click the green 'run' button on the right side of this overview to run this Reconciliation Process.
The options that you can configure allow you to specify the manner in which the process runs: the size of each batch to be run, the level of automation when applying or discarding matches, and various levels of sensitivity to be used in the pattern matching process.
The default settings will run the Reconciliation Process in a semi-automated manner. This means that you will be presented with an overview of the found matches and are able to decide which matches you want to accept and which matches you want to discard. When you start using Reconciliation Processes, these settings are a good way to learn how these processes work.
Process Objects Batch
Use this to set the number of Objects to be processed in one batch. The batch size can be set between 1 and 20. If Auto-Save Results is set to 'None', all Objects in the batch will be displayed for further inspection.
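As a rough picture of what batching means, the sketch below splits a list of Object IDs into batches. The function name and the IDs are made up for the example; only the 1 to 20 range comes from the setting described above.

```python
def split_into_batches(object_ids, batch_size):
    """Split Object IDs into consecutive batches of at most `batch_size`."""
    if not 1 <= batch_size <= 20:
        raise ValueError("batch size must be between 1 and 20")
    return [
        object_ids[start:start + batch_size]
        for start in range(0, len(object_ids), batch_size)
    ]


# 45 hypothetical 'Page' Objects processed 20 at a time.
page_batches = split_into_batches(list(range(1, 46)), 20)
batch_sizes = [len(batch) for batch in page_batches]
```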
Auto-Save Results
Specify whether results should be auto-saved if Objects have been linked, auto-discarded if no Objects have been linked, or if no automated action should be taken.
- Auto-Save: Any Result: any match with a score above the Score Threshold value will be automatically saved. Objects with no results will be automatically discarded.
- Auto-Save: Fully Matched: only matches with a score of 100 will be automatically saved. All matches with a score higher than the Score Threshold value and lower than 100 can be resolved manually. Objects with no results will be automatically discarded.
- Auto-Discard: No Result: objects with no results will be automatically discarded.
- None: no automated actions will be taken. All matches need to be resolved manually and objects with no results need to be discarded manually.
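The four options can be read as a small decision table. The sketch below encodes that reading; the function names and the action labels 'save', 'manual', and 'discard' are assumptions for illustration, and the scores passed in are assumed to have already passed the Score Threshold.

```python
AUTO_SAVE_MODES = (
    "Auto-Save: Any Result",
    "Auto-Save: Fully Matched",
    "Auto-Discard: No Result",
    "None",
)


def action_for_match(mode, score):
    """Action for one match that already passed the Score Threshold."""
    if mode not in AUTO_SAVE_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "Auto-Save: Any Result":
        return "save"
    if mode == "Auto-Save: Fully Matched":
        return "save" if score == 100 else "manual"
    return "manual"


def action_for_object_without_results(mode):
    """Action for a source Object for which no match was found."""
    if mode not in AUTO_SAVE_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return "manual" if mode == "None" else "discard"
```

Read this way, 'Auto-Save: Fully Matched' saves a match with a score of 100 and leaves a match with a score of 66.67 for manual resolution.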
Score Threshold
Use this value to determine the minimum score of an Object match. Object matches with a computed score that is below this threshold will be discarded. The value of the threshold can be set between 1 and 100. Use a low value to allow for ambiguous Object matches. Use a high value to allow for unambiguous Object matches only.
For example: with a Score Threshold of 25 the text Versailles has two Object matches: 'Versailles (score: 100)' and 'Ailles (score: 66.67)'. Changing the Score Threshold to 80 will result in one Object match: 'Versailles (score: 100)'.
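The Versailles example can be retraced with a minimal filter. The scores are copied from the example; how nodegoat computes them is not shown here, and the function name is hypothetical.

```python
def apply_score_threshold(matches, threshold):
    """Keep the (name, score) matches that are not below the threshold."""
    return [(name, score) for name, score in matches if score >= threshold]


versailles_matches = [("Versailles", 100.0), ("Ailles", 66.67)]
kept_at_25 = apply_score_threshold(versailles_matches, 25)
kept_at_80 = apply_score_threshold(versailles_matches, 80)
```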
Score Multi-Match Difference
Set this value to specify the maximum difference in score when multiple Object patterns match on the same text. The highest scoring Object will be matched, plus any other Objects whose score lies within this difference of the highest score. The value can be set between 0 and 99. A value of 0 will keep only those Objects that have a score as high as the highest scoring Object. A value of 99 will keep every Object that scores above the Score Threshold. Use a high value to allow for multiple ambiguous Object matches. Use a low value to allow for singular unambiguous Object matches only.
For example: with a Multi-Match Difference of 25 the text Auberive has one Object match: 'Auberive (score: 100)'. Changing the Multi-Match Difference to 40 will result in two Object matches: 'Auberive (score: 100)' and 'Aubérive (Marne) (score: 66.67)'.
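The Auberive example can be retraced with a sketch that keeps every match whose score lies within the allowed difference of the best score. The function name is hypothetical and the scores are taken from the example.

```python
def apply_multi_match_difference(matches, max_difference):
    """Keep (name, score) matches within `max_difference` of the best score."""
    if not matches:
        return []
    best_score = max(score for _, score in matches)
    return [
        (name, score)
        for name, score in matches
        if best_score - score <= max_difference
    ]


auberive_matches = [("Auberive", 100.0), ("Aubérive (Marne)", 66.67)]
kept_at_25 = apply_multi_match_difference(auberive_matches, 25)
kept_at_40 = apply_multi_match_difference(auberive_matches, 40)
```

The two scores differ by 33.33, so the second match only survives once the allowed difference is raised above that gap.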
Score Multi-Overlap Difference
Set this value to specify the maximum difference in score when multiple Object patterns overlap on the same text. The highest scoring Object will be matched, plus any other Objects whose score lies within this difference of the highest score. The value can be set between 0 and 99. A value of 0 will keep only those Objects that have a score as high as the highest scoring Object. A value of 99 will keep every Object that scores above the Score Threshold. Use a high value to allow for multiple ambiguous Object matches. Use a low value to allow for singular unambiguous Object matches only.
For example: with a Score Multi-Overlap Difference of 50 the text Versailles has two Object matches: 'Versailles (score: 100)' and 'Ailles (score: 66.67)'. Changing the Multi-Overlap Difference to 10 will result in one Object match: 'Versailles (score: 100)'.
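Overlap concerns matches that share characters in the text, as 'Ailles' does inside 'Versailles'. The sketch below compares each candidate only with the candidates whose character spans overlap its own. The span representation and function names are assumptions for illustration, not nodegoat's internals.

```python
def spans_overlap(a, b):
    """True when two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def apply_multi_overlap_difference(candidates, max_difference):
    """Keep (name, span, score) candidates that score within `max_difference`
    of the best candidate overlapping them."""
    kept = []
    for name, span, score in candidates:
        # Every non-empty span overlaps itself, so this max() is never empty.
        best_overlapping = max(
            other_score
            for _, other_span, other_score in candidates
            if spans_overlap(span, other_span)
        )
        if best_overlapping - score <= max_difference:
            kept.append((name, span, score))
    return kept


# Spans are positions within the text "Versailles".
candidates = [("Versailles", (0, 10), 100.0), ("Ailles", (4, 10), 66.67)]
kept_at_50 = [name for name, _, _ in apply_multi_overlap_difference(candidates, 50)]
kept_at_10 = [name for name, _, _ in apply_multi_overlap_difference(candidates, 10)]
```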
Pattern Sensitivity
Set the sensitivity with which a value that is part of an Object's pattern is matched. The value can be set between 0 and 100. A value of 100 only allows the value to match as a whole word, i.e. when it is surrounded by whitespace or punctuation. A value of 0 allows the value to match anywhere. Use a low value to allow Object matches to occur as part of longer strings of text. Use a high value to require Object matches to be distinct words.
For example: with a Pattern Sensitivity of 60 the text Versailles has two Object matches: 'Versailles (score: 100)' and 'Ailles (score: 66.67)'. Changing the Pattern Sensitivity to 80 will result in one Object match: 'Versailles (score: 100)'.
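The two extremes of this setting can be sketched with regular expressions. The sketch only covers the behaviour described for 0 (match anywhere) and 100 (whole words only); how intermediate values behave is not modelled, and the function name and the case-insensitive search are assumptions.

```python
import re


def find_value(text, value, whole_word_only):
    """Return the spans where `value` occurs in `text`.

    whole_word_only=True mimics a sensitivity of 100,
    whole_word_only=False mimics a sensitivity of 0.
    """
    pattern = re.escape(value)
    if whole_word_only:
        # Require that no word character touches the match on either side.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


anywhere = find_value("Versailles", "Ailles", whole_word_only=False)
as_word = find_value("Versailles", "Ailles", whole_word_only=True)
full_name = find_value("Versailles", "Versailles", whole_word_only=True)
```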
Pattern Distance
Change this value to control the allowed maximum distance in characters between separate matched values by an Object's pattern. The lowest value of 1 only allows a single character between separate matched values. Any higher value allows for more characters between separate matched values.
For example: with a Pattern Distance of 1 the text Vrigne (Bois) will only match on the part Vrigne with 'Vrigne aux Bois (score: 88.76)'. Changing the Pattern Distance to 2 will match the full text of Vrigne (Bois) with 'Vrigne aux Bois (score: 88.76)'.
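One way to picture Pattern Distance is as a limit on the gap between the parts of a name. The sketch below is an assumption-laden illustration: it splits 'Vrigne aux Bois' into parts by hand, allows later parts to be missing, and does not compute scores. The function name is hypothetical.

```python
def match_with_distance(text, parts, max_distance):
    """Match the first part, then extend over later parts that start at most
    `max_distance` characters after the current end of the match.

    Later parts that are not found within that distance are skipped.
    Returns the matched stretch of text, or None if the first part is absent.
    """
    start = text.find(parts[0])
    if start == -1:
        return None
    end = start + len(parts[0])
    for part in parts[1:]:
        # Only look far enough ahead for the part to begin within the gap.
        window = text[end:end + max_distance + len(part)]
        offset = window.find(part)
        if offset != -1:
            end = end + offset + len(part)
    return text[start:end]


pattern_parts = ["Vrigne", "aux", "Bois"]
at_distance_1 = match_with_distance("Vrigne (Bois)", pattern_parts, 1)
at_distance_2 = match_with_distance("Vrigne (Bois)", pattern_parts, 2)
```

In this sketch the two characters ' (' between 'Vrigne' and 'Bois' exceed a distance of 1, so only 'Vrigne' is matched. With a distance of 2 the match runs through 'Bois'; the closing parenthesis is not part of the span the sketch returns.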
Extend Match
Check this checkbox to extend a matched value to the nearest whitespace or punctuation (i.e. whole word).
For example: with the option Extend Match disabled, the text Parisienne will only match on the part Paris with 'Paris (score: 50)'. With the option Extend Match enabled, the text Parisienne will match the full text of Parisienne with 'Paris (score: 50)'.
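Extending a match can be sketched as growing the matched span until it meets a character that is not a letter or digit. The function name and the use of `str.isalnum` as the boundary test are assumptions for illustration.

```python
def extend_match(text, start, end):
    """Grow the half-open span (start, end) to the surrounding whole word."""
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end


sentence = "de Parisienne bevolking"
paris_span = (3, 8)  # the part 'Paris' inside 'Parisienne'
extended_span = extend_match(sentence, *paris_span)
extended_text = sentence[extended_span[0]:extended_span[1]]
```

The extension changes only the span of text that is covered by the match; in the example above the score stays the same.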
Click 'Run Reconciliation' to run the Reconciliation Process. The Reconciliation Process runs and informs you about the results.
Resolving Object Matches
Running the Reconciliation Process with the default settings produces the following results:
Changing the Pattern Sensitivity to a value of 100 removes the matches that are not distinct words:
Changing the Score Threshold to a value of 100 removes the matches that have a score below this value:
Check the checkbox for the results you want to save and click 'Continue' to process the next batch of Objects. If no matches can be resolved, check the checkbox next to the label 'Discard'.
You can also decide to change the 'Auto-Save Results' option to 'Auto-Save: Any Result' or 'Auto-Save: Fully Matched' to automatically save resolved matches.
Once the process is finished, you can inspect the results in the Object Type that you selected under 'Source Value' (in this guide, the Object Type 'Page').