How do we find and link all this geohistorical information?

The volume of geohistorical data available on the web and stored in various databases is expanding rapidly as the geospatial turn gains momentum and as online mapping tools become more accessible. Historical maps can be situated with a bounding box or georeferenced with precision. Aerial photographs are assembled and georeferenced to analyse a region or to easily locate a specific sheet. Animated or static maps are increasingly being used to visualise phenomena which affected history at various scales: local (Don Valley Historical Mapping Project), regional (Map of how the Black Death devastated medieval Britain), national (American Panorama. An Atlas of United States History), continental (Mapping the Republic of Letters), trans-Atlantic (The Trans-Atlantic Slave Trade Database) or global (Time-Lapse Map of Every Nuclear Explosion, 1945-1998).

Faced with massive amounts of data, researchers are not just looking for the proverbial needle in the haystack. They need to search for many needles spread across many haystacks. Several initiatives have been undertaken, including by this group, to develop solutions that would improve access to geohistorical data. Portals are generally viewed as a way to bring together data which pertains to a given location or to the research interests of a group or an institution. Consciously or not, they are designed to showcase the work of that group or institution. We will still need portals as infrastructures to host and distribute geospatial data, but on their own they will not resolve issues of discoverability, openness and interoperability.

Depending on how effective its developers are at search engine optimisation, a given portal will be more or less easy to find on the web. The user will generally land on the portal’s home page and will then use the system’s own search tools to identify the specific item or items related to her or his research. Some systems, such as GeoIndex+, combine faceted search with a spatial view to facilitate discovery. Others still rely on older, catalogue-inspired search engines.

Whether or not the desired data can be located, it may not be available for download. Apart from commercial licensing issues, many researchers are still reluctant to make their data available for download, but that is a topic for a separate post. Governments are gradually making data freely available, but there is still a chance that a researcher could end up digitising and georeferencing data which already exists in that form. By comparison, the use of a file format incompatible with a researcher’s preferred software is a minor inconvenience.

Even when portal developers have the best intentions to make data available and downloadable, the lack of system interoperability makes cross-portal searches a difficult challenge to overcome, unless developers open APIs or make data available in a linked and open format. While APIs could resolve immediate issues, they would not solve the problems related to security, system maintenance and overhauls. I will therefore emphasise linked and open data as the most promising long-term solution to the problem.

Linked data “is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.” (Source). A World Wide Web Consortium (W3C) standard, it forms the basis for the semantic web as defined by Tim Berners-Lee.

Linked open data (LOD) relies upon the Resource Description Framework (RDF), which uses a subject – predicate – object grammar to make statements about resources. These triples, which can also be seen as entity – attribute – value structures (document X -> is a -> map), are machine-readable and use Uniform Resource Identifiers (URIs) to connect different elements together. LOD is already used to make information available and connected in projects such as DBpedia.
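As a minimal sketch of what such a triple looks like in practice, the following Python example uses the rdflib library to state that “document X is a map” and to link it to an external resource. The namespace, document URI and property names (geohist:, depicts) are invented for illustration; only the RDF and RDFS vocabularies are standard.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace for a geohistorical collection (illustrative only)
GEOHIST = Namespace("http://example.org/geohist/")

g = Graph()
g.bind("geohist", GEOHIST)

doc = URIRef("http://example.org/geohist/document/X")

# subject - predicate - object: "document X is a map"
g.add((doc, RDF.type, GEOHIST.Map))
g.add((doc, RDFS.label, Literal("Plan of the city of Montreal", lang="en")))

# Connect the document to an external, widely used URI (here, DBpedia)
g.add((doc, GEOHIST.depicts, URIRef("http://dbpedia.org/resource/Montreal")))

# Serialise the statements as Turtle (rdflib 6+ returns a string)
print(g.serialize(format="turtle"))
```

Because every element is a URI, another project can reuse the same identifiers and its statements will automatically connect to these.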

The data structures presented as RDF statements are defined by ontologies. The Spatial Data on the Web Working Group has been formed by the W3C to pursue the following goals:

  • to determine how spatial information can best be integrated with other data on the Web;
  • to determine how machines and people can discover that different facts in different datasets relate to the same place, especially when ‘place’ is expressed in different ways and at different levels of granularity;
  • to identify and assess existing methods and tools and then create a set of best practices for their use;
  • where desirable, to complete the standardization of informal technologies already in widespread use.
  [SDWWG Mission Statement]

Such an initiative will provide us with the tools and the infrastructure to make geohistorical data discoverable and accessible.
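The working group’s second goal, recognising that different datasets refer to the same place under different identifiers, can be pictured with an owl:sameAs bridge. The sketch below, again in Python with rdflib, merges two hypothetical fragments (all URIs, prefixes and property names are invented) and runs a semantic query that follows the bridge to gather facts recorded about the same place under two different names.

```python
from rdflib import Graph
from rdflib.namespace import OWL

# Two hypothetical fragments, each describing the same place with its own URI
# and its own vocabulary (everything below example.org is invented).
dataset_a = """
@prefix ex:   <http://example.org/archiveA/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:ville-marie rdfs:label "Montréal"@fr ;
               ex:hasCensusTable <http://example.org/archiveA/census/1825> .
"""
dataset_b = """
@prefix exb: <http://example.org/archiveB/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
exb:Montreal exb:hasMap <http://example.org/archiveB/map/1825> ;
             owl:sameAs <http://example.org/archiveA/ville-marie> .
"""

g = Graph()
g.parse(data=dataset_a, format="turtle")
g.parse(data=dataset_b, format="turtle")

# A SPARQL query that follows the owl:sameAs link, so statements made in
# archive A about ville-marie and in archive B about Montreal come back together.
q = """
SELECT ?p ?o WHERE {
  <http://example.org/archiveB/Montreal> owl:sameAs ?same .
  { ?same ?p ?o } UNION { <http://example.org/archiveB/Montreal> ?p ?o }
}
"""
for row in g.query(q, initNs={"owl": OWL}):
    print(row.p, row.o)
```

The equivalence link is what keeps the two archives from remaining isolated silos: neither has to adopt the other’s identifiers, they only have to declare that the identifiers refer to the same place.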

Unfortunately, LOD is not a simple solution to implement. Competing ontologies could emerge, which would limit interoperability unless bridges are built to define equivalences. Some institutions insist on defining their own URIs, for place names for example, without connecting them to other authority lists, which can recreate the very silos we are trying to avoid. Many stakeholders need to open and offer their research data as RDF triples for the web of geohistorical data to emerge, as is already the case with DBpedia, GeoNames, and the World Factbook. Designed as infrastructure, LOD tools are still in development and do not have much of a “wow” factor that would bring visibility and investment. A pilot project with a strong front end will be needed for people to understand what LOD can do and to convince them to invest the resources required to publish geohistorical data as RDF triples.

There are still issues to be resolved, such as agreeing on a standard ontology or a set of compatible ontologies. The SDWWG proposes compatibility with upper ontologies, as opposed to dependence upon a given world view of linked data [SDWWG Best Practices Statement]. We must also expect that different teams will publish their data at different levels of granularity: some will provide only metadata indicating that a dataset contains social and economic information about Montreal in 1825, while others could publish each data element at the household level. With regard to a scholar’s career, how can this type of publication be recognised for hiring, tenure and grants? The Collaborative for Historical Information and Analysis has studied data repository practices which can be useful as we move towards LOD. Finally, how will we flag data which falls short of the standards expected for scholarly research? We will need to define peer review for an LOD world.
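The two levels of granularity mentioned above could coexist in the same graph. The sketch below, a continuation of the earlier rdflib examples, describes a dataset only at the collection level with Dublin Core terms, then attaches a single household-level resource to it; the ex: namespace, the household URI and its properties are invented, and the GeoNames identifier for Montréal is given as an assumed example of an external authority.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

EX = Namespace("http://example.org/montreal1825/")          # invented namespace
DCAT = Namespace("http://www.w3.org/ns/dcat#")               # W3C data catalogue vocabulary

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("dcat", DCAT)
g.bind("ex", EX)

# Coarse granularity: only dataset-level metadata is published.
dataset = URIRef("http://example.org/montreal1825/dataset")
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Social and economic data, Montreal, 1825", lang="en")))
g.add((dataset, DCTERMS.temporal, Literal("1825")))
g.add((dataset, DCTERMS.spatial, URIRef("http://sws.geonames.org/6077243/")))  # assumed GeoNames URI for Montréal

# Fine granularity: each household becomes a resource of its own.
household = URIRef("http://example.org/montreal1825/household/42")
g.add((household, RDF.type, EX.Household))
g.add((household, DCTERMS.isPartOf, dataset))
g.add((household, EX.street, Literal("Rue Saint-Paul")))

print(g.serialize(format="turtle"))
```

Even the coarse description is useful: a harvester can discover that the dataset concerns Montreal in 1825 and point researchers to it, while teams with more resources can publish down to the household.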

There are obviously more questions than answers at the moment, but linked and open data provides a long-term solution to discoverability and accessibility. Such a solution should be part of future portal designs.

To go further, the SDWWG lists a few publications and presentations. Catherine Dolbear and Glen Hart’s Linked Data: A Geographic Perspective (CRC Press, 2013) also provides guidance on the use of linked data from a geographic perspective. Any search for linked data or the semantic web will return many useful results for additional reading. For historians, Philippe Michon’s M.A. thesis, « Vers une nouvelle architecture de l’information historique : L’impact du Web sémantique sur l’organisation du Répertoire du patrimoine culturel du Québec », is highly recommended.

Léon Robichaud
Associate Professor
Department of History
Université de Sherbrooke
