All posts by Léon Robichaud

In search of Canadian HGIS data

Researchers who study Canada have generated large quantities of geohistorical data for many years. While we reflect on the creation of a national geohistorical infrastructure, it is pertinent to identify datasets at different scales which can become a part of such a portal. We are therefore trying to enhance the discoverability of existing and available datasets. While it would be preferable, in the long run, to enumerate and describe each layer and each attribute table, it is not necessary for the moment to delve into such a detailed level of granularity. We hope, at this stage, to identify collections which have emerged from different research projects or from the online deposit of previously georeferenced digital data such as:

  • raster geographic maps
  • aerial photographs
  • vector layers
  • attribute data linked to vector layers

We have already identified datasets offered by different types of creators, so as to present the diversity in the nature and type of data which can interest researchers. We have therefore identified:

  • quality international data (FAO)
  • data from collaborative mapping projects (Open Street Map, Natural Earth)
  • data available on GIS company websites (ESRI)
  • national data (government of Canada, Géogratis)
  • provincial or territorial data (British Columbia, Yukon, Québec, Nova Scotia, Prince Edward Island, New Brunswick)
  • municipal data (Toronto, Montréal, Sherbrooke)
  • research team data (CIEQ, NICHE, LHPM, MAP, VIHistory)
  • data from map library and archive centres (Scholars’ Geoportal, MADGIC, GéoIndex+)
  • personal initiative data (historical railway lines)

Choosing what type of metadata to associate with each dataset has meant achieving a compromise. An insufficient level of detail would prevent effective searches, while requirements for overly detailed metadata could discourage data creators who are not trained to create metadata which meets international standards. According to Rodolphe Devillers, we can use six criteria to define the quality of a geospatial dataset1.

i. Definition: Allows the user to evaluate if the nature of a datum and of the object it describes, i.e. the “what”, meets his or her requirements (semantic, spatial and temporal definitions);

ii. Coverage: Allows the user to evaluate if the territory and the period for which the data exists, i.e. the “where” and the “when”, meet his or her requirements;

iii. Genealogy: Allows the user to know where the data came from, the project’s objectives when the data was acquired, and the methods used to obtain the data, i.e. the “how” and the “why”, and to verify if this meets the user’s requirements;

iv. Precision: Allows the user to evaluate the data’s worth and whether it is acceptable for the user’s requirements (semantic, temporal and spatial precision of the object and of its attributes);

v. Legitimacy: Allows the user to evaluate the official recognition and the legal standing of the data and whether it meets the user’s requirements (de facto standards, recognised good practices, legal or administrative recognition by an official agency, legal guarantee by a supplier, etc.);

vi. Accessibility: Allows the user to evaluate how easily the data can be obtained (cost, delays, format, privacy, respect of recognised good practices, copyright, etc.).

A metadata standard which would meet all of these criteria may seem overwhelming for many people who would like to make their data available. We therefore propose to use the format defined by the Dublin Core Metadata Initiative (DCMI), an international standard whose field types are easier to understand for people less familiar with metadata. We have applied and interpreted the DCMI based upon its general definition available on Wikipedia2 and on the interpretation of a few fields proposed by the Bibliothèque nationale de France3. This approach can certainly be criticised, because it is geared towards a simple application rather than perfection. Based on how metadata is entered in this list, we can refine these principles to improve this compromise. The fields do not appear in the same order as in the DCMI, and some are subdivided to provide a slightly finer level of granularity.

Table 1. List of fields used to describe datasets

Élément (French) Élément (English) Comment
Créateur Creator The main entity responsible for creating the content of the resource. It can be the name of one or many people, an organisation, or a service.
Format: Last name, First name.
Separate multiple entities with a semicolon.


Contributeur Contributor Entity responsible for contributing to the content of the resource. It can be the name of one or many people, an organisation, or a service.
Format: Last name, First name.
Separate multiple entities with a semicolon.


Titre Title Name given to the resource.
The title is generally the formal name under which the resource is known. Indicate the title in the language of origin of the resource. If the resource does not have a formal title and the title is derived from the content, place the title between square brackets.


Description.Générale Description.General A presentation of the content of the resource. Descriptions are generally in free-form text. As much as possible, use the description provided by the creators of the resource.


Description.Nature-du-projet Description.Project-type A keyword which allows us to categorise projects according to the following typology:

– governmental
– academic
– individual
– commercial
– collaborative


Description.Méthodologie Description.Methodology Free-form text which describes the process used to create the resource.


Description.Sources Description.Sources List of documents which were used to create the resource. This field is different from the field Source, which is used to identify where a user can acquire the resource.


Description.Champs Description.Fields List of fields used in the table or database, preferably with a description.


Date.Publication Date.Published Date when the resource was originally created. This is not necessarily the date represented by the resource.


Date.Mise-à-jour Date.Updated Date of an update event in the life cycle of the resource.


Couverture.Temps Coverage.Time Extent or domain of the resource; in this case, the date, year or period represented by the resource.


Couverture.Espace Coverage.Space Extent or domain of the resource; in this case, the territory. It is recommended to use a value from a controlled vocabulary.


Couverture.Niveau Coverage.Level A keyword which identifies the level of the spatial coverage of the resource:

– international
– national
– provincial
– regional
– municipal
– local


Sujet.ISO Subject.ISO A keyword which allows us to link the resource to one of the ISO categories of geospatial data.

– agriculture / farming
– biota / biota
– limites administratives / boundaries
– climatologie / climatology
– économie / economy
– élévation / elevation
– environnement / environment
– information géoscientifique / geoscientific information
– santé / health
– imagerie / imagery
– intelligence / intelligence (military)
– eaux intérieures / inland waters
– localisation / location
– océans / oceans
– urbanisme / planning
– société / society
– structure / structure
– transport / transportation
– services publics / utilities


Sujet Subject One or several keywords which can be used to categorise the resource.


Format Format The physical or, in this case, digital manifestation of the resource, i.e. the file format of the document:

– shp
– kml
– kmz
– zip
– csv
– other formats used in GIS


Langue Language The language of the intellectual content of the resource.
It is recommended to use a value defined in RFC 3066 [RFC3066], which, with the ISO 639 [ISO639] standard, defines two-letter primary language codes, as well as optional subcodes.
Examples:
– en
– fr


Type de ressource Type Type of content.
By default, the resources identified as part of this project are part of the dataset type.


Droits.Licence Rights.License Brief indication of the type of licence which applies to the data:

– copyright
– CC (or one of its variations)
– public domain
– open


Droits.Accessibilité Rights.Access One of the following terms will allow us to identify how the data can be accessed.

– free
– one time payment
– free subscription
– paid subscription


Droits.Conditions d’utilisation Rights.Terms of use Text copied and pasted from the website where the data is deposited to specify the creators’ terms of use.


Source Source Location from which a user can obtain the resource. This will generally be a URL. A Source.URI could be added should it become pertinent.


Relation Relation Link to other resources. A resource can be derived from another or can be associated with another as part of a project.
Examples: isPartOf [other resource number]
isChildOf [other resource number]
isDerivedFrom [other resource number]


Éditeur Publisher Name of the person, organisation or service which published the document.


Commentaire Comment Any additional information which can help users better understand the resource.



A list of identified resources is available here. Some of the records are incomplete and we are working on completing them. If you would like to propose a dataset, you can fill out the form available here.
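To make the proposed compromise concrete, a single entry in the list could be sketched as a simple key-value record. The field names follow Table 1; the values below are invented for illustration only, not an actual entry from the list.

```python
# A hypothetical dataset record using the fields from Table 1
# (all values are illustrative, not a real entry).
record = {
    "Creator": "Robichaud, Léon",
    "Title": "[Historical railway lines]",  # derived title, hence the square brackets
    "Description.Project-type": "individual",
    "Coverage.Time": "1850-1950",
    "Coverage.Space": "Québec",
    "Coverage.Level": "provincial",
    "Subject.ISO": "transport / transportation",
    "Format": "shp",
    "Language": "fr",
    "Type": "dataset",
    "Rights.License": "open",
    "Rights.Access": "free",
}

def matches(rec, **criteria):
    """Return True if the record meets every criterion (exact match per field)."""
    return all(rec.get(field) == value for field, value in criteria.items())

# A search interface could then filter records on any combination of fields:
print(matches(record, **{"Coverage.Level": "provincial", "Language": "fr"}))  # True
```

Even this flat structure is enough to support the faceted searches described below: each field becomes a facet, and dotted names (Coverage.Level, Rights.Access) preserve the finer granularity without requiring a full geospatial metadata standard.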

1  DEVILLERS, Rodolphe (2004). « Conception d’un système multidimensionnel d’information sur la qualité des données géospatiales », Ph.D. thesis, Université Laval.

2  Wikipedia contributors (2016). « Dublin Core ».

3  Bibliothèque nationale de France, Direction des Services et des Réseaux, Département de l’Information bibliographique et numérique (2008). « Guide d’utilisation du Dublin Core (DC) à la BnF : Dublin Core simple et Dublin Core qualifié, avec indications pour utiliser le profil d’application de TEL », version 2.0.

How do we find and link all this geohist information?

The volume of geohistorical data available on the web and stored in various databases is expanding rapidly as the geospatial turn gains momentum and as online mapping tools become more accessible. Historical maps can be situated with a bounding box or georeferenced with precision. Aerial photographs are assembled and georeferenced to analyse a region or to easily locate a specific sheet. Animated or static maps are increasingly being used to visualise phenomena which affected history at various scales: local (Don Valley Historical Mapping Project), regional (Map of how the Black Death devastated medieval Britain), national (American Panorama. An Atlas of United States History), continental (Mapping the Republic of Letters), trans-Atlantic (The Trans-Atlantic Slave Trade Database) or global (Time-Lapse Map of Every Nuclear Explosion, 1945-1998).

Faced with massive amounts of data, researchers are not just looking for the proverbial needle in the haystack. They need to search for many needles spread across many haystacks. Several initiatives have been undertaken, including by this group, to develop solutions which would improve accessibility to geohistorical data. Portals are generally viewed as a solution to bring together data which pertains to a given location or to the research interests of a group or an institution. Consciously or not, they are designed to showcase the work of a group or institution. We will still need portals as infrastructures to host and distribute geospatial data. But on their own, they will not resolve issues of discoverability, openness and interoperability.

Depending on how effective the developers are at search engine optimisation, a given portal will be more or less easy to find on the web. The user will generally land on the portal’s home page and will then use the system’s own search tools to identify the specific item or items related to her or his research. Some systems, such as GeoIndex+, combine faceted search with a spatial view to facilitate discovery. Others still rely on older catalogue inspired search engines.

Whether or not the desired data can be located, it may not be available for download. Apart from commercial licensing issues, many researchers are still reluctant to make their data available for download, but that is an issue for a separate post. Governments are gradually making data freely available, but there is still a chance that a researcher could end up digitising and georeferencing data which already exists in that form. At this point, the use of a file format incompatible with a researcher’s preferred software becomes a minor inconvenience.

Even when portal developers have the best intentions to make data available and downloadable, the lack of system interoperability makes cross-portal searches a difficult challenge to overcome unless they open APIs or make data available in a linked and open format. While APIs could resolve immediate issues, they would not solve the problems related to security, system maintenance and overhauls. I will therefore emphasise linked and open data as the most promising long-term solution to the problem.

Linked data “is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.” (Source). A World Wide Web Consortium (W3C) standard, it forms the basis for the semantic web as defined by Tim Berners-Lee.

Linked open data (LOD) relies upon the Resource Description Framework (RDF), which uses a subject – predicate – object grammar to make statements about resources. These triples, which can also be seen as entity – attribute – value structures (document X -> is a -> map), are machine-readable and use Uniform Resource Identifiers (URIs) to connect different elements together. LOD is already used to make information available and connected in projects such as DBpedia.
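The triple grammar can be sketched in a few lines of Python. The prefixes (ex:, dc:, rdf:, owl:, geonames:) stand in for full URIs, and every identifier below is invented for illustration; this is a toy in-memory store, not an RDF library.

```python
# RDF-style statements as (subject, predicate, object) tuples.
# Prefixes and identifiers are hypothetical, for illustration only.
triples = [
    ("ex:map42", "rdf:type", "ex:HistoricalMap"),
    ("ex:map42", "ex:depicts", "ex:Montreal"),
    ("ex:map42", "dc:date", "1825"),
    ("ex:Montreal", "owl:sameAs", "geonames:XXXXXXX"),  # link to an external authority
]

def query(s=None, p=None, o=None):
    """Naive triple-pattern match; None acts as a wildcard,
    much as a variable does in a SPARQL query."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which resources depict Montreal?"
print(query(p="ex:depicts", o="ex:Montreal"))
# [('ex:map42', 'ex:depicts', 'ex:Montreal')]
```

The owl:sameAs statement is the key move: because ex:Montreal is linked to an external authority record, a query that starts from this dataset can follow the link into GeoNames, DBpedia or any other store that uses the same URI, which is precisely what connects data from different sources.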

The data structures presented as RDF statements are defined by ontologies. The Spatial Data on the Web Working Group has been formed by the W3C to:

  • determine how spatial information can best be integrated with other data on the Web;
  • determine how machines and people can discover that different facts in different datasets relate to the same place, especially when ‘place’ is expressed in different ways and at different levels of granularity;
  • identify and assess existing methods and tools and then create a set of best practices for their use;
  • where desirable, complete the standardization of informal technologies already in widespread use.

[SDWWG Mission Statement]

Such an initiative will provide us with the tools and the infrastructure to make geohistorical data discoverable and accessible.

Unfortunately, LOD is not a simple solution to implement. Competing ontologies could emerge, which would limit interoperability unless bridges are built to define equivalences. Some institutions’ insistence on defining their own URIs, for place names for example, without connecting them to other authority lists can recreate the silos that we are trying to avoid. Many stakeholders need to open and offer their research data as RDF triples for the web of geohistorical data to emerge, as is already the case with DBpedia, GeoNames, and the World Factbook. Designed as infrastructure, LOD tools are still in development and do not have much of a “wow” factor which would bring visibility and investment. A pilot project with a strong front end will be required for people to understand what LOD can do, so that they will invest the resources required to publish geohistorical data as RDF triples.

There are still issues to be resolved, such as a standard ontology or a set of compatible ontologies. The SDWWG proposes compatibility with upper ontologies, as opposed to dependence upon a given world view of linked data [SDWWG Best Practices Statement]. We must also expect that different teams will publish their data at different levels of granularity. Some will at least provide metadata to indicate that a dataset has social and economic information about Montreal in 1825, while others could publish each data element at the household level. With regards to a scholar’s career, how can this type of publication be recognised for hiring, tenure and grants? The Collaborative for Historical Information and Analysis has studied data repository practices which can be useful as we move towards LOD. Finally, how will we flag data which is less than recommended for scholarly research? We will need to define peer review for an LOD world.

There are obviously more questions than answers at the moment, but linked and open data provides a long-term solution to discoverability and accessibility. Such a solution should be part of future portal designs.

To go further, the SDWWG lists a few publications and presentations. Catherine Dolbear and Glen Hart’s Linked Data: A Geographic Perspective (CRC Press, 2013) can also provide further guidance to the use of linked data from a geographic perspective. Any search for linked data or the semantic web will provide many useful results for additional reading. For historians, Philippe Michon’s M.A. thesis, « Vers une nouvelle architecture de l’information historique : L’impact du Web sémantique sur l’organisation du Répertoire du patrimoine culturel du Québec », is highly recommended.

Léon Robichaud
Professeur agrégé
Département d’histoire
Université de Sherbrooke