5. BioCASe Monitor Services


By Glöckler, F., Hoffman, J. & Theeten, F., published as: The BioCASe Monitor Service - A tool for monitoring progress and quality of data provision through distributed data networks, in: Lyubomir Penev (ed), Biodiversity Data Journal 1, 2013, Sofia: Pensoft. DOI: 10.3897/BDJ.1.e968. / Usage rights: Creative Commons CC0

Download the presentation "BIOCASE PROVIDER SOFTWARE & BIOCASE MONITOR SERVICE" given by Falko Gloeckler at the TDWG 2013 conference

Summary

The BioCASe Monitor Service (BMS) is a web-based tool for coordinators of distributed data networks that provide information to web portals and data aggregators via the BioCASe Provider Software. Building on common standards and protocols, it has three main purposes: (1) monitoring providers’ progress in data provision, (2) facilitating checks of data mappings with a focus on structure, plausibility and completeness, and (3) verifying compliance of provided data for transformation into other target schemas.

Herein two use cases, GBIF-D and OpenUp!, are presented in which the BMS is being applied for monitoring the progress in data provision and performing quality checks on the ABCD (Access to Biological Collection Data) schema mapping.

However, the BMS can potentially be used with any conceptual data schema and protocols for querying web services. Through flexible configuration options it is highly adaptable to specific requirements and needs. Thus, the BMS can be easily implemented into coordination workflows and reporting duties within other distributed data network projects.

Table of contents

1. Introduction

2. Project description

2.1. Technical Background

2.2. BioCASe Monitor Service – Use Cases

3. Discussion and Outlook

4. Additional information

5. References

 

1. Introduction


In international biodiversity data initiatives a common goal is to build distributed network infrastructures, e.g. in the Global Biodiversity Information Facility (GBIF) and the project Opening up the Natural History Heritage for Europeana (OpenUp!). These distributed networks consist of natural history institutions or local aggregators, which provide their data to global aggregation portals, web services or other data-consuming and data-transforming software. The main goal of these networks is to bring together locally distributed information and make it publicly available. These infrastructures are often implemented on a project basis and are thus financed by third-party funds. The success of the initiatives is often measured by progress indicators (e.g. number of data records or multimedia objects in a time period), which have to be recorded on a regular basis. Thus, continuous progress monitoring is indispensable for the production of high-quality status reports. However, the greater the number of involved partners, the greater the challenge for the project coordination to monitor progress in data provision of each provider throughout the consortium and over the entire lifetime of a project. Moreover, monitoring key indicators relevant for the project, such as the number of published data records per provider, involves recurring and time-consuming queries to each individual data source. Thus, it is highly desirable to facilitate progress monitoring by reducing the number of individually performed requests, which decreases the workload and saves time.
There are agreed data (exchange) standards and mandatory concepts, which depend on the focus and context of the data-providing network. The project coordination is then also responsible for the quality assurance of the data provided and is obliged to continuously check the compliance and consistency of the data sources.
Furthermore, distributed networks (especially Europeana) may involve transformations of the data structure into another schema prior to publication, e.g. if an intermediate level for data enrichment is included as in the case of the OpenUp! project (1). Scientific and technical managers of these networks may benefit from a tool enabling a semi-automated method of verification that the data is prepared for transformation into the output schema.
Generally, technical staff and end-users would benefit from a service that gives an overview of the data providers and indicates that the participating provider services are on-line.
This paper presents a service tool, the BioCASe Monitor Service (BMS), which has been developed to meet the above-mentioned objectives. No other currently available service combines these objectives in a single, easy-to-use tool. The BMS allows monitoring of data provision for a multi-partner consortium and facilitates the quality control of data sources that are connected to the network. The service is primarily intended for project coordinators of distributed networks and for scientific or technical staff assessing the quality, structure and completeness of published data according to the schemas used for data publication on the web.
The BMS has been developed through a collaborative effort between two data projects (GBIF-D and OpenUp!) dealing with biodiversity collection and observational data. However, it has the flexibility to be used by other communities of data providers as well.
 

2. Project description


Title: BioCASe Monitor Service

Study area description:
 

2.1. Technical Background


Data provision in distributed networks

In distributed biodiversity data networks the individual providing institutions (providers) manage their data supply by installing a technical infrastructure (e.g. a middleware) on top of their own databases. This is done in order to allow one or several services or aggregating web portals to access the data via a central interface. Data are retrieved directly from the database located on the provider side without resorting to a centralized public architecture for storage. With this approach, providers keep control over their data provision and are flexible in assigning institution-specific data policies. The middleware creates an abstraction layer by mapping the original data model (software- or institution-specific) to a common domain-specific (exchange) schema, e.g. the ABCD schema.
In this step the provider can define the information flow by filtering the fields of the source database that are relevant concepts for the target network. The abstraction layer can then be directly harvested by domain specific harvesting tools, such as the GBIF Harvesting and Indexing Tool (HIT). Re-harvesting of the data source is then possible in order to make available the changes in the data.
In some projects or initiatives this step is iterated, for example if data provided in the domain-specific format have to be transformed on an intermediate level into a more general (not domain-specific) structure. This facilitates data indexing and aggregation for web portals and services, which publish the data for the end-users.
A complex data flow such as this is applied in the OpenUp! project (see Fig. 1), which harvests natural history collection data in ABCD(EFG) format (domain-specific) to pass the data from the so-called OpenUp! Natural History Aggregator along to the Europeana portal, which consumes ESE (Europeana Semantic Elements) or EDM (Europeana Data Model).
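The mapping and filtering step that the middleware performs, as described earlier in this section, can be pictured with a small sketch. It is only an illustration of the idea and not how the BioCASe Provider Software is implemented: the source column names (`catalog_no`, `sci_name`, `locality`, `internal_note`) and the sample row are invented, and the ABCD concept paths are given as examples.

```python
# Hypothetical mapping from provider-specific columns to ABCD concept paths.
# Column names are invented; the paths are example ABCD concepts.
UNIT = "/DataSets/DataSet/Units/Unit"
MAPPING = {
    "catalog_no": f"{UNIT}/UnitID",
    "sci_name": (
        f"{UNIT}/Identifications/Identification/Result/TaxonIdentified"
        "/ScientificName/FullScientificNameString"
    ),
    "locality": f"{UNIT}/Gathering/LocalityText",
}


def map_record(row, mapping):
    """Expose only the mapped, non-empty fields of a source row,
    keyed by the concept path of the exchange schema."""
    return {
        concept: row[column]
        for column, concept in mapping.items()
        if row.get(column) is not None
    }


# Invented sample row; "internal_note" is not mapped and stays private.
source_row = {
    "catalog_no": "A-001",
    "sci_name": "Parus major",
    "locality": "Berlin",
    "internal_note": "loan pending",
}
mapped = map_record(source_row, MAPPING)
```

Because only columns listed in the mapping are exposed, the provider decides which information leaves the institution, which is the control over data provision described above.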

 

Figure 1. The architecture in distributed networks using the BioCASe Provider Software (BPS). This is illustrated just for one of many providers (left box) in the distributed network. The BioCASe Monitor Service (BMS) is used for checking the data compliance and requirements prior to the harvesting by the indexing tool (HIT). For the data provision to Europeana an additional transformation from the ABCD or ABCDEFG schema to the Europeana schema (ESE or EDM) is necessary. The requirements for the transformation are checked by the BMS at the provider side.

BioCASe Provider Software

The two biodiversity projects GBIF-D and OpenUp!, which are presented as examples in this paper (see section ‘Use cases’), use the BioCASe Provider Software (4) as middleware. It supports the submission of collection and observational data to distributed networks. Furthermore, it enables data providers to map their SQL-capable databases to XML schemas and offers an XML-over-HTTP web interface (see http://www.biocase.org/products/provider_software/, http://www.ibm.com/developerworks/xml/library/x-tiphttp/ and http://xmlrpc.scripting.com) for data access and data provision.

The BioCASe Provider Software (BPS) was primarily developed for data provision to its own European data portal BioCASE. However, it is widely used within the biodiversity data community, because its compliance with the Global Biodiversity Information Facility (GBIF) was already assured during development. It is natively configured for the ABCD schema (Access to Biological Collection Data) and its derivatives (see below), as well as for DarwinCore. However, the core component of the BPS is a generic wrapper library, which is capable of handling any conceptual XML schema. Therefore, there are no technical constraints on using the BPS in other areas of biodiversity informatics and natural science. A good example is the Geoscientific Collection Access Service for Europe (GeoCASE), which has been using the BPS since 2007 to aggregate data not only from paleontological, but also from mineralogical and geological data sources.
Furthermore, the relatively small effort in setting up the BPS enables a wide range of possibilities for data exchange, because multiple software products are able to harvest and interpret the same data sources. In the example of GeoCASE, GBIF harvests the paleontological, but not the geological data.

Domain-specific standards supported by the BioCASe Provider Software

Official standard schemas and ontologies are designed and ratified by the scientific community, i.e. the Taxonomic Databases Working Group (TDWG). This ensures interoperability across different projects and initiatives.
The following paragraphs briefly describe the data schemas that are most commonly used in natural history contexts and are supported by the BPS by default:

ABCD

The ABCD schema (Access to Biological Collection Data; currently in version 2.06) is a highly complex and extensible XML data model for information on natural history specimen collection and observational data. It has a hierarchical structure and accommodates both atomized and free-text data. Thus, it can be used for a wide range of data of varying quality. ABCD 2.06 is an accepted schema of the Biodiversity Information Standards TDWG and can be used for standardized data exchange in biodiversity contexts, e.g. the data provision to GBIF. It is compatible with many existing data standards. Documentation of the individual elements is available at http://wiki.tdwg.org/twiki/bin/view/ABCD/AbcdConcepts.

        - ABCD-EFG

The ABCD Extension For Geosciences (ABCD-EFG) was created to meet the specific needs of paleontological data. As there is a potential overlap in information on abiotic objects (e.g. stratigraphy, rock type), the extension was also designed to serve geological and mineralogical collection data without any biological information. Consequently, the extension enables data provision of biological, paleontological and geological collection data at the same time. This reduces the effort of mapping the data of natural history institutions that curate physical objects of all three domains.

        - ABCD-DNA

In order to provide DNA sample data together with their specimen data via the ABCD schema, generic concepts for supplementary contents (‘MeasurementsOrFacts’; see http://wiki.tdwg.org/twiki/bin/view/ABCD/AbcdConcept1339) would have to be implemented for DNA-specific values. As these values are relatively complex, the extension ABCD-DNA has been designed. It has been proposed to the TDWG as a new official standard schema for DNA data.

DarwinCore

DarwinCore (often abbreviated DwC) is a set of elements from different ontologies and schemas (e.g. Dublin Core; http://dublincore.org/specifications/) for biodiversity and collection data. It has a flat structure that can be extended by domain-specific modules (e.g. geospatial, invasive species). DarwinCore is an accepted TDWG standard. A stand-alone format, the DarwinCore Archive, is a self-described DarwinCore file. It is intended to ease the cataloguing of big datasets by processing them without requiring a live connection to the provider. This format is also useful for publications, because it can create a citable snapshot of a dataset. The BPS is able to convert ABCD data into the DarwinCore Archive format.

The BioCASe Protocol

The BPS communicates via its native query protocol (BioCASe Protocol; currently in version 1.3; http://www.biocase.org/products/protocols/). The BioCASe Protocol is an XML-over-HTTP format defined for sending SQL-like requests to the web service and receiving the respective response in XML format. It offers several ways to request metadata, e.g. the elements provided (“capabilities” request), the number of records (“search” request) or the number of unique values in a certain element (“scan” request). This information is relevant for the requesting software, for example the Harvesting and Indexing Tool (HIT) used and developed by GBIF. Through the BioCASe Protocol, the provided content is queried for indexing and harvesting. Furthermore, the protocol allows transmitted data to be partitioned by filtering and thus limiting the response. This is especially advantageous for handling large data sources, as well as for sequential harvesting. The BioCASe Protocol defines operators that allow diverse combinations of criteria for filtering data sources (e.g. comparisons: ‘Equals’, ‘isNull’, ‘lessThanOrEquals’, ‘greaterThan’, etc.; negation: ‘Not’, ‘isNotNull’; combination: ‘And’, ‘Or’). In addition, the protocol can combine the data with information on the operating system and the executed database queries, as well as warnings and error messages. Thus, it reports on the communication with the database and its status, giving comprehensible feedback on what is done in the background. This is particularly beneficial when a malfunction needs to be debugged.
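The way such filter operators combine can be illustrated with a minimal sketch. The operator names are those listed above; everything else is an assumption made for illustration and has not been checked against the protocol schema: the `filter` wrapper element, the `path` attribute, the nesting of operands as child elements, and the helper functions themselves. The concept paths and the value are examples only.

```python
import xml.etree.ElementTree as ET

# Operator names as listed in the text. The XML layout (a "path" attribute,
# operands as child elements) is an assumption for illustration only.
COMPARISONS = {"Equals", "lessThanOrEquals", "greaterThan"}
NULL_CHECKS = {"isNull", "isNotNull"}
LOGICAL = {"And", "Or", "Not"}


def comparison(op, path, value):
    """Compare the concept at `path` with a literal value."""
    if op not in COMPARISONS:
        raise ValueError(f"unknown comparison operator: {op}")
    node = ET.Element(op, {"path": path})
    node.text = str(value)
    return node


def null_check(op, path):
    """Test whether the concept at `path` is empty (or not empty)."""
    if op not in NULL_CHECKS:
        raise ValueError(f"unknown null-check operator: {op}")
    return ET.Element(op, {"path": path})


def combine(op, *operands):
    """Join sub-expressions with And/Or, or negate a single one with Not."""
    if op not in LOGICAL:
        raise ValueError(f"unknown logical operator: {op}")
    if op == "Not" and len(operands) != 1:
        raise ValueError("Not takes exactly one operand")
    node = ET.Element(op)
    node.extend(operands)
    return node


def to_filter_xml(expression):
    """Wrap an expression in a (hypothetical) filter element and serialize it."""
    root = ET.Element("filter")
    root.append(expression)
    return ET.tostring(root, encoding="unicode")


# Example: preserved specimens whose identifier is not empty.
UNIT = "/DataSets/DataSet/Units/Unit"
expression = combine(
    "And",
    comparison("Equals", f"{UNIT}/RecordBasis", "PreservedSpecimen"),
    combine("Not", null_check("isNull", f"{UNIT}/UnitID")),
)
filter_xml = to_filter_xml(expression)
```

Building the filter as a tree keeps the criteria composable: any expression can be negated or combined with another, which mirrors the combinations of criteria the protocol is described as supporting.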
Design description:

 
BioCASe Monitor Service

The BioCASe Monitor Service (BMS) is a web-based tool programmed in PHP (PHP: Hypertext Preprocessor) and JavaScript. It uses the BioCASe protocol for automated compilations and output of information necessary for monitoring several BioCASe data sources simultaneously and performing quality checks of individual data sources. The BMS consists of two interfaces: 1) the BioCASe Monitor, which is a catalogue of all registered data sources including some general metadata and relevant links, and 2) the BioCASe Mapping Checker, which gives a comprehensive overview of the respective mapping of a single data source and thus allows simple quality checks.

The BioCASe Monitor

The BioCASe Monitor is the entry point interface of the BMS. It consists of informative tables for each registered provider. These tables contain, at a minimum, a list of data sources, their access points (URI of the particular BPS data source), total number of records and date of last modification. For any concept of the provided data schema, a column can be displayed with the count of total and distinct values (see Fig. 2). These tables are flexible and can be customized in the configuration file depending on the progress indicators required for monitoring. For example, the BioCASe Monitor can be used for rapid assessment of counts of taxon names, multimedia objects and locality names. Furthermore, it allows the detection of erroneous duplication of a unique identifier field.

Figure 2. Each concept can be flexibly displayed with the total and distinct count of values.
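The relationship between total and distinct counts, and how a mismatch reveals duplicated identifiers, can be shown with a short sketch. It works on an in-memory list of invented values and is not how the BMS obtains its counts; the function name `summarize_concept` and the decision to skip empty values are assumptions.

```python
from collections import Counter


def summarize_concept(values):
    """Return total count, distinct count and duplicated values for one concept.

    None and empty strings are treated as missing and skipped
    (an assumption made for this sketch).
    """
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return {
        "total": len(present),
        "distinct": len(counts),
        "duplicates": duplicates,
    }


# Invented identifier values; "A-002" occurs twice.
unit_ids = ["A-001", "A-002", "A-002", None, "A-003"]
summary = summarize_concept(unit_ids)
# A field meant to be unique is suspicious when total exceeds distinct.
has_duplicate_ids = summary["total"] > summary["distinct"]
```

For a concept such as a taxon name, a distinct count below the total is expected; for a unique identifier field it indicates erroneous duplication.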

The greater the number of data sources, the greater the need to group them into logical units. Therefore, the BioCASe Monitor offers the possibility of creating blocks of data sources, which can be configured as collapsible (and expandable) boxes (see Fig. 3). As a result, the administrator is able to configure several independent groups of data sources with specific settings.