4. BioCASe provider software installation and configuration


Download the presentation "BIOCASE PROVIDER SOFTWARE & BIOCASE MONITOR SERVICE" given by Falko Gloeckler at the TDWG 2013 conference

Download the presentation "BioCASe_Workshop" given by Jörg Holetschek and Gabriele Dröge in Berlin, 2011

Summary

In distributed data networks it is highly advantageous to use common standard schemas for data provision to aggregating data portals like Europeana, as every partner might have an individual data model. The BioCASe Provider Software (BPS) is a web-based middleware which provides the possibility to map a SQL database to conceptual XML data schemas like ABCD (Access to Biological Collection Data). It offers a web-interface which can be harvested by the aggregators by using the BioCASe Protocol for communication. The underlying individual databases do not need any structural changes and can be used as before.

 

Table of contents

1. BioCASe Provider Software 3.2

2. Checklist installation and configuration of the BioCASe web service

3. BioCASe as a web service wrapping databases

4. BioCASe installation (prerequisites and procedure)

5. Installing the BioCASe provider/PyWrapper on your computer

6. BioCASe on a virtual machine

7. BioCASe and its back-end database

8. IP, DNS and the connection of the BioCASe provider to networks

9. Editing and copy/pasting the configuration file

 

Further Reading

 

1. BioCASe Provider Software 3.2

XML and Darwin Core Archives are now just a click away with BioCASe 3.2. The development team at the Biological Collection Access Service (BioCASe) has released a new version of its BioCASe Provider Software. Information published using the new version, BioCASe 3.2, can be easily stored as XML or Darwin Core Archives.

With the new archives supported by BioCASe, harvesting and indexing processes will be more efficient and less error-prone. Data publishers can now easily switch between using the simpler Darwin Core standard and the richer Access to Biological Collection Data (ABCD) schema; special networks relying on the ABCD schema can choose between using the traditional web services or the new XML archives. For more information on BioCASe and the new software version, visit http://www.biocase.org.

 

2. Checklist installation and configuration of the BioCASe web service


Checklist: Installation of BioCASe

  1. Install an Apache HTTP Server (v. 2.2).
  2. Install Python.
  3. Download and unpack BioCASe (see http://wiki.bgbm.org/bps/index.php/Installation).
  4. Run the installation script of BioCASe from the console ("python setup.py").
  5. Adapt the configuration file of Apache (httpd.conf or /site/<sitename>) with the output of the setup script (setting of the alias and web directory). BioCASe should already be visible from the web at this point, but additional libraries need to be installed.
  6. Install the additional Python packages and libraries (such as DB drivers). Check them under: Local BioCASe > Utilities > Library test
  7. Change the main password. Location: Config tool > System administration

Checklist: Registration and mapping of a new dataset

  8. Create a new DSA. Location: Config tool > System administration > Datasources > Create DSA
  9. Refine the database connection. Location: Config tool > Datasource administration > Database Connection > Edit Connection parameters
  10. Register aliases, primary keys and foreign keys for the tables and views to be mapped. Location: Config tool > Datasource administration > Database Structure > Edit DB Structure. Click "Save" and then "Refresh" after defining a new table in order to access the Primary Key/Foreign Key parts.
  11. Choose the mapping schema. Location: Config tool > Datasource administration > Schemas > List "Map your DB against a new schema"
  12. Enter the mapping form. Location: Config tool > Datasource administration > Schemas > Link with schema name

 

3. BioCASe as a web service wrapping databases


Related question: Does BioCASe store the data? Is it a database system? [1]

BioCASe is a web application that is installed on top of a database and exposes the data of this source database to the web. It is intended for databases containing curatorial, taxonomic, and ecological information. BioCASe does not store any data itself (unlike Access, Excel, or BRAHMS); it depends on the external source database system that contains the data (accessed via SQL or ODBC).

All queries and results are exchanged over the Internet, and both are standardized into a common XML format. By exchanging queries and data in XML, applications using BioCASe are independent of the technical implementation and structure of the original database. BioCASe is thus an XML over HTTP application that allows the client requesting the data to remain 'structure-agnostic' towards the server-side database containing the source data. BioCASe:

  • enables users to define a correspondence between the structure of their database and the structure of the XML schema,
  • provides an XML over HTTP interface to interpret queries (sent in the BioCASe format as HTTP parameters or as XML) and return the results as an XML document,
  • features a graphical search interface (called 'QueryTool') that converts the XML into a human-readable representation.
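As a rough illustration of the XML over HTTP interface, the following Python sketch builds the URL of one datasource and, optionally, fetches the response. The script name pywrapper.cgi and the dsa parameter are taken from the example URL given in section 9 of this manual; the base URL, the datasource name, and the two function names are placeholders chosen for this example, and the exact content returned depends on your installation.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def build_access_url(base_url, datasource):
    """Return the pywrapper.cgi URL for one datasource (DSA)."""
    parts = urlsplit(base_url)
    # Append the script name to the web folder of the installation.
    path = parts.path.rstrip("/") + "/pywrapper.cgi"
    query = urlencode({"dsa": datasource})
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


def fetch_response(url, timeout=30):
    """Send a plain HTTP GET request and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Placeholder values: replace them with your own server and datasource.
url = build_access_url("http://localhost/biocase", "NameOfMyDataset")
print(url)  # http://localhost/biocase/pywrapper.cgi?dsa=NameOfMyDataset
```

The network call is kept in a separate function so that the URL can be checked first; fetch_response(url) only works once a provider is actually running at that address.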

Several BioCASe providers can provide data to a network whose portal acts as a central gateway to all source databases. BioCASe is already being extensively used as a data provider for the GBIF network and the BioCASe network. These networks are semi-decentralized: part of the data are harvested and "cached" into a central database belonging to the network, other information is queried via live requests sent to the provider. Special interest networks using BioCASe include the Australian herbaria network (using the HISPID extension of ABCD) and the DNA Bank Network (using ABCD-DNA).

 

4. BioCASe installation (prerequisites and procedure)


Related question: How do I install BioCASe? [1]

In order to install the BioCASe provider software (also known as "PyWrapper") you need to have:

  • Python installed on your computer,
  • an HTTP server (preferably Apache HTTP Server) with its Python module,
  • an SQL database system (MySQL, PostgreSQL, Oracle, MS SQL Server, ODBC, etc.),
  • the Python libraries for your SQL database (the PyWrapper can guide you through their installation),
  • some other Python libraries (e.g., to use an internal search engine called the QueryTool), which can be installed after the installation of the main BioCASe package.

BioCASe can be installed on Linux, Windows, and Mac. The installation process is divided into three major steps:

  • run the script "setup.py" located at the root of the web folders,
  • update the configuration of your Apache HTTP server (the installation script generates a sample configuration to be copied and pasted),
  • use a module of the BioCASe provider software that checks that all the needed libraries are installed.

For a more complete description of the installation process, please see the documentation at the PyWrapper Wiki.
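The idea behind the library check can be sketched in a few lines of Python. This is not the code of the BioCASe library test itself; it only shows how one might verify, before or after installation, whether some commonly used database driver modules can be found. The selection of modules below (MySQLdb, psycopg2, pyodbc) is an assumption made for this example, and the libraries actually required by your BioCASe version may differ.

```python
import importlib.util

# Hypothetical selection of driver modules for this example;
# the list checked by BioCASe itself may be different.
CANDIDATE_MODULES = {
    "MySQL": "MySQLdb",
    "PostgreSQL": "psycopg2",
    "ODBC": "pyodbc",
}


def check_modules(modules):
    """Return a dict mapping each label to True if its module can be found."""
    return {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }


for label, available in check_modules(CANDIDATE_MODULES).items():
    status = "installed" if available else "missing"
    print(f"{label}: {status}")
```

Only the drivers for the database system you actually use need to be present; a "missing" entry for another system is not a problem.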

5. Installing the BioCASe provider/PyWrapper on your computer


Related question: Can I install the BioCASe provider/PyWrapper on my own computer? [1]

It is technically possible to install BioCASe on your own computer but it is suggested to do so only for testing purposes. As the BioCASe provider is a web service that must be connected to a network, it should be installed on a server with a permanent Internet connection, static IP, and domain name.

If you do not have direct access to the servers of your institution, it is suggested to:

  1. have BioCASe installed and running on the server, and leave it empty at this stage,
  2. ensure that you also have a database server ready on your computer,
  3. install on your own computer:
    1. the BioCASe PyWrapper,
    2. the Apache HTTP Server,
    3. a copy of your SQL database. Alternatively, you can connect the BioCASe provider installed on your local computer directly to the database of the institution if your network administrator allows you to do so.
  4. Define a first ABCD mapping on your local computer.
  5. Once the mapping is functional on your own computer (it is available through http://localhost/<myBioCASeWebfolder>):
    1. Copy your database to the server of your institution or ask your colleagues in charge of the server to do so. If possible, do not rename your database.
    2. Copy the config folder of your own BioCASe dataset to the one located on your institution's server: <BiocaseFolder>/config/datasources/<Name_of_the_datasource>
  6. As a last step, you will probably have to change the connection settings (user name and password) from those of your local database to those of the server database (see Fig. 3).
  7. If the two databases have the same name and structure and they run on the same type of SQL server, BioCASe should be functional at this stage.
 

Fig. 3 Editing PyWrapper connection parameters

 

6. BioCASe on a virtual machine


Related question: Can I install the BioCASe provider/PyWrapper on my own computer? [1]

As BioCASe requires the installation of several low-level libraries (Python database drivers, XML libraries, GraphViz to visualize your database structure), it can be worthwhile to install BioCASe in the guest system of a virtual machine. This virtual machine can be moved between your local computer and the server of the institution as a single package.

By doing so, the installation of the libraries and the database, the configuration of Apache, and the definition of the mapping occur only once. However, you have to check the network configuration (definition of the IP address and of the machine name) when you move your virtual machine from one server to another. A server for virtual machines requires a substantial amount of resources but can be installed on modern PCs and laptops. Virtual machines greatly ease backup, migration of data, and copying between several servers.

Several servers for virtual machines are available, among others:

  • VirtualBox (formerly Sun, now Oracle; free, features a graphical interface),
  • OpenVZ (free and open source; no native graphical interface, although plug-ins exist),
  • VMware (commercial, with a graphical administration interface; some major components are freely available).

Most of these virtual servers recognize OVF (Open Virtualization Format), which allows cross-platform replication of a virtual machine.

 7. BioCASe and its back-end database


Related question: How do I connect an SQL database to BioCASe? [1]

BioCASe is intended to work only with SQL databases, and it connects to these databases by means of low-level software called a 'driver' (which must be installed on the server as a prerequisite).

Fig. 4 Configuration page of the BioCASe (version 2.6.0) provider for connecting to an SQL database by generating the appropriate connection string. The drop-down list presents the recognized drivers for SQL databases.

 

The database management system MS Access is compatible with SQL but unfortunately this is not as easy with MS Excel. It is possible, however, to connect an ODBC driver to Excel and thus make it compatible with SQL (performance in the network, however, will still be very poor). Several desktop database applications intended for specific biological work (curatorial management, definition of taxonomic keys) exist that are not compatible with SQL in their native state. These often have a very good interface that allows an easy visualisation of the scientific content. To use these (or other non-SQL databases) and publish your data with BioCASe, you will have to build a replica of your database (or of the part you want to publish) in a SQL system, and transfer the data between the two systems, e.g. in tab-delimited or CSV-format[2].

Differences between the encodings used in the source database, the intermediate CSV document, and the destination database connected to BioCASe are a classic source of technical problems in this process (data can end up being published with unreadable characters replacing diacritic marks). Please ensure that the CSV documents and the databases share the same encoding (UTF-8, LATIN-1, etc.).
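If the export from the source database and the destination database do not use the same encoding, the CSV document can be converted before it is imported. The following Python sketch shows one possible way to do this, assuming that the export is in LATIN-1 and the destination expects UTF-8; the function name, the file names, and the sample record are made up for this example.

```python
import csv
import os
import tempfile


def reencode_csv(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Copy a CSV file row by row while changing its text encoding.

    Returns the number of rows written (including the header row).
    """
    rows_written = 0
    with open(src_path, newline="", encoding=src_encoding) as src, \
         open(dst_path, "w", newline="", encoding=dst_encoding) as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Demonstration with a temporary LATIN-1 file containing diacritics.
    folder = tempfile.mkdtemp()
    source = os.path.join(folder, "export_latin1.csv")
    target = os.path.join(folder, "export_utf8.csv")
    with open(source, "w", newline="", encoding="latin-1") as handle:
        handle.write("collector,locality\r\nDröge,Müggelsee\r\n")
    print(reencode_csv(source, target), "rows converted")
```

Declaring the source encoding explicitly matters here: if the wrong encoding is given, Python either raises an error or silently produces the unreadable characters described above.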

Fig. 5 Full workflow of the submission of data to OpenUp! from a source desktop database to the portal. The BioCASe provider intervenes in the 2nd step.



[2] CSV (comma-separated values) and tab-delimited formats are two ways of storing tabular information (a field-delimited sheet of data featuring rows and columns) in a simple text document that can be opened with a plain text editor. Commas, semicolons, tabs or other characters are used to separate the columns in the document. Most database systems and spreadsheet software can export and import data in this format.

 

 8. IP, DNS and the connection of the BioCASe provider to networks


Related question: How do I configure my server to let users access BioCASe? [1]

Once users have configured a dataset with the BioCASe provider, they receive what is called an "access point". This access point is a URL identifying both the server containing the provider software and the name of the resource. There is a separate access point for each collection registered in the provider. This URL is the gateway to the collection and must be communicated to portals and networks (like GBIF) so that they can access your data together with those of the other providers by means of a common search interface.

The access point can be accessed by clicking on the link containing the dataset name from the first page of BioCASe. In the example (see Fig. 6) the access point is: http://gbif.africamuseum.be/biocase_rmca/dsa_info.cgi?dsa=demo_openup

This access point is supposed to be permanent (or ‘as permanent as possible’). For this reason we suggest that users of the BioCASe provider willing to contribute their data to a federated network have at least an Internet domain name and a static IP at their disposal, and use the domain rather than the IP (that may change more often in time) in the access point URL.

Please remember that, as for most websites, two services are involved in resolving the URL of a BioCASe access point:

  1. The DNS of your institute, which must first resolve the domain of the provider. This is the left part of the URL before the first single '/', in our example: http://gbif.africamuseum.be
  2. The HTTP server configured on this machine, which resolves the remaining part of the URL (/biocase_rmca/dsa_info.cgi?dsa=demo_openup). This server will most likely be Apache, an open source HTTP server for webpages. The HTTP server directs the right part of the URL to the appropriate website made available within the domain.
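The two parts of the access point can be separated with the Python standard library, which may help when checking which service is responsible for a problem. The sketch below uses the example URL from this section; the function name is chosen for this example only.

```python
from urllib.parse import urlsplit


def split_access_point(url):
    """Split an access point URL into the host name and the server-side part.

    The host name is what the DNS has to resolve; the rest (path and
    query string) is handled by the HTTP server on that host.
    """
    parts = urlsplit(url)
    server_part = parts.path
    if parts.query:
        server_part += "?" + parts.query
    return parts.hostname, server_part


host, rest = split_access_point(
    "http://gbif.africamuseum.be/biocase_rmca/dsa_info.cgi?dsa=demo_openup"
)
print(host)  # gbif.africamuseum.be
print(rest)  # /biocase_rmca/dsa_info.cgi?dsa=demo_openup
```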

The BioCASe provider contains an installation script that also helps to configure the Apache HTTP server appropriately. This script is called 'setup.py' and must be run in a command prompt (Microsoft Windows) or in a shell (Linux-based systems). Beforehand, the DNS needs to be configured (normally this is already the case if the server you use provides webpages to the public).

NB: it is possible to link several domains to a single IP address. In that situation the system administrator must pay special attention to keeping the DNS and the Apache HTTP server synchronized. The domain names used in the configuration of both services should be the same.

 

Fig. 6 The BioCASe page providing the Access point to a dataset (i.e. the URL to be communicated to the network operating the portal where data are provided)



 

9. Editing and copy/pasting the configuration file


Related question: Can I replicate datasets between several installations of the BioCASe provider? [1]

If you plan to map several databases, it is suggested to create smaller datasets, where you can refine the mapping, instead of one big mapping file containing all the tables. If your database tables and views all share the same structure and field names, you can generate smaller datasets from a big one relatively easily.

  1. In your file system go to <BiocaseFolder>/config/datasources
  2. Copy the main subfolder with the name of your original dataset into another folder having the name of the new dataset you want to create (see Fig. 7).

Fig. 7 Folder where a datasource is located

  3. The folder "NameOfMyDataset" matches the name of the dataset in your access point URL (e.g.: http://localhost/biocase/pywrapper.cgi?dsa=NameOfMyDataset). Copying and pasting the subfolder creates a valid BioCASe dataset with the same structure and the name of the new subfolder.
  4. In your SQL database, create a view which has the same structure (same number of fields and same field names) as the table (or view) you connected previously, with the appropriate filter in its "WHERE" clause.
  5. Open your new dataset in BioCASe, go to "configure" and then "database structure" -> "edit database structure".
  6. In the column 'Table', replace the name of your previous table with that of your new view and save your configuration (see Fig. 8).
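The view mentioned in the steps above can be illustrated with a small, self-contained Python example. It uses an in-memory SQLite database purely for demonstration; the table name, the column names, and the sample records are invented for this sketch, and the SQL syntax of your own database system may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table holding the records of several collections.
conn.execute(
    "CREATE TABLE specimens (id INTEGER PRIMARY KEY, collection TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?)",
    [
        (1, "birds", "Corvus corax"),
        (2, "fishes", "Salmo trutta"),
        (3, "birds", "Pica pica"),
    ],
)

# The view keeps the same fields and field names as the table,
# but its WHERE clause restricts it to one collection.
conn.execute(
    "CREATE VIEW specimens_birds AS "
    "SELECT id, collection, name FROM specimens WHERE collection = 'birds'"
)

rows = conn.execute("SELECT id, name FROM specimens_birds ORDER BY id").fetchall()
print(rows)  # [(1, 'Corvus corax'), (3, 'Pica pica')]
```

Because the view exposes the same field names as the original table, the existing mapping can be reused unchanged once the table name has been replaced in the database structure.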

This new dataset should be accessible and visible in the list of DSA of the provider.

The same procedure can be applied to move or replicate the configuration of a dataset from one instance of the BioCASe pyWrapper to another one (don't forget to move the data and/or to update the setting of the connection to the database in parallel).
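Copying a datasource folder can also be scripted. The following Python sketch copies one subfolder of <BiocaseFolder>/config/datasources to a new name and refuses to overwrite an existing dataset. The function name and the folder and file names used in the demonstration are placeholders; the files actually found inside a datasource folder depend on your BioCASe version.

```python
import shutil
import tempfile
from pathlib import Path


def clone_datasource(biocase_folder, original, new_name):
    """Copy config/datasources/<original> to config/datasources/<new_name>."""
    datasources = Path(biocase_folder) / "config" / "datasources"
    source = datasources / original
    target = datasources / new_name
    if not source.is_dir():
        raise FileNotFoundError(f"No datasource folder found at {source}")
    if target.exists():
        raise FileExistsError(f"A datasource named {new_name} already exists")
    shutil.copytree(source, target)
    return target


if __name__ == "__main__":
    # Demonstration on a temporary folder tree with a placeholder file.
    root = Path(tempfile.mkdtemp())
    demo = root / "config" / "datasources" / "NameOfMyDataset"
    demo.mkdir(parents=True)
    (demo / "placeholder.txt").write_text("example content")
    print(clone_datasource(root, "NameOfMyDataset", "NameOfMyNewDataset"))
```

After copying, the connection settings and the table or view name of the new dataset still have to be adjusted in the configuration tool as described above.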

 

Fig. 8 Editing database structure