Important: what should I check in BioCASe if some records are lost during OpenUp harvesting?

1 post / 0 new
Franck
Franck's picture
Offline
Joined: 2011-04-01
Important: what should I check in BioCASe if some records are lost during OpenUp harvesting?

One of the most common problem occurring during the harvesting phase of a BioCASe provider is "the dropped record" issue: a dataset can be only partially harvested and some content is left out of the process.

It can be difficult to track the affected records (i.e; the rows in the source database that cannot be seen by the indexing service), especially with large datasets.

The identification of the harvested records, the testing and debugging can then be time consuming activities, involving tests and controls to be undertaken at the level of the provider, the helpdesks and the technical managers of the harvesting service.

 

This document summarizes the observations gathered about this issue and give some recommendations to identify and fix the source of the issue.

 

1 Some words about the harvesting

 

The efforts to fix this issue will often focus first on the mapping of scientific names, or on the way they are stored in the database.

Let's say some words about the harvesting process itself to better introduce to this problem.

The harvesting service used by OpenUp (the GBIF HIT ) indexes the records  by sorting them alphabetically on the scientific names.

 

The HIT thus:

1) first sends a scan request (search on one unique database field) on the provider to retrieve an inventory of the scientific names:

The output will be an alphabetically sorted list of names:

"

Abies alba Mill.

Abutilon theophrasti Medik.

Acantholion asyriacus Boiss. var. micrame Nabalek

Acer campestre L.

Aces platanoides L.

[...]

Fraxinus americana L.

Fraxinus angustipholia Vahl.

Fraxinus angustipholia subsp. danubialis Pouzar

Fumaria officialis L.

Gagea lutea (L.) Ker Gawl.

"

2) the HIT then sends a series of search requests (searches on all mapped fields) sorting batches of records alphabetically.

More precisely, the HIT A) sends BioCASe queries retrieving batches of records corresponding to "name intervals" where an interval can consist of 10 names for example. B) Each “name interval” is sent several times to retrieve consecutive batches of 100 records until no more records for that particular interval is found:

 

Example:

Query 1:

 “search records  above or equals to “Albie alba Mil.” and lower or equals to “Acetosa thyrsiflora (Finerh.) Á. Löve  et D. Löve” between record 0 and 99”

 

Query 2:

“search records  above or equals to “Albie alba Mil.” and lower or equals to “Acetosa thyrsiflora (Finerh.) Á. Löve  et D. Löve”  between record 100 and 199”

 

Query 3.a: if the query 2 had retrieved at least 1 value :

“search records  above or equals to “Albie alba Mil.” and lower or equals to “Acetosa thyrsiflora (Finerh.) Á. Löve  et D. Löve “between record 200 and 299”

 

.Or query 3.a: if  the query n° 1 retrieved 0 values:

“search records  above or equals to “Acetosella tenuifolia (Wallr.) Á. Löve ” and lower or equals to “chillea nobilis L.” between record 0 and 99”

 

3) the gathered ABCD records are stored on the server hosting the HIT, and optionally synchronized into a MySQL database.

 

4) Further activities can then be undertaken by the manager of the harvesting node (change the format of the data)

 

 

2 What should you check if you’re confronted with a “lost  record issue”?

 

If you are confronted with such an issue ensure that …

 

-1: your version of the BioCASe provider is above or equal to version 2.6.1 (the HIT may use the comparison operator IsNull which is not recongnised in earlier versions). Generally speaking, it is recommended to use the latest version of the BPS  (3.3 in January 2013), that features some improvements facilitating the mapping of database fields. See also: http://open-up.cybertaxonomy.africamuseum.be/forum_topic/upgrading_biocase_installation

 

-2: the columns containing scientific names do not contain misspelled scientific epithets (beginning with a capital letter)

 

E.G a record linked to:

“Accipiter Novaehollandiae (J. F. Gmelin, 1788)” 

 

(uppercase character in the specific epithet) could not be seen by the HIT when they are provided by version of  the BPS earlier than 3.3 , while...

“Accipiter novaehollandiae (J. F. Gmelin, 1788)” 

will be accepted.

This problem was related to a conflict between some DBMS and Python when sorting data alphabetically in a case-sensitive way. As of version 3.3, all sorting will be done by the DBMS and this problem should not occur anymore

 

-3: the encoding of your database is consistent: all the fields of the database must have the same encoding setting or codepage.

e.g  : do not mix  UTF-8 values with  other encodings such as latin1 [also known as 1252] with  cp1251 etc ).

Pay attention if your database merges imported tables from different origins! That may be a cause inconsistent encodings in the database

 

-4: check that the configuration of the connection to the database in BioCASe is the same as the encoding setting of the table (that can be different as the default encoding of the database).

 

Note 1: in some versions of MySQL (the same issue may appear with other database systems), the collation of a table can be different from the default collation of the main database if the SQL code to create the table has been imported from an external script.  In this case:

-Check that the SQL code creating the table doesn’t contain instructions specifying an encoding at field level which is different as the main setting of the database.

-Check the encoding setting of the file containing the import script (ideally it should be the same as with the database). The Windows Notepad or Gnome GEDIT programs allow the user to change the encoding of a file when the document is saved.

-avoid importing batches of data via web interfaces (like PHP MyAdmin).

 

Note 2 :By default, MySQL uses the latin1_swedish_ci collation (corresponding to cp1252 in the BioCASe configuration page  for database connection)

 

-5 the string value of the database must not contain XML, HTTP or HTML breaking characters (e.g:   %, $, unescaped “\” etc…) , especially in the scientific name field.

They can cause problems for building up name ranges in the HIT, and even break the response of the BPS to queries. Sometimes database interfaces will automatically escape theses characters, making them difficult to identify.

The best approach is to search for these characters explicitly in the WHERE clause of an SQL query

-6: Ideally, values like scientific names, localities; countries should not contain leading or trailing blank characters.

You can check this by running this SQL query:

“SELECT id, scientific_name FROM  myTable WHERE LEN(scientific_name)<>LEN(TRIM(scientific_name));”

And fix the issue by

“UPDATE  myTable SET scientific_name=TRIM(scientific_name);

Note: untrimmed values may also prevent the optional Graphical Query Tool of the BPS to work completely.

 

-7: As the harvesting service only works with HTTP GET requests, you should check that the settings of your your Internet server and/or your firewall allows URLs with long query strings (around 5000 characters), especially if your datasets contain long scientific names. This is especially sensible if you use a IIS 7.5+ server, that allows only 4096 bytes for the full URL (see also http://open-up.cybertaxonomy.africamuseum.be/forum_topic/issues_when_installing_biocase_provider_iis_server).

Apache offers a bigger limit for the maximum size of URLs (see http://httpd.apache.org/docs/2.2/mod/core.html#limitrequestline for its eventual modification).

Thanks to Falko and Odo for having identified and documented the issue.

-8: if the problem persists we recommend to change the debug level of the BPS and set it from “info” to “warn” or “error” .

Versions of the BPS prior to 3.3, can forward malformed characters  in the log message displayed at the end of the output XML

 

 

Notes:

 

A) What is the difference between a codepage (or character set) and a collation?

A character set is a list of symbols. If you compare ASCII to latin1, with latin1 you will be able to write all American words because latin1 contains all ASCII characters, which are sufficient to write any American word. On the contrary, with ASCII you will not be able to write all words of Western European specific languages, because for instance characters like 'À', 'ë', 'õ', 'Ñ' are missing. A character set has an influence on the update or insertion of new data (value with unrecognized symbols will be rejected)

 

Collation is about comparison between characters. It defines a set of rules to compare characters of a character set. For instance: tell the system to separate lowercase from uppercase or ignore the difference with accented characters and non-accented characters. It has no impact on the insertion and updating of data, but a strong impact on searching/matching.

In MySQL, collations are often related to one language (e.g. 'latin1_swedish_ci', 'latin1_german1_ci', etc.). When you order a select query, a word starting by 'ö' will be placed between two words starting by 'o' and 'p' in some languages (with some collations). But with another collation, this character may be placed completely at the end, which makes the resulting selection different.

(adapted from:

http://programmers.stackexchange.com/questions/95048/what-is-the-difference-between-collation-and-character-set)

 

Some additional remarks;

-most of the time one character set is associated to several collations (case sensitive or case insensitive, accent sensitive or not etc…)

- It is often possible to dynamically change the collation of comparison in a search, by using some options in the SQL query (but it will have an impact on the performance of the system)

-most DBMS initialize a new database on a default character set and a default collation (in MySQL: 'latin1_swedish_ci ). check this in the documentation. It is often possible to update the collation of a database in a relatively easy way, but not with the character set.

 

B) What is the difference between a codepage and a character set?

 

-the notion of codepage used by Oracle and Microsoft is very similar to the notion of character set.

But the range of a codepage can be different from the range of the character set it originates from. For instance Microsoft “Codepage 1252” is more or less similar to the ISO-8859-1, except that CP 1252 doubly defines character like “  and the ‘ in the unused parts of ISO 8859-1. Consequently a database defined in ISO-8859-1 will be compliant with a driver configured to connect a database in CP 1252, while the opposite is not true!