Description Language

The objective of the Search-WS Description Language is to allow searching of disparate on-line databases through a simple common API. In essence it describes how to:

  1. Create a URL that will be used to search the database.
  2. "Screen scrape" the response in order to get the records and any other useful information.

The current incarnation is very incomplete and is offered as-is in the hope that giving it a public viewing will elicit some helpful criticism.

Describing an SRU service

The sample description that follows is for the Copac SRU service (line numbers have been added.)

01 <sws xmlns="">
02  <op>
03   <request href=";version=1.1&amp;query={query}&amp;maximumRecords={maxItems}&amp;recordSchema=info:srw/schema/1/mods-v3.0">
04    <param name="query" semantics="cql"/>
05    <param name="maxItems" value="25"/>
06   </request>
07   <response>
08    <set name="numberOfItems">
09     <xpath select="/zs:searchRetrieveResponse/zs:numberOfRecords" string-value="yes"/>
10    </set>
11    <set name="items">
12     <xpath select="//mods"/>
13    </set>
14   </response>
15  </op>
16 </sws>

In the above example, line 03 specifies the URL that we are going to use to perform an HTTP request. Note that it contains a couple of "special" strings: {query} and {maxItems}. These are the placeholders for the variable parts of the URL that our API will populate for us.

You'll notice that line 05 describes a parameter called "maxItems" whose value is set to 25. This is a fairly pointless example, but it demonstrates a principle: {maxItems} in the request URL is substituted with the value 25. In a real-world example I would expect the API to provide a mechanism for programs to set this value; having it in the specification file allows a default value to be set.
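The placeholder substitution could be sketched in Python roughly as follows. The example.org URL, the fill_template helper and the rule that caller-supplied values override defaults are all assumptions made for illustration; none of them is taken from an existing Search-WS implementation.

```python
import re
from urllib.parse import quote

# Hypothetical request template; the real Copac URL is not reproduced here.
TEMPLATE = ("http://sru.example.org/sru?version=1.1"
            "&query={query}&maximumRecords={maxItems}")

# Defaults as they might be read from <param name="maxItems" value="25"/>.
DEFAULTS = {"maxItems": "25"}


def fill_template(template, values, defaults=None):
    """Replace each {name} placeholder with a URL-encoded value.

    Caller-supplied values take precedence over defaults from the description.
    """
    merged = dict(defaults or {})
    merged.update(values)

    def substitute(match):
        name = match.group(1)
        if name not in merged:
            raise KeyError(f"no value for placeholder {{{name}}}")
        return quote(str(merged[name]), safe="")

    return re.sub(r"\{(\w+)\}", substitute, template)


url = fill_template(TEMPLATE, {"query": 'dc.creator = "essery"'}, DEFAULTS)
print(url)
```

Raising an error for a placeholder with no value is one possible choice; an implementation might equally decide to drop the parameter from the URL.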

You'll also notice that line 04 describes a parameter called "query" with the semantics of CQL. The API has to provide magic to turn the user-supplied query into a valid CQL query, which once done is pasted into the URL at the {query} placeholder.
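As one possible reading of that "magic", the Python sketch below turns a dictionary of user-supplied fields into a CQL string. The mapping from field names (au, ti, kw) to CQL indexes and the decision to join clauses with "and" are assumptions made for this example; the description language does not define them.

```python
# Hypothetical mapping from "semantics" names to CQL indexes.
CQL_INDEX = {
    "au": "dc.creator",
    "ti": "dc.title",
    "kw": "cql.anywhere",
}


def quote_term(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    term = term.strip()
    # Drop one pair of surrounding quotes so '"midland railway"'
    # is not quoted twice.
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        term = term[1:-1]
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_cql(query):
    """Build a CQL query by and-ing one clause per supplied field."""
    clauses = []
    for field, term in query.items():
        if field not in CQL_INDEX:
            raise ValueError(f"unknown field: {field}")
        clauses.append(f"{CQL_INDEX[field]} = {quote_term(term)}")
    return " and ".join(clauses)


cql = to_cql({"au": "essery", "kw": '"midland railway"'})
print(cql)
```

The resulting string would still need URL-encoding before being pasted into the {query} placeholder.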

The <set> element on line 08 describes how to parse the number of records (or items) found from the search result. Line 09 gives an XPath that can be used to get the desired number. Alternatively, we could have used a regular expression, in which case we might have specified the following:

<set name="numberOfItems">
 <regexp regexp="&lt;zs:numberOfRecords&gt;(.*?)&lt;/"/>
</set>

Note that the regular expression looks messy because we have to quote the angle brackets in an XML attribute. In plain text the regular expression is:

 <zs:numberOfRecords>(.*?)</

If there is a bracketed expression within the regular expression, then it is assumed that the value wanted is the text captured by the first bracket (in Perl, the $1 variable). If there is no bracketed expression, then I am unsure whether the result should be a boolean indicating a match or the whole of the matched text. I think both versions would be useful, so maybe there should be another attribute to select one or the other.
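The two candidate behaviours can be compared in a short Python sketch. The whole_match flag is hypothetical; it stands in for whatever attribute might eventually select between the matched text and a boolean.

```python
import re


def apply_regexp(pattern, text, whole_match=True):
    """Apply a <regexp> rule to a response body.

    With a bracketed group, return the first group (Perl's $1).
    Without one, return the matched text, or a boolean when
    whole_match is False. A failed match gives None (or False).
    """
    compiled = re.compile(pattern)
    match = compiled.search(text)
    if compiled.groups > 0:
        return match.group(1) if match else None
    if whole_match:
        return match.group(0) if match else None
    return match is not None


body = "<zs:numberOfRecords>42</zs:numberOfRecords>"
count = apply_regexp(r"<zs:numberOfRecords>(.*?)</", body)

page = "10 hits; sorted by: date"
matched_text = apply_regexp(r"; sorted by:", page)
is_sorted = apply_regexp(r"; sorted by:", page, whole_match=False)
```

Here count holds the captured text as a string; converting it to a number is left to the caller.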

The <set> element at line 11 specifies how to get at the result items (or records if you prefer).
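To make the two XPath rules concrete, here is a sketch using Python's xml.etree.ElementTree, which supports only a subset of XPath (no absolute paths on an element, so the expressions are written relative to the root). The sample response is made up, and the namespace bindings are assumptions, since the description above does not declare any prefixes.

```python
import xml.etree.ElementTree as ET

# Assumed prefix bindings; the description does not declare them.
NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "mods": "http://www.loc.gov/mods/v3",
}

# A made-up, heavily trimmed SRU response used only to exercise the code.
SAMPLE = """\
<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:numberOfRecords>2</zs:numberOfRecords>
  <zs:records>
    <zs:record><zs:recordData>
      <mods xmlns="http://www.loc.gov/mods/v3"><titleInfo><title>First</title></titleInfo></mods>
    </zs:recordData></zs:record>
    <zs:record><zs:recordData>
      <mods xmlns="http://www.loc.gov/mods/v3"><titleInfo><title>Second</title></titleInfo></mods>
    </zs:recordData></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>
"""

root = ET.fromstring(SAMPLE)

# Counterpart of /zs:searchRetrieveResponse/zs:numberOfRecords
# with string-value="yes": take the element's text.
number_of_items = root.findtext("zs:numberOfRecords", namespaces=NS)

# Counterpart of //mods: every MODS record anywhere in the response.
items = root.findall(".//mods:mods", NS)
titles = [item.findtext("mods:titleInfo/mods:title", namespaces=NS)
          for item in items]
```

Note that the sketch binds a mods prefix explicitly, whereas the description's //mods is unprefixed; how the description language should handle namespaces is an open point.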

The Perl module

I've created a simple Perl module to test ideas as I've been developing the description language. The following example is a small Perl script that uses the Copac SRU description shown above.

01 use SWS;
03 my $sws = SWS->new (filename => 'copac-sru.sws');
05 my %query = (
06  'au' => 'essery',
07  'kw' => '"midland railway"',
08 );
10 my @items = $sws->search (%query);
11 print "found ", scalar (@items), " items\n";
12 print "item[0] = ", $items[0], "\n";
14 foreach my $item ($sws->var_names) {
15     print "$item = ", $sws->var ($item), "\n";
16 }

In the above example, line 03 creates a new instance of the SWS class and will read in the description contained in the copac-sru.sws file.
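What reading in a description involves might look like this in outline. The sketch is in Python rather than Perl and is not the SWS module's code; the read_request helper and the example.org href are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A cut-down description; the href is a placeholder, not the real Copac URL.
DESCRIPTION = """\
<sws>
 <op>
  <request href="http://sru.example.org/sru?query={query}&amp;maximumRecords={maxItems}">
   <param name="query" semantics="cql"/>
   <param name="maxItems" value="25"/>
  </request>
 </op>
</sws>
"""


def read_request(description_xml):
    """Return the request URL template, parameter defaults and semantics."""
    root = ET.fromstring(description_xml)
    request = root.find("./op/request")
    defaults = {}
    semantics = {}
    for param in request.findall("param"):
        name = param.get("name")
        if param.get("value") is not None:
            defaults[name] = param.get("value")
        if param.get("semantics") is not None:
            semantics[name] = param.get("semantics")
    return request.get("href"), defaults, semantics


href, defaults, semantics = read_request(DESCRIPTION)
```

The XML parser has already turned &amp;amp; back into a plain ampersand by the time href is returned, so the template can be filled in directly.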

Lines 05-08 define the query we are going to send to Copac (more about queries later...)

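Line 10 calls the SWS::search() function, which does all the work of the search: constructing the request URL from the description, performing the HTTP request, and parsing the response.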

The return value of SWS::search() is an array of the items found in the response.

Lines 14-16 print the list of "variables" parsed from the response. Our example description for Copac shown above specifies only one such variable, "numberOfItems".

Describing a traditional CGI-based service

The example description below describes the main Copac HTML form-based interface.

<sws xmlns="">
 <op>
  <request href="">
   <form action="/wzgw" method="get" name="Copac Quick Search">
    <param name="au" semantics="au"/>            <!-- General author field. -->
    <param name="ti" semantics="ti"/>            <!-- Title field. -->
    <param name="any" semantics="kw"/>           <!-- General keyword field. -->
    <param name="form" value="qs"/>
    <param name="fs" value="Search"/>
   </form>
  </request>
  <response>
   <set name="numberOfItems">
    <regexp regexp="&lt;span id=&quot;num_hits&quot;&gt;([0-9]+)&lt;"/>
   </set>
   <set name="sessionID">
    <regexp regexp="/wzgw\?id=([^&amp;]+)" occurance="1"/>
   </set>
   <set name="resultSetName">
    <regexp regexp="&amp;amp;rsn=([0-9]+)" occurance="1"/>
   </set>
   <set name="isSorted">
    <regexp regexp="; sorted by:"/>
   </set>
   <set name="sortOrder">
    <regexp regexp="; sorted by: ([^&lt;]+)"/>
   </set>
  </response>
  <request href="{resultSetName}&amp;format=XML+-+MODS&amp;id={sessionID}&amp;fs=Download+records"/>
  <response>
   <set name="items">
    <xpath select="/modsCollection/mods"/>
   </set>
  </response>
 </op>
</sws>

The notable differences between this example and the first example are:

  1. The way the first request URL is constructed. The above description attempts to describe the Copac search form.
  2. The result items are retrieved from a second HTTP request: the "items" variable is set from the response to that second request.

It is also interesting to note that the Copac session ID and result-set name are parsed from the initial response and then used to construct the URL for the second request.
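The chaining of the two requests can be sketched in Python as below. The fragment of HTML and the example.org URL template are made up; only the three regular expressions are taken from the description, written here in plain text rather than XML-escaped form.

```python
import re

# Made-up fragment of a first response page, for illustration only.
FIRST_RESPONSE = (
    '<span id="num_hits">37</span> records; sorted by: date'
    '<a href="/wzgw?id=abc123&amp;rsn=2&amp;format=brief">next</a>'
)

# Variable name -> regular expression from the description (unescaped).
RULES = {
    "numberOfItems": r'<span id="num_hits">([0-9]+)<',
    "sessionID": r"/wzgw\?id=([^&]+)",
    "resultSetName": r"&amp;rsn=([0-9]+)",
}

# Hypothetical template for the second request.
SECOND_REQUEST = "http://copac.example.org/wzgw?rsn={resultSetName}&id={sessionID}"


def parse_variables(rules, text):
    """Apply each rule to the response, keeping the first captured group."""
    variables = {}
    for name, pattern in rules.items():
        match = re.search(pattern, text)
        if match:
            variables[name] = match.group(1)
    return variables


variables = parse_variables(RULES, FIRST_RESPONSE)
# Placeholders are filled from the parsed variables; unused ones are ignored.
second_url = SECOND_REQUEST.format(**variables)
```

The resultSetName rule matches the literal text &amp;amp;rsn= because that is what appears in the HTML source of the page, which is why the description has to escape the ampersand twice.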

It is also worth mentioning that the "items" <set> element could specify a regular expression to parse out the records (rather than an XPath). If we wanted to use a regular expression, then something like the following would suffice.

 <regexp regexp="&lt;mods .*?&lt;/mods&gt;"/>
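A quick Python check of that expression against a made-up download shows one thing an implementation would need to decide: records usually span several lines, so "." has to be allowed to match newlines (re.DOTALL in Python, the /s modifier in Perl). Whether the description language should imply this is an assumption here.

```python
import re

# Made-up two-record download, for illustration only.
DOWNLOAD = """<modsCollection>
<mods version="3.0">
  <titleInfo><title>First</title></titleInfo>
</mods>
<mods version="3.0">
  <titleInfo><title>Second</title></titleInfo>
</mods>
</modsCollection>"""

# re.DOTALL lets "." match newlines, so a record may span several lines.
RECORD = re.compile(r"<mods .*?</mods>", re.DOTALL)

# Each list entry is the full text of one record.
items = RECORD.findall(DOWNLOAD)
```

The space after "<mods" in the pattern keeps it from matching the enclosing <modsCollection> element.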


I think we need arrays to collect records/items. I don't want to introduce different types of variable -- which I think would be overkill for this application. So I'm proposing that all variables are actually arrays and would be used as follows:

 <set name="x" value="1"/>
 <increment name="x"/>
 <append name="x" value="second item"/>
 <append name="x">
  <xpath select="//mods:mods"/>
 </append>

The <set> element discards any previous value(s) that the variable may hold. The <append> element appends new values to the end of the array. The values of variables are accessed in conditional expressions (see below). A bare variable name in a test refers to the first element of the array:

 <if test="x = 1"> ... </if>

To test the number of values in an array:

 <if test="x.size != 10"> ... </if>

The <increment> element increments the value of the first element of the array. An increment attribute can be given to increment by a value other than 1.
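The proposed semantics are small enough to sketch as a Python class. The class and method names are invented, and treating the first element as an integer when incrementing is an assumption; the proposal does not say what happens to non-numeric or missing values.

```python
class Variables:
    """Store in which every variable is a list, as proposed above."""

    def __init__(self):
        self._values = {}

    def set(self, name, *values):
        """<set>: discard any previous values and store the new ones."""
        self._values[name] = list(values)

    def append(self, name, *values):
        """<append>: add values to the end of the array."""
        self._values.setdefault(name, []).extend(values)

    def increment(self, name, by=1):
        """<increment>: add to the first element, read as an integer."""
        values = self._values[name]
        values[0] = int(values[0]) + by

    def first(self, name):
        """Value used when a test names the variable, e.g. x = 1."""
        return self._values[name][0]

    def size(self, name):
        """Value used for x.size."""
        return len(self._values.get(name, []))


v = Variables()
v.set("x", "1")                                    # <set name="x" value="1"/>
v.increment("x")                                   # <increment name="x"/>
v.append("x", "second item")                       # <append ... value="..."/>
v.append("x", "<mods>a</mods>", "<mods>b</mods>")  # <append> with an XPath
```

An <append> whose XPath matches several nodes is modelled here as appending each match as a separate element.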

Conditions and Iterations

The <if> and <while> elements are used for expressing conditional and repeated operations.

Conditional operations

 <if test="some_variable &gt; 0"> ... </if>

If an if ... then ... else ... endif construct is felt to be needed, then we should probably follow the XSLT example and introduce elements such as the following:

 <choose>
  <when test="some_variable &gt; 0"> ... </when>
  <when test="some_other_condition"> ... </when>
  <otherwise> ... </otherwise>
 </choose>

Repeated operations

For looping, a simple while loop could be done as follows.

 <while test="items.size != 10"> ... </while>
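How such test attributes might be evaluated is sketched below in Python. The grammar is an assumption: only tests of the form "name OP integer" or "name.size OP integer" are handled, a bare name is taken to mean the first array element, and a test on an unset variable is treated as false. The loop body is a stand-in for whatever operations the <while> element would contain.

```python
import operator
import re

OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

# name, optional ".size", comparison operator, integer literal
TEST = re.compile(r"^\s*(\w+)(\.size)?\s*(!=|>=|<=|=|>|<)\s*(-?\d+)\s*$")


def evaluate(test, variables):
    """Evaluate 'name OP integer' or 'name.size OP integer'."""
    match = TEST.match(test)
    if not match:
        raise ValueError(f"unsupported test: {test!r}")
    name, size, op, literal = match.groups()
    values = variables.get(name, [])
    if size:
        left = len(values)
    elif values:
        left = int(values[0])
    else:
        return False  # an unset variable never satisfies a value test
    return OPERATORS[op](left, int(literal))


# <while test="items.size != 10"> ... </while>, with a stand-in loop body.
variables = {"items": []}
passes = 0
while evaluate("items.size != 10", variables):
    variables["items"].append(f"record {len(variables['items']) + 1}")
    passes += 1
```

The XML parser unescapes &amp;gt; before the test string reaches the evaluator, so the sketch only ever sees the plain ">" character.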

What still needs working on

To be continued... (Ashley.)

DescriptionLanguage (last edited 2009-08-12 18:05:16 by localhost)