Tuesday, 6 March 2007

Zebra

If you'll pardon the pun, it is black and white now.

Ever since I started working in the library sector, I have wondered why many IMS/ILS products choose to store MARC within a BLOB rather than creating a relational or object-oriented schema for it. Was the schema too complex? It did not seem particularly complex to me in my head. After all, most RDBMSs provide all the datatypes we could need to store anything I have come across in MARC21.

Yesterday, I decided to go in search of the answer once again, but quickly got sidetracked. After a quick google for "marcxml relational schema" I came across exactly the question I had always been asking myself on the XML4Lib list, which, although I am subscribed to it, I must admit I have not had the time to read in the past year. The thread goes on to discuss why MARC is not the greatest of datatypes to map to a relational schema, thus partially answering my question above, but more than that, it goes on to discuss the concept of XML databases. Marc Cromme provided the timely post which would send me on my merry way to a world of enlightenment. Just last week I had sat through a demonstration of gsearch for Fedora, which struck me as impressive to say the least. I noticed that this demonstration used either Lucene or Zebra under the skin. Now Zebra was popping up for the second time in as many days. I was intrigued.

I went on to read the documentation for Zebra, and talked it over with a colleague who tried to talk me into Lucene rather than following the path of the Zebra, but I think this was mainly a language prejudice (he's a Java developer). I stuck to my guns anyway, and decided to see what I could get running last night.

One of my work tasks over the past couple of weeks has been to polish off the data export from a website which I had been working on over the past few years. I had taken its original Postgres relational schema and mapped it to MARC21 for import into our new IMS. With this little dataset to hand (~4500 records), I set about trying to make it searchable.

Firstly, I wanted to move from MARC21 ISO 2709 to MARCXML. A quick installation of the marc4j toolkit solved this in a matter of 400 milliseconds.
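For anyone curious what that conversion actually involves, here is a rough sketch of the idea in Python using only the standard library. This is not how marc4j does it, and the function name record_to_marcxml is my own invention for illustration. It assumes the record is UTF-8 encoded (older MARC21 data is often MARC-8, which this does not handle) and does no validation of the leader or directory.

```python
import xml.etree.ElementTree as ET

FIELD_END = b"\x1e"   # ISO 2709 field terminator
RECORD_END = b"\x1d"  # ISO 2709 record terminator
SUBFIELD = b"\x1f"    # ISO 2709 subfield delimiter
NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace


def record_to_marcxml(raw: bytes) -> ET.Element:
    """Convert one ISO 2709 record to a MARCXML <record> element.

    Assumption: character data is UTF-8, not MARC-8.
    """
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # base address of the data area
    directory = raw[24:base - 1]     # directory, minus its field terminator

    record = ET.Element(f"{{{NS}}}record")
    ET.SubElement(record, f"{{{NS}}}leader").text = leader

    # Each directory entry is 12 bytes: tag (3), length (4), start (5).
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        # The stored length includes the trailing field terminator.
        data = raw[base + start:base + start + length - 1]

        if tag.startswith("00"):
            # Control fields (001-009) have no indicators or subfields.
            field = ET.SubElement(record, f"{{{NS}}}controlfield", {"tag": tag})
            field.text = data.decode("utf-8")
        else:
            field = ET.SubElement(
                record,
                f"{{{NS}}}datafield",
                {"tag": tag, "ind1": chr(data[0]), "ind2": chr(data[1])},
            )
            # Text before the first delimiter is empty, so skip it.
            for chunk in data[2:].split(SUBFIELD)[1:]:
                if not chunk:
                    continue
                sub = ET.SubElement(
                    field, f"{{{NS}}}subfield", {"code": chr(chunk[0])}
                )
                sub.text = chunk[1:].decode("utf-8")
    return record
```

A whole export file could be handled by splitting its bytes on the record terminator, passing each non-empty piece (with the terminator put back) to record_to_marcxml, and wrapping the results in a collection element. For real data I would still reach for a proper toolkit.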

Now with my much more flexible MARCXML, it was time to find a new storage mechanism for it. How could flat files possibly be as quick as a decent relational structure, and how would you query them, I had always wondered. Zebra steps up to the plate and answers these questions one by one.

Installation could not have been simpler. On my Mac Pro, I did a build of the source in /usr/local/src (after downloading and building yaz as a prerequisite). The standard ./configure; make; make install and we're good to go. The online manual really gets you up to speed quickly, and after indexing the sample datasets I was firing up yaz and querying the database with Z39.50! Knowing that Zebra was written especially for libraries, I presumed that there would be some MARCXML examples lying around somewhere, and I was right. Within the distribution was a full MARCXML setup which I tested once as per the instructions, then made a copy of, replacing the source MARCXML files with my own from the website export. I dropped it into place, ran the indexer and lo and behold, there I was, making Z39.50 queries against my data only an hour after having sat down at the computer.
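To give a flavour of what those Z39.50 queries look like, here is a small Python sketch that builds them in PQF (Prefix Query Format), the notation yaz understands. The helper names pqf_term and pqf_and are made up for this post, and the index names in the table are just my own labels for a handful of standard Bib-1 "use" attribute numbers. Whether a given Zebra configuration actually answers on each of these attributes depends on how its indexes are set up, so treat the table as an assumption.

```python
# A few standard Bib-1 "use" attributes, keyed by my own labels.
BIB1_USE = {
    "title": 4,
    "isbn": 7,
    "subject": 21,
    "author": 1003,
    "any": 1016,
}


def pqf_term(index: str, term: str) -> str:
    """Build a single PQF clause such as: @attr 1=4 "zebra"."""
    if '"' in term or "\\" in term:
        # This sketch refuses these rather than guess at escaping rules.
        raise ValueError("quotes and backslashes are not handled here")
    return f'@attr 1={BIB1_USE[index]} "{term}"'


def pqf_and(*clauses: str) -> str:
    """Combine clauses with @and, which is a binary prefix operator."""
    if not clauses:
        raise ValueError("need at least one clause")
    query = clauses[0]
    for clause in clauses[1:]:
        query = f"@and {query} {clause}"
    return query


if __name__ == "__main__":
    print(pqf_and(pqf_term("author", "smith"), pqf_term("title", "zebra")))
```

The resulting string is what would follow the find command in a yaz-client session, once connected to the running Zebra server.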

To finish off the night, I added PHP5 to my system (it's a new toy and I hadn't installed it yet), upgraded Apache 2, then wrote a PHP/YAZ web-based search facility.

From data to web-based, fully indexed searching in an hour and a half. Colour me impressed. I can see myself focussing heavily on these newer technologies now.

1 comment:

chris said...

Zebra is really cool, have you seen zoomopac.liblime.com?
Koha now uses Zebra for its bibliographical search engine, which allows for all sorts of neat things: proximity, stemming, relevance ranking (where the field weightings are defined), and much more.