Tuesday, 2 March 2010

ALTO Viewer

I've just put some code on GitHub for a very simple ALTO Viewer, which I knocked up to debug my AbbyyFineReader to ALTO XML conversion scripts (awaiting confirmation from Abbyy before I release them as Open Source). It's a very simple page which scales the vertical and horizontal axes of the source image and plots transparent layers for the Strings, TextLines, TextBlocks and PrintSpace elements. I've not added the full feature set of ALTO yet, as we don't use it all ourselves, but it should be trivial to do so.
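As a rough illustration of the scaling step described above, here is a minimal sketch. It is not taken from the ALTO Viewer source: the function name `scaleAltoBoxes` and the shape of the returned array are made up for this example, and it assumes PHP 7 or later with the DOM extension. The only thing it relies on from ALTO itself is that layout elements carry `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes.

```php
<?php
// Hypothetical helper, not part of the ALTO Viewer code base.
// Reads every element of one ALTO type (String, TextLine, TextBlock,
// PrintSpace) and returns its box scaled to the displayed image size.
function scaleAltoBoxes(string $altoXml, string $element, float $hScale, float $vScale): array
{
    $doc = new DOMDocument();
    $doc->loadXML($altoXml);
    $xpath = new DOMXPath($doc);

    $boxes = [];
    // local-name() matches the element whichever ALTO namespace the file declares.
    foreach ($xpath->query("//*[local-name()='{$element}']") as $node) {
        $boxes[] = [
            'left'   => (int) round((float) $node->getAttribute('HPOS') * $hScale),
            'top'    => (int) round((float) $node->getAttribute('VPOS') * $vScale),
            'width'  => (int) round((float) $node->getAttribute('WIDTH') * $hScale),
            'height' => (int) round((float) $node->getAttribute('HEIGHT') * $vScale),
        ];
    }
    return $boxes;
}
```

Each returned box could then be drawn as an absolutely positioned, semi-transparent element over the scaled image, with one group of boxes per layer.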

Requires PHP5 and uses JQuery 1.4 (bundled)

Configuration is currently done in the index.php file, where the first 2 arguments to the constructor should point at your alto and image files.

$altoViewer = new AltoViewer(
    '/Users/dof/Development/Source/altoview/alto',
    $image, $vScale, $hScale);
Modify the first 2 arguments as required.

Use the buttons in the right-hand menu to toggle individual layers on and off as required. You can have any combination of layers selected at any given time.

Head on over to GitHub to grab the code and have a play.

Friday, 30 March 2007

Wales on the Web - Gone but not forgotten

Last night we turned off the Apache process for Wales on the Web. This was the project I was initially hired to work on here at The National Library of Wales, but the project ran its course and it was time for us to ingest the data into our Virtua system. I wrote MARC21 export scripts for the Wales on the Web Postgres relational DB schema and we loaded the data into Virtua. By this point it was pointless to continue running the old website alongside the data now available in our iPortal. So, last night I shut it off.

However, all that added value content from the old website has not been lost.

A few simple lines of PHP exported all of the static content pages held within the DB as iPortal template documents:

$sql = "SELECT * FROM lu_pages";

$records = pg_query($sql);

/* OK time for the logic. Fill a hash with the values */
while ($records_list = pg_fetch_assoc($records)) {
    /* English version of the page */
    $filecontents = "\n";
    $filecontents .= ""; /* iPortal template markup goes here (not shown) */
    $filecontents .= $records_list['page_content_en'];
    $filecontents .= "\n";
    $filename = "/tmp/guides/".$records_list['page_id']."-en.html";

    echo "writing: {$filename}";
    $res = fopen($filename, "w");
    fwrite($res, $filecontents);
    fclose($res);

    // file_put_contents($filename,$filecontents);

    /* Welsh version of the page */
    $filecontents = "\n";
    $filecontents .= ""; /* iPortal template markup goes here (not shown) */
    $filecontents .= $records_list['page_content_cy'];
    $filecontents .= "\n";
    $filename = "/tmp/guides/".$records_list['page_id']."-cy.html";
    // file_put_contents($filename,$filecontents);

    echo "writing: {$filename}";
    $res = fopen($filename, "w");
    fwrite($res, $filecontents);
    fclose($res);
}


And some mod_rewrite rules on the iPortal server allowed us to resolve all of the old Wales on the Web URLs that people have linked to over the years.

# mod_rewrite rules used to keep Wales on the Web URLs valid
RewriteEngine on

# Rewrite English language domain to English e-Resources skin
RewriteCond %{HTTP_HOST} www.walesontheweb.org$
RewriteRule ^$ /cgi-bin/gw/chameleon?lng=en&skin=eresources [L,R=301]

# Rewrite Welsh language domain to Welsh e-Resources skin
RewriteCond %{HTTP_HOST} www.cymruarywe.org$
RewriteRule ^$ /cgi-bin/gw/chameleon?lng=cy&skin=eresources [L,R=301]

# Add trailing slashes to URLs
RewriteCond %{REQUEST_URI} ^/[^\.]+[^/]$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L]

# Rewrite the CAYW Guides to use the vtls_link DIPO
RewriteRule ^cayw/guides/([^/\.]+)/([^/\.]+)/?$ /cgi-bin/gw/link/vtls_link.pl?file=http://cat.llgc.org.uk/gw/html/eresources/$1/static_content/guides/$2-$1.html&$3&skin=eresources&lng=$1 [L]

# Rewrite CAYW Dewey Decimal URLs to a Dewey Search on the e-Resources skin
RewriteRule ^cayw/index/([^/\.]+)/([^/\.]+)/([^/\.]+)/?$ /cgi-bin/gw/chameleon?lng=$1&search=FREEFORM&function=INITREQ&elementcount=1&t1=dc:$2.$3&skin=eresources

# Rewrite CAYW Dewey Decimal URLs to a Dewey Search on the e-Resources skin
RewriteRule ^cayw/index/([^/\.]+)/([^/\.]+)/?$ /cgi-bin/gw/chameleon?lng=$1&search=FREEFORM&function=INITREQ&elementcount=1&t1=dc:$2&skin=eresources
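
To see what the two Dewey rules produce for a given path, the same patterns can be tried out with PHP's `preg_replace`. This is only a sketch for checking the regular expressions away from the server: the function name `rewriteDeweyPath` is made up, the sample paths are invented, and it assumes PHP 7.1 or later. It does not model mod_rewrite flags or the trailing-slash redirect.

```php
<?php
// Hypothetical helper that mimics the two Dewey RewriteRules above.
// Returns the rewritten target, or null if neither pattern matches.
function rewriteDeweyPath(string $path): ?string
{
    $base = '/cgi-bin/gw/chameleon?lng=$1&search=FREEFORM&function=INITREQ&elementcount=1';
    $rules = [
        // language / class / subdivision, e.g. cayw/index/en/320/5
        '#^cayw/index/([^/\.]+)/([^/\.]+)/([^/\.]+)/?$#' => $base . '&t1=dc:$2.$3&skin=eresources',
        // language / class only, e.g. cayw/index/cy/900/
        '#^cayw/index/([^/\.]+)/([^/\.]+)/?$#' => $base . '&t1=dc:$2&skin=eresources',
    ];
    foreach ($rules as $pattern => $target) {
        if (preg_match($pattern, $path)) {
            return preg_replace($pattern, $target, $path);
        }
    }
    return null;
}
```

With these patterns, a path with three segments after `cayw/index/` ends up as a search on `dc:` followed by the class and subdivision joined by a dot, while a two-segment path searches on the class alone.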

So, now you can look at the Curriculum Cymraeg at its original address:


or look at a list of Dewey Decimal catalogued websites for National Assembly Government Members:


I hope these little titbits are useful to someone else out there at some point. They have served me well in these past weeks.

Tuesday, 6 March 2007


If you'll pardon the pun, it is black and white now.

Ever since I started working in the library sector, I have wondered why many IMS/ILS choose to store MARC within a BLOB rather than creating a relational or object-oriented schema for it. Was the schema too complex? It did not seem particularly complex to me in my head. After all, most RDBMSs provide all the datatypes we could need to store anything I have come across in MARC21.

Yesterday, I decided to go in search of the answer once again, but quickly got sidetracked. After a quick Google for "marcxml relational schema" I came across exactly the question I had always been asking myself, on the XML4Lib list, which although I am subscribed to, I must admit I have not had the time to read in the past year. The thread goes on to discuss why MARC is not the greatest of data formats to map to a relational schema, thus partially answering my question above, but more than that, it goes further to discuss the concept of XML databases. Marc Cromme provided the timely post which would send me on my merry way to a world of enlightenment. Just last week I had sat through a demonstration of gsearch for Fedora, which struck me as impressive to say the least. I noticed that this demonstration used either Lucene or Zebra under the skin. Now Zebra was popping up for the second time in as many days. I was intrigued.

I went on to read the documentation for Zebra, and talked it over with a colleague who tried to talk me into Lucene rather than follow the path of the Zebra, but I think this was mainly a language prejudice (he's a Java developer). I stuck to my guns anyway, and decided to see what I could get running last night.

One of my work tasks over the past couple of weeks has been to polish off the data export from a website which I had been working on over the past few years. I had taken its original Postgres relational schema and mapped it to MARC21 for import into our new IMS. With this little dataset to hand (~4500 records), I set about trying to make it searchable.

Firstly, I wanted to move from MARC21 ISO 2709 to MARCXML. A quick installation of the marc4j toolkit later, the conversion took a matter of 400 milliseconds.

Now with my much more flexible MARCXML, it was time to find a new storage mechanism for it. How could flat files possibly be as quick as a decent relational structure, and how would you query them, I had always wondered. Zebra steps up to the plate and answers these questions one by one.

Installation could not have been simpler. On my Mac Pro, I did a build of the source in /usr/local/src (after downloading and building yaz as a prerequisite). Standard ./configure; make; make install and we're good to go. The online manual really gets you up to speed quickly, and after having indexed the sample datasets I was firing up yaz and querying the database with Z39.50! Knowing that Zebra was written especially for libraries, I presumed that there would be some MARCXML examples lying around somewhere, and I was right. Within the distribution was a full MARCXML setup, which I tested once as per the instructions, then made a copy of, replacing the source MARCXML files with my own from the website export. I dropped it into place, ran the indexer and lo and behold, there I was, making Z39.50 queries against my data only an hour after having sat down at the computer.

To finish off the night, I added PHP5 to my system (it's a new toy and I hadn't installed it yet), upgraded Apache 2, then wrote a PHP/YAZ web-based search facility.
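
For anyone wanting to try something similar, a PHP/YAZ search can be sketched roughly as follows. This is not the script from that night: the function names `buildPqfQuery` and `searchZebra` are made up, the target string passed in is a placeholder for wherever your Zebra server listens, and the default use attribute 1016 (Bib-1 "Any") is only an assumption about what your Zebra configuration indexes. It assumes PHP 7 or later with the PECL yaz extension for the search itself.

```php
<?php
// Hypothetical sketch of a PHP/YAZ search against a Zebra server.

// Build a PQF (RPN) query for a single term on one Bib-1 use attribute
// (4 = Title, 1016 = Any).
function buildPqfQuery(string $term, int $useAttribute = 1016): string
{
    // Drop quotes and backslashes rather than trying to escape them.
    $clean = trim(str_replace(['"', '\\'], '', $term));
    return sprintf('@attr 1=%d "%s"', $useAttribute, $clean);
}

// Run the query and return up to $limit records as XML strings.
// $target is a placeholder such as 'localhost:9999/Default'.
function searchZebra(string $target, string $term, int $limit = 10): array
{
    if (!function_exists('yaz_connect')) {
        throw new RuntimeException('The PECL yaz extension is not loaded');
    }
    $conn = yaz_connect($target);
    yaz_syntax($conn, 'xml');          // ask the server for XML records
    yaz_range($conn, 1, $limit);       // fetch the first $limit records
    yaz_search($conn, 'rpn', buildPqfQuery($term));
    yaz_wait();                        // block until the search completes

    if (yaz_error($conn) !== '') {
        throw new RuntimeException(yaz_error($conn));
    }

    $records = [];
    $hits = min(yaz_hits($conn), $limit);
    for ($pos = 1; $pos <= $hits; $pos++) {
        $records[] = yaz_record($conn, $pos, 'xml');
    }
    yaz_close($conn);
    return $records;
}
```

Keeping the query builder separate from the network calls means the query strings can be checked without a running Zebra server.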

From data to web based fully indexed searching in an hour and a half. Colour me impressed. I can see myself focussing heavily on these newer technologies now.

Saturday, 3 March 2007

PHP MARC Library goes Super-Alpha

Aside from the fact that I love the idea of a Super-Alpha software release, I am excited to announce that Dan Scott has released his attempt to PEARify the existing PHP-MARC package, which was developed by Christoffer Landtman at Realnode Ltd. as part of the Emilda ILS project. Whereas PHP-MARC was based around the existing CPAN MARC_Record Perl library, File_MARC is a rewrite utilising only the core algorithms from MARC_Record and PHP-MARC. At a glance, the code looks cleaner, and the class structure appeals from the outset. As a keen user of PHP-MARC, I look forward to testing this Super-Alpha code on some of my recent data migration scripts.