Cataloging Spatial Assets – A Metadata Script Approach

Years ago I started writing some scripts with the end goal of doing a wholesale migration of my data into open source spatial databases. I have since left that job and didn’t have much need for the scripts, but recently picked them back up again. I never made it to the migration part; instead, I decided to focus on cataloging my GIS data so I could build other apps that use the catalog for looking things up, creating overview maps, and, ultimately, fueling migration scripts. The project’s historical name is somewhat irrelevant now; it is the Catalog portion of a broader plan I had called the Network Mapping Engine (NME).

My current efforts are available here: https://github.com/spatialguru/NME/tree/master/nme/cat

Want to try it? Ensure you have the prerequisite xmlgen and elementtree_pretty libs (included on the site) as well as a recent GDAL/OGR install with Python bindings. It’s reported to work well with the OSGeo4W distribution, though I have recently tested it only on Mac OS X 10.6.

The main feature is the gdalogr_catalogue.py script, but the above dependencies will also have to be available on your PYTHONPATH.
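If Python can’t find them, add the folder containing those libs to your path first, for example (the path below is just a placeholder for wherever you put xmlgen and elementtree_pretty):

  export PYTHONPATH=$PYTHONPATH:/path/to/libs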

Quick Start

The simplest command line usage requires only one parameter: the folder that you want to catalog. Sub-folders are scanned recursively as well.

  python gdalogr_catalogue.py -d /tmp/data

The result is basic XML (more about the details of output in a moment):

<?xml version="1.0" ?>
<DataCatalogue>
  <CatalogueProcess>
    <SearchPath>
      /tmp/data
    </SearchPath>
    <LaunchPath>
      /private/tmp/data
    </LaunchPath>
    <UserHome>
      /Users/tyler
    </UserHome>
    <IgnoredStrings>
      ['.svn', '.shx', '.dbf']
    </IgnoredStrings>
    <DirCount>
      2
    </DirCount>
    <FileCount>
      16
    </FileCount>
    <Timestamp>
      Mon Mar 19 14:42:35 2012
    </Timestamp>
 .... huge snip ....

Redirect the output into a new file or use the save-to-file option:

  python gdalogr_catalogue.py -d /tmp/data -f output.xml
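Plain shell redirection works just as well:

  python gdalogr_catalogue.py -d /tmp/data > output.xml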

About the XML

Most metadata standards capture fairly high-level information about datasets, but the focus of this project is to grab data at as low a level as possible and make it easily consumable by other applications. For example, consider how valuable hierarchical information about all the datasets on your system, or across your enterprise, would be. A map viewing application could read the catalog instead of repeatedly scanning folders and opening files. It could even fuel part of a geospatial search engine!
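To make that concrete, here is a minimal sketch of what a consumer could look like, using Python’s standard library and only element names that appear in the sample output above (treat it as a starting point, not part of the toolset):

  # Minimal sketch of a catalog consumer -- element names are taken from
  # the sample output shown above.
  import xml.etree.ElementTree as ET

  tree = ET.parse("output.xml")
  root = tree.getroot()  # the <DataCatalogue> element

  # Report how the catalog was built.
  process = root.find("CatalogueProcess")
  print("Scanned: %s" % process.findtext("SearchPath").strip())
  print("Files:   %s" % process.findtext("FileCount").strip())

  # Count datasets without re-scanning folders or opening files.
  print("Vector datasets: %d" % len(root.findall("VectorData")))
  print("Raster datasets: %d" % len(root.findall("RasterData")))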

The original work created CSV text files, then produced SQL statements, but it has now finally settled on producing XML. There was no existing standard for this type of metadata, so I had to create one. Because it is built on top of GDAL/OGR, much of it parallels the GDAL/OGR API, but additional information is also captured about the cataloguing process and file statistics. It’s still changing and growing, but no huge changes are expected.

There are three primary sections in the root DataCatalogue XML element, plus a couple of sub-elements (see the skeleton after this list):

  1. CatalogueProcess – captures details of how the script was run, the operating system, timestamp, etc.
  2. VectorData – vector dataset details, format, location
    1. FileStats – information about the file or path, including the user that owns it, the modification date and a unique checksum
    2. VectorLayer – for each layer, spatial data details, format, extent
  3. RasterData – raster dataset details, format, location, dimensions
    1. RasterBand – for each band, data details, format, min/max values
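Put together, the nesting looks roughly like this skeleton (element content omitted):

  <DataCatalogue>
    <CatalogueProcess>...</CatalogueProcess>
    <VectorData>
      <FileStats>...</FileStats>
      <VectorLayer>...</VectorLayer>
    </VectorData>
    <RasterData>
      <RasterBand>...</RasterBand>
    </RasterData>
  </DataCatalogue>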

I won’t review all the levels of elements here; instead, have a look at sample_output.xml in your browser. Along with catalogue.xsl and css/catalogue.css, it renders to show some basic example data.

To get similar rendering for your own output, copy the header/declaration details from sample_output.xml so that your file points to the css and xsl files.
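Assuming the sample uses the standard xml-stylesheet processing instruction, the top of your file would look something like this (the href has to match wherever you keep catalogue.xsl, which in turn can reference the css):

  <?xml version="1.0" ?>
  <?xml-stylesheet type="text/xsl" href="catalogue.xsl"?>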


Future?

One of my goals is to have this XML output plug into other applications as the basis for an enterprise data catalog environment. My first step is to add the ability to generate a MapServer map file from the metadata, which would also let the script produce overview maps of the dataset extents. Another idea is to create a QGIS plugin for visualising and browsing through the datasets, while optionally loading a layer into QGIS. Have a better idea? Let me know and we can discuss it. Maybe I need to come up with a contest, hmm…

Updating the XML is another challenge on the horizon. I built in an md5 hash-based checksum so the script can check for file changes without even having to open the file with GDAL/OGR, but there are no routines yet to update the XML in place.
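The check itself is cheap; here is a sketch of the idea in Python (not the script’s actual code):

  # Sketch of md5-based change detection: hash the raw bytes on disk and
  # compare against the checksum stored in the catalog XML.
  import hashlib

  def file_md5(path, chunk_size=65536):
      """Return the md5 hex digest of a file, reading it in chunks."""
      md5 = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              md5.update(chunk)
      return md5.hexdigest()

  def has_changed(path, stored_checksum):
      # No GDAL/OGR needed -- a plain filesystem read is enough.
      return file_md5(path) != stored_checksum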

Obviously, there is more to come, as more people try it and let me know how they get on. Drop me a comment or email and I’d love to help you get it running if it fails in your environment!

