A Case For Low-Level Metadata Collection
In my last post I mentioned the use of a script for mass cataloguing of dataset assets and the production of an XML document describing them. Mateusz justifiably asked for more of the purpose behind such a project, so here is some of the background that brought me to this point.
At the simplest level, when you first start collecting geospatial data it is easy to lay out a tiny organisational structure - a few folders here and there for file-based stuff, a naming convention for tables in a database, etc. As your collection grows and you begin to have a lot of geospatial data, you can quickly move into the realm of ignorance about your data. This gets worse when you start to do analysis projects and make copies of (or modified versions of) data for a particular need. Data can breed like rabbits once you've done a few of these projects.
At my first two geo-jobs, I inherited a ton of geospatial data, or at least it seemed like it at the time. This collection was split up into 400+ mapsheets, with over a dozen datasets in each one, all at a scale of 1:20,000. (Not to mention the associated tabular data that accompanied this collection.) Soon I was deriving new products from these data and redistributing them for various R&D projects or using them for map production purposes.
When we were eventually given an internal tool that could catalogue our datasets, I was excited. Unfortunately it used a GUI that only let you enter metadata one layer at a time, which would not help with our legacy collection. I knew enough SQL to be able to inject the information into the metadata catalogue, but even then, how could I harvest the metadata of 20,000+ layers in the first place?
So, to make a long story even longer, that is when I started trying to script the collection of this metadata. Once I started, I couldn't stop! Not only did I want the name and location of files, but I wanted creation dates, feature counts and types, geographic extent of features, etc. All things that I can now easily report on.
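As a rough illustration of that kind of harvesting script (not the actual script from my last post), here is a minimal sketch using only the Python standard library. The function name `catalogue`, the XML element names and the list of file extensions are all assumptions made for this example. It records the modification time rather than a true creation date, and it leaves out feature counts, types and extents, which would need a geospatial library to read.

```python
import os
import time
import xml.etree.ElementTree as ET


def catalogue(root_dir, extensions=(".shp", ".tif", ".e00")):
    """Walk root_dir and return an XML element describing each matching file.

    The extensions and element names are illustrative assumptions only.
    """
    root = ET.Element("catalogue", source=os.path.abspath(root_dir))
    for folder, _subdirs, files in os.walk(root_dir):
        for name in sorted(files):
            # Skip anything that does not look like a dataset file.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(folder, name)
            info = os.stat(path)
            entry = ET.SubElement(root, "dataset", name=name)
            ET.SubElement(entry, "location").text = os.path.abspath(path)
            ET.SubElement(entry, "size_bytes").text = str(info.st_size)
            # Last modification time, formatted as an ISO-style timestamp.
            ET.SubElement(entry, "modified").text = time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.localtime(info.st_mtime)
            )
    return root
```

Calling `ET.tostring(catalogue("/some/data/folder"), encoding="unicode")` would then give the XML document as a string, where the folder path is whatever collection you want to index.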
This early work was simply for finding and counting all the data. Now after a few years of building or using web applications, I see another value for having this information. I see it serving as an intermediate layer between data and applications - something that could be used to help automate the creation of services and to allow new users to find their data easily. I expect to see this approach used as the core for migrating enterprise collections into open source infrastructures.
Here are two examples. My early plan was to collect the metadata and then run batch imports of these data into PostgreSQL/PostGIS, in effect migrating from our ArcInfo file-based tools into something more openly accessible to various higher level applications. I never got this far, but I still believe it could be a valuable approach for mass migration - or even prototyping the migration process - from proprietary to open source.
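Since I never built this, the following is only a sketch of how a catalogue could drive a batch import. It assumes the catalogue has already been reduced to (file path, table name) pairs and that GDAL's `ogr2ogr` utility does the loading. The function name `import_commands` and the database name `gisdata` are hypothetical. The code only builds the command lines so they can be reviewed first; it does not run them.

```python
import shlex


def import_commands(datasets, connection="PG:dbname=gisdata"):
    """Return one ogr2ogr argument list per (path, table_name) pair.

    The connection string is a placeholder; adjust it for a real database.
    """
    commands = []
    for path, table in datasets:
        commands.append([
            "ogr2ogr",
            "-f", "PostgreSQL",  # output format driver
            connection,          # destination datasource
            path,                # source dataset
            "-nln", table,       # name for the new table
        ])
    return commands


def as_shell_lines(commands):
    """Format the argument lists as quoted shell command lines for review."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
```

Printing the result of `as_shell_lines(import_commands(pairs))` gives a script you can inspect, or trim down to a handful of layers to prototype the migration before committing to the whole collection.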
The other example I think of is web-based mapping services, WMS for example. I crafted many MapServer configuration files to serve up my information. Using tile indexes helped me do a lot of the work effectively, without having to do much with my data. After a while I started to wish I could automatically generate these configuration files and, in effect, stand up a really simple set of OWS for our data.
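To make the idea concrete, here is a small sketch of generating MapServer LAYER blocks from catalogue records. It assumes each record is a dictionary with `name`, `geometry` and `path` keys, which is a made-up structure for this example rather than the format my script produces. It emits only the bare minimum for a layer; a working mapfile would also need the surrounding MAP block, projection and styling.

```python
# Minimal LAYER block; the record keys below are illustrative assumptions.
LAYER_TEMPLATE = """LAYER
  NAME "{name}"
  TYPE {geometry}
  STATUS ON
  DATA "{path}"
END"""


def mapfile_layers(records):
    """Return LAYER blocks for a list of catalogue records, one per dataset."""
    return "\n\n".join(LAYER_TEMPLATE.format(**record) for record in records)


if __name__ == "__main__":
    sample = [
        {"name": "roads", "geometry": "LINE", "path": "/data/roads.shp"},
        {"name": "lakes", "geometry": "POLYGON", "path": "/data/lakes.shp"},
    ]
    print(mapfile_layers(sample))
```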
Now I'm not really managing any spatial data, but I still see some need and a good future for this approach - especially as datasets multiply, as they will continue to do. If there were a way to agree on a standard for cataloguing (maybe I need a different term?) these data, then various applications could make use of them.
Here are a few more examples of how I see it could be used:
- Quick access for browsing and searching for files (could be used against databases too of course) without having to touch the data or file system
- Building UIs for grouping and aggregating datasets of common type
- Quickly reviewing bounding boxes of your entire spatial collection
- With periodic snapshots of the index you could also do some sort of checksum or timestamp comparison to see what has changed
- Importing into a true metadata catalogue as a basis for populating higher level information manually
- Importing into a W*S framework for easy previewing or sharing of data with others in a network
- etc... See any other uses?
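The snapshot comparison idea from the list above could look something like the following. This is a sketch under my own assumptions: the function names `snapshot` and `compare` are invented for the example, and it hashes file contents with SHA-256, although a timestamp comparison would be cheaper for very large collections.

```python
import hashlib
import os


def snapshot(root_dir):
    """Map each file path (relative to root_dir) to the SHA-256 of its contents."""
    result = {}
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large rasters do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            result[os.path.relpath(path, root_dir)] = digest.hexdigest()
    return result


def compare(old, new):
    """Return sorted lists of added, removed and changed paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Saving the output of `snapshot` periodically (as JSON, say) and running `compare` on two of them would tell you which datasets appeared, disappeared or were modified in between.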
This is a long-winded explanation but I hope it helps set the context a bit better.
Tyler Mitchell
1-March-2008

Will Cadell:
The benefits of this kind of script are really numerous. I am a consultant and I know that I will live and die by the quality of data provided to me to analyse. So I need a way to provide my various clients with a data quality report which they can sign off. This allows me to continue wading through data quality issues in the full knowledge that anything I do is done in the context of a mutual understanding between consultant and client. Otherwise I would be doing the entire project, fixing stuff, possibly even changing things, maybe assuming too much, and then have the client get annoyed when they see a certain result they might not have expected. This script is a first step toward that. It is definitely a useful GIS Project Management tool.
Mateusz Loskot:
The catalogue could also be used to generate workspaces for QGIS, uDIG and other desktop GIS applications.
The format seems to be fairly complete so I have only a few comments: