First in a short series of small thoughts coming out of UKGovCamp 2011.
One of the key points emerging for me is just how much “data” is tied to the groups or people using it - not just the content, but its structure, the tools used to manage it, the background of the data, the assumptions behind it, and so on. This comes up all over the place - standardised, central taxonomies often fall out of favour for being jacks-of-all-trades, useful to none. File formats are a direct result of people wanting data as an easy-to-edit spreadsheet, an easy-to-email PDF, or an easy-to-parse data file.
More fundamentally, even the understanding of what a dataset means becomes embedded in the structure of the data. If something is being measured, what defines that thing? What assumptions are inherent to the way data is measured? Is a van a form of car? More importantly, why are these definitions in place? Reasons are forgotten long before hard drives expire.
If you assume that all data is “relative” - i.e. a combination of the data itself and the people viewing it - what does this mean for linking it? Do we need more effort on translation? Or do we need more effort on fuzzy inferences between metadata, rather than direct mapping? (I suspect the Semantic Web rears its head here, but to me it always feels like a simpler solution is waiting to be seized on.)
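To make the “fuzzy inference between metadata” idea concrete, here is a minimal sketch in Python - purely illustrative, with invented field names and an arbitrary similarity threshold - of suggesting likely correspondences between two datasets’ field names instead of hard-coding a single direct mapping:

```python
# Toy "fuzzy inference" between two datasets' metadata: suggest likely
# field correspondences rather than fixing one central mapping.
# Field names and the 0.5 threshold are made up for the example.
from difflib import SequenceMatcher

ours = ["vehicle_type", "registration_date", "owner_postcode"]
theirs = ["VehicleCategory", "DateRegistered", "PostcodeOfKeeper"]

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity, ignoring case and underscores."""
    norm = lambda s: s.lower().replace("_", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def suggest_links(left, right, threshold=0.5):
    """For each of our fields, propose the best-matching foreign field,
    but only when the match clears the (arbitrary) threshold."""
    links = {}
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links[a] = best
    return links

print(suggest_links(ours, theirs))
```

The point isn’t the string-matching (real systems would lean on richer context than names), but the shape of the answer: a set of ranked, uncertain suggestions between contexts, rather than one authoritative mapping imposed from above.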
Knowing how and where to ask questions about a dataset is a huge part of this - metadata about origins and background is vital; questions help build up an idea of how to fit a dataset into your own world view, your own data model, or your own database. Perhaps development needs to focus on making these links between contexts as transparent as possible, rather than fixing a single, over-arching context in place to fit all.