[TagCommons-WG] mechanisms for sharing tag data
Tom Gruber
onto at tomgruber.org
Wed Mar 7 07:04:31 PST 2007
Regarding the conversation among Nitin, Richard, and Marja about the
difference between database-level specifications and ontologies:
An analogy might help. Ontologies are to database schemas as database
logical designs are to physical designs (denormalization, precision choices,
etc). In other words, ontologies are an abstraction away from the details
toward the conceptual.
Both ontologies and database modeling are formal, with standard languages and
open source tools. Both are amenable to modeling methodologies and formal
languages. Any model describable at the database level can be described at
the ontology level, so there is no loss of power in specifying at the
ontological level. For example, if you want to model the world in terms of
a traditional Model Driven Architecture (MDA) and UML, you can easily do so,
because UML is less expressive than the languages used for ontology
definitions.
http://www.sfu.ca/~dgasevic/Tutorials/ISWC2005/
But why bother with the ontology level?
The point of going more abstract is exactly because you don't want to have
to drink someone else's koolaid or store your data in someone else's format.
(Committing to an ontology definitely does NOT require that data ever be
*stored* in RDF tuples -- just as buying in to SQL doesn't require that you
use a particular table management technique.)
By describing data in a common ontology, one is not agreeing to share a
common data model but rather to have a common language with which to capture
commonalities and differences among data sources. For instance, at Tag Camp
there was talk about having tags point to tags. Why not? It's
computationally easy to do, and you can use the tag-to-tag relation in all
kinds of ways (clustering, synonymy, etc). But just saying that the
relation is many-to-many does not tell you what it means in a way that can
say whether the tag-to-tag tuples are comparable across any two systems.
That is because just describing the syntactic data integrity constraints
does not tell you enough about the semantic commitment behind using that
relation. On the other hand, you could agree that there are a few ways to
talk about tag-to-tag relationships, such as an explicit relation among
tag labels such as "isSameTagLabel". Then you explicitly say that in system
A, isSameTagLabel is case- and space-sensitive string matching, and in system
B it is culture-specific, case-insensitive, phrase-canonicalizing matching.
Then, for instance, if you were comparing the frequency of tagging with some
string on the two systems, you would know that a match in system A implies a
match in system B, but not vice versa. Or you could have a simple identity
matcher that knows how to transform queries or results when talking to
system A, so it would be consistent with system B.
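To make that concrete, here is a minimal sketch in Python. The function
names and the exact canonicalization rules are hypothetical stand-ins for
whatever systems A and B actually commit to; the point is only that, once
each system's semantics is stated explicitly, the A-implies-B relationship
between the two matchers becomes checkable:

```python
import re
import unicodedata

def is_same_tag_label_a(x: str, y: str) -> bool:
    """System A (sketch): case- and space-sensitive exact string match."""
    return x == y

def is_same_tag_label_b(x: str, y: str) -> bool:
    """System B (sketch): case-insensitive, whitespace-canonicalizing match.

    Unicode NFKC normalization plus casefold stands in here for whatever
    'culture-specific' matching the real system defines.
    """
    def canon(s: str) -> str:
        s = unicodedata.normalize("NFKC", s)
        s = re.sub(r"\s+", " ", s.strip())  # collapse runs of whitespace
        return s.casefold()
    return canon(x) == canon(y)

# A match in system A implies a match in system B...
assert is_same_tag_label_a("semweb", "semweb")
assert is_same_tag_label_b("semweb", "semweb")

# ...but not vice versa: B matches labels that A distinguishes.
assert not is_same_tag_label_a("SemWeb", "semweb")
assert is_same_tag_label_b("SemWeb", "semweb")
```

Given definitions like these, a frequency comparison across the two systems
knows which direction the counts are comparable in, and an identity matcher
can canonicalize queries going to system A so the results line up with
system B.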
<soapbox>
Folks, this is not a new thing, and the tags problem is really quite a
trivial case that we ought to be able to come to some agreement on.
Compare, for instance, the problem of data integration among all the world's
geoscience data. There are thousands of data sources in as many formats and
schemas, and data sets which are quite large scale and complex (Google Earth
is a tiny subset). After decades of database-level standards -- even
massive controlled vocabularies -- this community is turning to ontologies
as an enabling technology for data integration across these disparate
sources. For example, a bunch of work under the organization called GEON
(http://www.geongrid.org/about.html) is using ontologies for *describing the
data* from various sources so tools can reason about the relationships and
how to do integrated query and compute services over them. Based on the
*semantic* descriptions of the data (much more than cardinality and type),
there are systems that can map from scientific hypotheses to operational
queries from databases of geography, geology, climate, and remote sensing
data on the biosphere (if you care, look at work by geoinformatics
researchers Krishna Sinha, Boyan Brodaric, and Mark Gahegan). The data are
not only different in type, but are different in the modeling assumptions,
resolutions, and even notions of completeness across country and state
borders. There are similar activities for ontology-based, intelligent data
integration in fields such as biomedical data (NIH). So if they can do it
for massively complex data sets, we can do it for tag data. Besides, we
are us and not them. :-)
</soapbox>
To me, the Catch-22 for ontology-based sharing is the assumption that you
need to get other people to do more work to commit to your ontology, which
will then give it a network effect of value to everyone. I think we can
break free of this paradox in simple cases such as tag data by
bootstrapping multiple levels at once. In an ideal world, we can present a
stack of ways to buy in, from the top down:
- the abstract conceptualization (what is a tag assertion, etc)
- its specification in a standard ontology modeling language (eg,
OWL)
- its data access over something like SPARQL
- reference implementations in high performance database schemas (Nitin?)
- examples of wrappers for important tag sources (flickr, delicious, etc)
- examples of natively compliant tag sources (revyu, etc)
- reference implementations of applications that reason over multiple tag
sources (identity matching, clustering & visualization, tag-based search)
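As an illustration of the top of that stack, the abstract conceptualization
of a tag assertion might boil down to something like the following Python
sketch. The class and field names are hypothetical, not a proposed spec; a
real ontology would pin down what each field means:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagAssertion:
    """A tagging event: some tagger applies a tag label to a resource.

    Fields are illustrative placeholders for the conceptualization only.
    """
    tagger: str    # who tagged (e.g., a user identifier or URI)
    label: str     # the tag string as entered
    resource: str  # what was tagged (e.g., a URL)
    source: str    # which system the assertion came from

# Assertions from two systems can be pooled and reasoned over without the
# systems sharing a data model or storage format:
pool = [
    TagAssertion("alice", "semweb", "http://example.org/p1", "systemA"),
    TagAssertion("bob", "SemWeb", "http://example.org/p1", "systemB"),
]
labels = {a.label for a in pool}  # {"semweb", "SemWeb"}
```

Each level of the stack then refines this: the OWL spec formalizes it, the
wrappers map flickr or delicious records into it, and the applications at
the bottom reason over pooled assertions like these.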
Within the ontology level, we can easily deliver the specification in
multiple formats, including UML and ER diagrams (I think these can be
generated by tools from OWL input), and even example database designs. The
important thing is to get something that can be useful to lots of different
stakeholders for the problems they currently face, without assuming they
share the same tools, data storage, or reasoning services.
--tom