In my previous article, I wrote about what the Semantic Web is without actually showing how the information is transmitted and stored. This article assumes that you are a technically savvy person who would like to know how to design applications for the Semantic Web.
When you design a system in an object-oriented way, you think in terms of entities within a domain model. These business objects are usually mapped to database rows, which are stored in tables. These database tables have a rigid schema, which is complicated to update and maintain.
What happens if I want to create a new version of the system whose business entities store different information? I would have to write a migration script, which makes the database unavailable for a certain time, and for most large databases the downtime is long enough to make the migration unaffordable.
Message-oriented middleware (MOM) databases take an interesting approach to this issue. Each business object is part of a message that is exchanged between services. MOM databases store the entire message in one special field (column) within a data table (usually of XML type), which lets the schema of the message entities evolve while the DB schema stays the same (there is no DB downtime). Each entity is carefully designed to use simple data types and to have a unique ID (to be uniquely identifiable). Having a standard representation eases the migration path when the schema of the business objects is partially modified (an old application only needs its mapping information updated, and there is no need to update the DB schema).
If you design the business data storage with a design similar to that of MOMs, your entities will look like the following:
- Every entity will have a unique identifier that becomes its identity within the system (and probably with external systems), and every entity will have a type.
- Every entity will have its data and relationships drawn as a directed graph.
- Every piece of data will have its type information embedded (so a person’s photo can be interpreted as a bitmap).
Following these principles will lead you to entities that are easy to represent and transmit in standard formats such as RDF or plain XML. Designing entities with unique identifiers also helps synchronize information that is related to a single entity.
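The three principles above can be sketched as a small data structure. This is only an illustration, not a prescribed design: the class names `Entity`, `Literal` and `Ref`, the `employer` relationship, the company URI and the datatype labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Literal:
    """A value that carries its own type information."""
    value: object
    datatype: str  # illustrative labels such as "string" or "image/bmp"


@dataclass(frozen=True)
class Ref:
    """A directed edge pointing at another entity by its unique identifier."""
    target_id: str


@dataclass
class Entity:
    id: str    # unique identifier inside (and possibly outside) the system
    type: str  # every entity is of a type
    edges: Dict[str, List[Union[Literal, Ref]]] = field(default_factory=dict)

    def add(self, name: str, value: Union[Literal, Ref]) -> None:
        """Attach a piece of data or a relationship under the given name."""
        self.edges.setdefault(name, []).append(value)


john = Entity(id="http://people/johnDoe", type="Person")
john.add("name", Literal("John Doe", "string"))
john.add("photo", Literal(b"BM", "image/bmp"))      # placeholder bytes, not a real bitmap
john.add("employer", Ref("http://companies/acme"))  # hypothetical related entity
```

Because relationships only hold the identifier of the target, two systems that agree on the identifier can exchange or merge what they know about the same entity without sharing a table layout.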
Representing entities in RDF
The Semantic Web standardizes the way each entity is transmitted (and imported/exported) from Semantic Web systems. The entity we described in the preceding section can be represented in RDF as:
Most RDF entities will be much more complex than this example, yet it is useful for noting that:
- The type information is implicit when using a custom namespace and tag name
- The unique identifier is standardized within RDF
- Attributes and relationships are stored as child tags
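The three points above can be illustrated by building an RDF/XML document for the John Doe entity with Python's standard library. This is a sketch under assumptions: the `http://example.org/people#` namespace and the `p` prefix are hypothetical, and only the entity URI, the `Person` type and the `Name` attribute come from this article.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PEOPLE_NS = "http://example.org/people#"  # hypothetical custom namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("p", PEOPLE_NS)

root = ET.Element(f"{{{RDF_NS}}}RDF")

# The namespace plus the tag name carry the type; rdf:about holds the
# standardized unique identifier of the entity.
person = ET.SubElement(
    root,
    f"{{{PEOPLE_NS}}}Person",
    {f"{{{RDF_NS}}}about": "http://people/johnDoe"},
)

# Attributes and relationships become child tags.
name = ET.SubElement(person, f"{{{PEOPLE_NS}}}Name")
name.text = "John Doe"

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

The printed document contains a single `p:Person` element whose `rdf:about` attribute is the entity URI and whose only child is `p:Name`.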
What is the logic behind RDF?
The sample entity is actually a representation of a collection of facts about the entity. The facts are:
- Entity http://people/johnDoe is of type Person
- http://people/johnDoe has a Name of "John Doe" (string literal)
Semantic Web applications combine these facts with logic to answer more complex queries (for example, querying the list of HumanBeings, a synonym of Person, should return John Doe in the results).
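A minimal sketch of that idea, assuming the synonym is declared in a plain lookup table. The `synonyms` mapping and the `instances_of` helper are hypothetical names for this example; a real semantic engine would express the equivalence in an ontology instead.

```python
# The two facts about the sample entity, as (subject, predicate, object) tuples.
facts = {
    ("http://people/johnDoe", "type", "Person"),
    ("http://people/johnDoe", "Name", "John Doe"),
}

# Hypothetical piece of logic: HumanBeing is declared a synonym of Person.
synonyms = {"HumanBeing": "Person"}


def instances_of(type_name):
    """Return the subjects whose type is type_name or the type it is a synonym of."""
    canonical = synonyms.get(type_name, type_name)
    return sorted(s for (s, p, o) in facts if p == "type" and o == canonical)


print(instances_of("HumanBeing"))  # ['http://people/johnDoe']
```

No fact mentions HumanBeing directly; the result comes from combining the stored facts with the synonym rule.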
The directed graph and the collection of facts are different views of the same information. Every piece of information is stored as: [Entity] [Relationship] [Value], which is the same as [Subject] [Predicate] [Object]. This structure is known as a triple in RDF, and SPARQL queries match it with triple patterns.
In natural language, you could write [Orson Welles] [was born on] [May 6, 1915]. It is the same when relating two entities: [Orson Welles] [directed] [Citizen Kane]*. The entities don’t have to be of the same type (in this example I’m relating a Person to a Movie).
(*) A better design would be to model it as [Citizen Kane] [was directed by] [Orson Welles], but let’s put this discussion off for now.
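The Orson Welles statements can be stored and queried as plain tuples. The `match` helper below is a hypothetical illustration of matching a pattern against triples, with `None` playing the role of a variable; it is not how SPARQL engines are implemented. Plain strings stand in for the URIs a real store would use.

```python
triples = [
    ("Orson Welles", "was born on", "May 6, 1915"),
    ("Orson Welles", "directed", "Citizen Kane"),
]


def match(store, s=None, p=None, o=None):
    """Yield every triple that agrees with the pattern; None matches anything."""
    for triple in store:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# "What did Orson Welles direct?" leaves the object position open.
directed = [o for (_, _, o) in match(triples, s="Orson Welles", p="directed")]
print(directed)  # ['Citizen Kane']
```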
Can’t I just design a database this way?
Simplifying the model, you could have a table with "Subject", "Predicate" and "Object" columns. The first two columns could be unique identifiers (a relationship must also be identifiable), while the last column should be a special field that can store either a reference to another entity (URI), a literal value (string, date/time, decimal, etc.) or a blob.
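A rough sketch of such a table using SQLite from Python. The table name, the `object_kind` discriminator column and the `http://example.org/...` predicate URIs are assumptions made for this example; the article only specifies the three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject     TEXT NOT NULL,  -- URI of the entity
        predicate   TEXT NOT NULL,  -- URI of the relationship
        object      BLOB,           -- URI, literal value, or binary data
        object_kind TEXT NOT NULL   -- 'uri', 'literal' or 'blob' (assumed column)
    )
""")

conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?, ?)",
    [
        ("http://people/johnDoe", "http://example.org/type",
         "http://example.org/Person", "uri"),
        ("http://people/johnDoe", "http://example.org/name",
         "John Doe", "literal"),
    ],
)

rows = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
    ("http://people/johnDoe", "http://example.org/name"),
).fetchall()
print(rows)  # [('John Doe',)]
```

Adding a new kind of fact about a person only adds rows; the table definition never changes, which is the property the MOM-style design was after.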
For simple facts and queries this approach may work. It could even be the starting point for a Semantic Web system. Things get more complicated when adding validations (you can’t have more than one birthday, but you can have more than one name) or inferences that generate new information ("A isMotherOf B" whenever "B isChildOf A" and "A is a woman"). This is why Semantic Engines are built (they hide the complexity of storage and provide an interface to access the Semantic information).
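The validation and the inference mentioned above can be sketched on top of a set of triples. Everything here is illustrative: the `SINGLE_VALUED` set, the `add_fact` and `infer_mothers` helpers, the predicate spellings and the sample values are assumptions, and real engines express such rules declaratively rather than in application code.

```python
facts = {
    ("A", "type", "Woman"),
    ("B", "isChildOf", "A"),
}

# Validation: predicates that may hold at most one value per subject.
SINGLE_VALUED = {"birthday"}


def add_fact(store, s, p, o):
    """Add a fact, rejecting a second, different value for a single-valued predicate."""
    if p in SINGLE_VALUED and any(
        fs == s and fp == p and fo != o for fs, fp, fo in store
    ):
        raise ValueError(f"{s} already has a {p}")
    store.add((s, p, o))


def infer_mothers(store):
    """Rule: A isMotherOf B whenever B isChildOf A and A is a woman."""
    women = {s for s, p, o in store if p == "type" and o == "Woman"}
    return {
        (parent, "isMotherOf", child)
        for child, p, parent in store
        if p == "isChildOf" and parent in women
    }


facts |= infer_mothers(facts)         # generates ("A", "isMotherOf", "B")
add_fact(facts, "A", "name", "Ann")
add_fact(facts, "A", "name", "Annie") # allowed: more than one name
add_fact(facts, "A", "birthday", "1970-01-01")
# A second, different birthday for "A" would raise ValueError.
```

Even this toy version has to scan the whole store for every check, which hints at why dedicated engines take over storage, validation and inference.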
Products like IMM work with a backend database, and they use an intermediate component to encapsulate the process of retrieving and storing entities, and performing queries.
I’m planning to write an article about SPARQL (the Semantic Web query language, standardized by the W3C) and about ontologies, which give a context (and a meaning within that context) to Semantic Web information by setting a common language.