Web overlays: Some thoughts about machine-processable knowledge representation


(I have been pondering on whether I should post this for some time now. Let’s see what happens.)

A while back I read the “Google exec challenges Berners-Lee” article and I tried hard to find any real challenge to Tim Berners-Lee‘s position. I am very much a believer in machine-processable knowledge representation (perhaps influenced by the books I’ve been reading lately). So what if people/companies put false information out there by accident or try to misdirect us on purpose? So what if knowledge is not captured as accurately as we might have liked? That is not the fault of the technology or the thinking behind it. Deception, misinformation, wrong facts, etc. are all part of our daily lives (electronic or not), part of our society, an unfortunate reality of our times. Just as we learn to form well-rounded opinions about current affairs, just as we learn to be critical of news and fact sources in the real world, we should arm ourselves with the best tools technology has to offer in the electronic world. We could use advances in software and architecture to automate the process of keeping the culprits cornered, to filter them out. Junk filters are already doing something very similar (the 300+ messages per day automatically removed by my mail provider are testament to the way automation helps me). We should use technology to our advantage rather than attack technological advances because of their possible misuses.

I related the above article to some of the thinking I have been doing over the last few months around the Web and knowledge representation (aka the ‘Semantic Web’). I have hinted in past posts at the concept of ‘Web overlays’ that has been torturing my brain for a while. The following is a short summary of what the concept represents (if there is anything there at all, that is).

What are Web overlays?

Going into the future, I see hubs of information on the Web which will collect and process representations of knowledge from all around the world. The next search engine will not just do text-based search but will also interpret information; it will give us semantically rich results; it will be able to reason automatically about the information it collects, interpret it within different contexts, and tailor its answers to our domain-specific needs. Indeed, we are already seeing examples of collaborative or semantics-based search engines.

I believe that social networks, tagging, and various semantic annotation technologies are hints towards a “knowledge inference” future: webs of information overlaid on top of the same data, interpreted in different contexts; knowledge and new information automatically inferred and delivered; an electronic world where people on the Web are as much producers of information as they are consumers (the latter being mostly the case today).

The syndication paradigm (feeds and permalinks) in combination with Semantic Web concepts/ideas can be used to introduce new information layers on top of the traditional Web; these are what I call ‘Web overlays’. The concept doesn’t really represent anything new; it’s nothing radical. It’s about using microformats to capture information ontologically, URIs to correlate instances of the captured information, and syndication technologies to consume the produced information. Web overlays are created through the combination of well-known tags or URIs for creating relationships between lists (directly or indirectly). URIs here do not necessarily mean HTTP (i.e. a simple pointer to some resource); instead, a URI represents a relationship; it is an ontology-backed reference to some representation, a resource, a person, some knowledge, a concept, etc. For example, “this document contains the list of books I am about to read”, with each book identified as an ISBN URI [1]. Or, “this is the list of emotions I went through while watching this movie”, with both the emotions and the movie identified as URIs. An Overlay Web is not just about following links as REST teaches us; it’s not about state transfer and it’s definitely not about a hypermedia state machine. It’s all about resource representations, correlations between concepts and captured information, an expanding network of facts/statements about all aspects of our lives, a distributed network of captured knowledge.
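To make this concrete, here is a minimal Python sketch of how such overlay statements might be captured as URI-based triples. It is only an illustration: the example.org people and ontology URIs are hypothetical, invented for this sketch, and the ISBN values are placeholders.

    # A minimal sketch: an overlay statement as a triple of URIs.
    # All example.org URIs are hypothetical; the ISBN URNs use real URN
    # syntax but the values are placeholders.
    from typing import NamedTuple

    class Statement(NamedTuple):
        subject: str   # URI of the thing the statement is about
        relation: str  # URI naming the (ontology-backed) relationship
        obj: str       # URI of the related resource or concept

    # "This document contains the list of books I am about to read",
    # with each book identified by an ISBN URN.
    reading_list = [
        Statement("http://example.org/people/me",             # hypothetical
                  "http://example.org/ontology/plansToRead",  # hypothetical
                  "urn:isbn:0000000000"),
        Statement("http://example.org/people/me",
                  "http://example.org/ontology/plansToRead",
                  "urn:isbn:1111111111"),
    ]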

Future aggregators will be responsible for processing and trying to make sense of all the captured knowledge, helping us manage it, reason about it, and infer new facts from it. For example, we should be able to mine information on the Web in order to answer questions like “what was the most popular book over the last month?”, “how do teenagers feel about this movie?”, “who else has photographs of this building?”, “what do people of this country feel about the potential introduction of a new law on subject X?”, etc. Some of these questions can already be answered to a certain extent using today’s technologies. I have a suspicion that social networking is going to be the motivating factor for the adoption of Semantic Web-related ideas like Web overlays. The processing of the information people produce is going to be at the centre of the next evolution of the Web, rather than the mere consumption of the information that is already out there.
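As a hedged sketch of what such an aggregator might do, the fragment below answers the “most popular book over the last month” question by counting overlay statements harvested from many feeds. The statement shape and the plansToRead URI are the hypothetical ones from the earlier sketch; the harvesting, trust, and provenance machinery a real aggregator would need is elided.

    # A sketch of an overlay aggregator query. Statements are assumed to
    # arrive as (subject, relation, object, timestamp) tuples, already
    # harvested from syndication feeds (the harvesting itself is elided).
    from collections import Counter
    from datetime import datetime, timedelta

    PLANS_TO_READ = "http://example.org/ontology/plansToRead"  # hypothetical

    def most_popular_book(statements, days=30):
        """Return the most-asserted book URI over the last `days` days."""
        cutoff = datetime.utcnow() - timedelta(days=days)
        counts = Counter(
            obj for _subject, relation, obj, when in statements
            if relation == PLANS_TO_READ and when >= cutoff
        )
        return counts.most_common(1)  # e.g. [("urn:isbn:0000000000", 42)]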

I am thinking of writing a paper on this topic (as always, working together with Jim) and submitting it for consideration to WWW07. This is an unconventional way of going about it: I am effectively announcing the intention to write a paper rather than starting to talk about it only after it has been written and peer-reviewed 🙂 I am hoping to further describe the idea in various blog posts (if people think that it makes sense and it’s not something completely wacky) and also to release specs and technology to support it. I am counting on the community’s participation and feedback. If people are interested in contributing, they should contact me. Let’s see if this is going to work. It could all be a disaster (i.e. there is no merit or anything new in the ‘Web overlays’ concept) or something interesting could come out of it (if nothing else, it might at least attract attention to the Semantic Web). No matter what, I am hoping that interesting discussions and food for thought will emerge.

[1] The URIs could indeed be HTTP ones, as per the discussion on identifiers/links that took place on this blog a few weeks back.

4 responses to “Web overlays: Some thoughts about machine-processable knowledge representation”

  1. I’m thinking about overlay networks along similar lines, from a quite different starting point. However, the basic asymmetry in the producer/consumer relationship is, in my opinion, an ‘identity’ issue. The central web sites that people use to produce information are persistent and well identified. The browsers that people use to consume information are transient and largely anonymous. I’m leveraging the Skype peer-to-peer application messaging API in my work, since all the nodes are well identified (via PKI) and there is an essential symmetry as a starting point.

    Can you reduce your concept to a use case that could be tested in a simulation? I have a very efficient p2p/web simulator that can model large-scale networks; it might be an interesting collaboration….

  2. Where you will get the semweb crowd interested is that this set of layers is effectively missing from that architecture. There has been an implicit assumption that the data will be tidy and mostly consistent – that’s useful to academics because they can concentrate on the formalisms, but it leaves too much unsaid about real deployments. That is one reason why the semweb hasn’t seen mass adoption: most data out there is in such bad shape you can’t even parse it.

    It’s funny; I’m sitting on a blog post saying much the same about markup-soup parsers as being the lowest layer. Things like Tidy, Universal Feed Parser and BeautifulSoup are fundamental to what you’re talking about here – syntactic converters that work just like analog-to-digital converters. They set you up for making sense of the data. (A tiny sketch of the converter idea follows at the end of this comment.) After that it’s what the logicians call “abduction” – inferring the best possible explanation (aka “guessing”). You can’t do this without heuristics and a series of “smoothing” layers that normalise the data.

    Do some more looking into machine learning (stuff like InteRRaP) and one-shot learning theory. Getting signal from noise is also important research in robotics. There’s good basic work there on how to design layered hybrid architectures to support a reasoning engine. Also, for something that works right now in terms of answering questions and relaying new data back out, take a look at Nature’s Urchin engine:

    http://urchin.sourceforge.net/

    Finally, I think you’re being unrealistic about ignoring issues like deception and dissimulation in metadata. The risk is creating technology that won’t function outside a lab environment, because if you can’t distinguish malicious intent then your systems can be gamed (for example, this is bad for automated scenarios where payments are involved). I think instead a better tactic is to accept that people lie, and that not designing for this is a fallacy in exactly the same way that “the network is reliable” is a fallacy at the transport layer. There’s a good argument to be had that Google and the Semweb community are already making these kinds of layer-8 fallacies.
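    To make the “syntactic converter” point concrete, here is a tiny, hedged sketch using the BeautifulSoup package (the broken markup below is an invented example):

        # A tiny sketch of a "syntactic converter": a soup parser turning
        # badly formed markup into a tree that higher layers can work with.
        # Requires the bs4 package; the broken snippet is an invented example.
        from bs4 import BeautifulSoup

        broken = "<html><body><p>Hello <b>world</p>"  # unclosed/mismatched tags
        soup = BeautifulSoup(broken, "html.parser")   # repairs while parsing
        print(soup.get_text())                        # -> 'Hello world'
        print(soup.find("b"))                         # -> <b>world</b>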

  3. Hey Adrian,

    This sounds very interesting. I’ll get in touch to take this further, perhaps.

  4. Bill,

    Thank you so much for your great comments. I am going to follow up on your suggestions.

    I don’t think I am ignoring the issue of deception and dissimulation in metadata. On the contrary… I merely suggested that we can use technology to deal with it because there is always going to be badly captured information or malicious intent. We just need to accept the fact and try to make use of the best tools at our disposal. Obviously I didn’t get that across very well.

    Anyway, thanks for taking the time. I am going to be proposing a set of microformats very soon.