Archive: March 2008

Lots of clever people are gathered at Yahoo's headquarters for the first Data-Intensive Computing Symposium. I see lots of familiar faces and the program looks VERY interesting.

I really liked Randy Bryant's change of his DISC term (PDF download)... Data-Intensive Super Computing -> Data-Intensive Scalable Computing :-)

Relationships can have properties as well
26 Mar 2008, Updated: 26 Mar 2008

I was asked a very good question by the people who are going to be using our "research-output" platform. They want to be able to capture information like this: "Paper P was authored by Author A while A was a Microsoft employee". The use case is obvious. At Microsoft, as at any other organization, people come and go. It is important to be able to capture whether a researcher's stored work was undertaken while they were employed by the company.

Our "research-output" platform does not support this type of information explicitly. This is because we are not building an identity system. Instead, we expect that information about people is stored somewhere else (e.g. Active Directory, LDAP, etc.). Applications built on top of us make use of our API to capture the necessary information so that they can relate the information in our store with that residing elsewhere (through the use of URIs, for example). So how can we support the above use case? Well, there are a few approaches. Please allow me to expand on my favorite one.

As I said in my "Microsoft and 'Research-Output' Repositories" post, our "research-output" platform stores relationships between resources. I also suggested that our model is extensible: additional information can be attached to a relationship. For example, a triple of the form

    <Subject, Predicate, Object>
  

can have additional information associated with it in the form of name-value pairs (any number). This is our way of enabling developers to associate extra information with a relationship. In our initial release, the 'value' in the pair can only be a string. We are going to think about how we can support typed values as well (not very easy).

    <Subject, Predicate, Object, [<name, value>]*>
  

The above scenario can now be represented through the following tuple:

    <Paper P, Authored By, Author A, <While at Microsoft, True>>
  

Another scenario is the ordering of authors. Imagine you have the following triples:

    <Paper P, Authored By, Author A>
    <Paper P, Authored By, Author B>
  

What is the order of the authors? We can add a name-value pair to indicate the relative ordering between Objects when the Subject and the Predicate are the same.

    <Paper P, Authored By, Author A, <order, 2>, <While at Microsoft, True>>
    <Paper P, Authored By, Author B, <order, 1>, <While at Microsoft, False>>
  

Actually, we thought that the ordering scenario was common enough that we made it a typed property of our "Relationship" class.
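To make the tuples above a bit more concrete, here is a minimal sketch of how a relationship with name-value pairs and a typed ordering property could be modelled. The names `RelationshipSketch`, `RelationshipDemo`, `BuildAuthorship`, `AuthorsInOrder`, and `AuthoredWhileAtMicrosoft` are made up for illustration; this is not the platform's actual "Relationship" class or API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for <Subject, Predicate, Object, [<name, value>]*>.
// It is NOT the platform's real "Relationship" class.
public sealed class RelationshipSketch
{
    public string Subject { get; set; }
    public string Predicate { get; set; }
    public string Object { get; set; }

    // Ordering is common enough to get its own typed property.
    public int? Order { get; set; }

    // Any number of <name, value> pairs; values are plain strings,
    // mirroring the string-only restriction of the initial release.
    public Dictionary<string, string> Properties { get; } =
        new Dictionary<string, string>();
}

public static class RelationshipDemo
{
    // Builds the two "Authored By" tuples from the example above.
    public static List<RelationshipSketch> BuildAuthorship()
    {
        return new List<RelationshipSketch>
        {
            new RelationshipSketch
            {
                Subject = "Paper P", Predicate = "Authored By", Object = "Author A",
                Order = 2,
                Properties = { { "While at Microsoft", "True" } }
            },
            new RelationshipSketch
            {
                Subject = "Paper P", Predicate = "Authored By", Object = "Author B",
                Order = 1,
                Properties = { { "While at Microsoft", "False" } }
            }
        };
    }

    // Authors of a paper, sorted by the typed Order property
    // (relationships without an order come last).
    public static List<string> AuthorsInOrder(
        IEnumerable<RelationshipSketch> relationships, string paper)
    {
        return relationships
            .Where(r => r.Subject == paper && r.Predicate == "Authored By")
            .OrderBy(r => r.Order ?? int.MaxValue)
            .Select(r => r.Object)
            .ToList();
    }

    // Authors whose relationship carries <While at Microsoft, True>.
    public static List<string> AuthoredWhileAtMicrosoft(
        IEnumerable<RelationshipSketch> relationships, string paper)
    {
        return relationships
            .Where(r => r.Subject == paper && r.Predicate == "Authored By")
            .Where(r => r.Properties.TryGetValue("While at Microsoft", out var v)
                        && v == "True")
            .Select(r => r.Object)
            .ToList();
    }
}
```

With these two tuples, `AuthorsInOrder(BuildAuthorship(), "Paper P")` lists Author B before Author A, and `AuthoredWhileAtMicrosoft` returns only Author A. Because the values are strings, the application has to agree on what "True" means, which is one reason typed values would be nice to have.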

Comments and feedback are always more than welcome.

If you write papers and you interact with publishers (especially in the bio field), I think you are going to like our thinking with this add-in for Office. It supports the XML format used by the National Library of Medicine, an easy way to collect metadata about the paper, and templates that publishers can provide to authors. Here's the overview from the download page...

This Technology Preview release of the Article Authoring Add-in for Microsoft Word 2007 provides authors of scientific articles with the ability to read and write files from Word 2007 into the XML format used by the National Library of Medicine for archiving articles in the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature, PubMed Central.
This Technology Preview release is targeted at the staff of scientific and technical journals, Information Repositories, and early adopters within the scholarly authoring community, as well as developers of publishing solutions and workflows.

This work was led by Pablo Fernicola and supported by my team, under the direction of Lee Dirks. It's a fantastic effort and there is more to come.

The add-in is part of the ecosystem of tools and services, which of course includes the "research-output" repository, to support the scholarly communications community. If you are at Open Repositories 2008, grab me and I'll show you a demo of everything that we do.

Microsoft and "Research-Output" Repositories
24 Mar 2008, Updated: 24 Mar 2008

What is Microsoft going to show at the Open Repositories 2008 conference in a few days? Why is the entire "scholarly communications" section of the Technical Computing team going there? :-) Lee Dirks, Alex Wade, Santosh Balasubramanian (an honorary member!), and I are going to be there to interact with the community and to showcase, for the first time externally, our "research-output" repository platform.

We are looking forward to sharing our efforts over the last few months. We have been working hard on a platform for building repository-related services and tools. Our goal is to abstract the use of underlying technologies and provide an easy-to-use development model, based on .NET and LINQ, for building repositories on top of robust technologies.

The platform has a "semantic computing" flavor. The concepts of "resource" and "relationship" are first-class citizens in our platform API. We do offer a number of "research-output"-related entities for those who want to use them (e.g. "technical report", "thesis", "book", "software download", "data", etc.), all of which inherit from "resource". However, new entities can be introduced into the system (even programmatically) while the existing ones can be further extended through the addition of properties.

This means, obviously, that arbitrary relationships between resources can be established. Our platform comes with a number of "known" predicates (e.g. "added by", "authored by", "cites", etc.) but it is extensible to accommodate any new predicates developers want to introduce. Furthermore, we do not interpret the semantics of the relationships; we let applications define how to reason about them.

The concept of a "relationship" may make many think that we are building a triple-store, perhaps even speculate that we are using one. While we do store <subject, predicate, object> tuples, we have opted for a hybrid approach between a fully-blown relational schema and a triple-store. Our thesis is that by sitting in the middle of the "triple store <–> relational schema" spectrum, we will be able to stay flexible enough without impacting performance.


We’ve been interacting with the SQL Server group that was responsible for WinFS, both to make sure that all their technologies (e.g. Entity Framework, the FILESTREAM feature in SQL Server 2008, etc.) are incorporated and applied correctly and to learn from their experience. We believe that the approach to developing applications using our platform is intuitive and we hope to facilitate the emergence of an ecosystem of tools, services, and (Web) user interfaces, exactly like WinFS endeavored to do.

The first milestone of the platform is almost ready (less than 3 weeks left to go). The initial release will be only available internally to Microsoft since we have focused on supporting Microsoft Research’s efforts to build its own "research-output" repository. Our system will be the back-end of a future version of the Microsoft Research web site. After Milestone 1, we’ll focus on an immediate public release, which is going to be free for download by the community. In fact, we are seriously thinking of even releasing the code to CodePlex for the community to take and extend.

Since this is a platform, here’s a code snippet that illustrates the type of programming experience we are supporting.

    using System;

    // Create a representation for Jim
    Person jim = new Person { FirstName = "Jim", LastName = "Webber" };

    // Create a representation for the lecture
    // "Presenters" is how the "presented by" known predicate surfaces through the API
    Lecture lecture = new Lecture { Title = "Does my Bus look big in this?" };
    lecture.Presenters.Add(jim);

    // "Authors" is how the "authored by" known predicate surfaces through the API
    Book book = new Book { Title = "Realising Service Oriented Architectures Using Web Services" };
    book.Authors.Add(jim);

    Person savas = new Person { FirstName = "Savas", LastName = "Parastatidis" };

    // Introduce a new predicate
    Predicate friends = new Predicate { Uri = "urn:relationships:isFriend" };

    // Associate Jim and Savas
    Relationship jimsavas = new Relationship { Predicate = friends, Object = savas };
    jim.RelationshipsAsSubject.Add(jimsavas);

    // Enumerate all relationships in which
    // Jim participates (including the known ones)
    foreach (Relationship r in jim.RelationshipsAsSubject)
    {
        Console.WriteLine("tuple: <{0}, {1}, {2}>",
            r.Subject.Uri,
            r.Predicate.Uri,
            r.Object.Uri);
    }

    // And here's something for the Web folks :-)
    // Upload a file
    // ("context" is not defined in this snippet; it is assumed to be an
    // existing repository context object)
    File file = new File();
    context.UploadFileContent(file, "path to powerpoint 2007 presentation");

    // "Representation of" is one of the known predicates
    lecture.ResourceRepresentations.Add(file);

We are already well into the process of developing a collection of tools and interfaces on top of the platform as tangible examples of how to use it. We already have implementations of OAI-PMH, BibTeX import/export, a customized feed syndication service, and ASP.NET controls providing access to the repository, and we are working on search and a simple Web UI. We are also working on WPF and Silverlight tools for visualizing the relationships between the resources within our repository. Here’s a video of a prototype that tests the platform, with the WPF control showing randomly inserted resources and the relationships between them. We are working on having different colors and labels for the edges in the graphs.

Now, you may notice that the graph in the video resembles the one I’ve been using in my "data networking" talks and posts (e.g. the "Age of Semantics" post :-)

At the Open Repositories 2008 conference, we will formally unveil our work in advance of its official release and initiate interactions/exchanges with the DSpace, EPrints, Fedora, and other players in the repository community. This is crucial to us because—like every other project our group undertakes—we are intensely focused on interoperability.

I want to be very transparent here: our effort is intended to provide a repository option to those institutions/organizations that already license or have access to Microsoft software (including the free versions of the products, like SQL Server Express). Our platform is intended to sit on top of the existing Microsoft "stack". By providing this new research-output repository platform at no cost, we can offer added value for our existing (and future) customers in the academic and research space. It is critical to point out that we are making every effort to ensure our platform is optimized to make the best use of Microsoft technologies AND to also interoperate with all other existing systems and platforms in the repository ecosystem. We are actively seeking engagement and feedback from the community!

This is an initial effort. We have a long way to go before our platform, tools, and services match the features of those that have been around for years. However, we have to start from somewhere :-)

As you can tell, I am really excited to finally start speaking publicly about this project! And I look forward to your thoughts, comments, and ANY input on how we can improve our ongoing work in this space…

--

Many thanks to Lee Dirks for his contribution to the content of this post.

Loving last.fm
23 Mar 2008

After my latest soccer (erm... football) league game today, which we lost again :-(, I came to the office to catch up with some work. I installed last.fm for the first time and I am loving it. I felt like "Pink Floyd"-similar music while coding and voila...

Ah... the beauty of collective intelligence.

BTW... after Einar's recommendation I've been reading "Programming Collective Intelligence". I am only at the first chapter but it looks like an interesting book.

Experimenting with the myExperiment Web API
20 Mar 2008, Updated: 20 Mar 2008

I have been closely following the excellent work by the myExperiment team. They recently presented at my team's All-Hands Meeting here in Redmond. It was nice to see David De Roure (co-principal investigator) and Jiten Bhagat (main developer) again. Microsoft is, of course, proudly co-funding the effort.

A Web API exposing all the data was one of the things I was keen to see from the myExperiment team. This not only promotes integration but also helps demonstrate the value of "software + services". I used the myExperiment Web API, WPF, and a library that we are developing to visualize "data networks" to create this very simple example... (apologies for the quality but that's the best that YouTube can do, I think).

The nodes in the graph that don't have photographs are the previews of the workflows associated with each user. One can imagine all sorts of different types there in addition to different types of connections/edges.

We are already working on a similar prototype for the "research output" repository platform, which we are going to demonstrate at the Open Repositories 2008 conference. It implements lots of the "data networking" ideas.

Oh... and when the WPF graph component is finished, its source code will be available. Christof Sprenger of DPE is doing an excellent job at driving the execution on this.

I just love it and I am jealous. We should be doing things like this as well. After the Gapminder folks moved to Google, we were all expecting something like this to be released. The execution is just beautiful... components that can be used in iGoogle or just be embedded in applications like Spreadsheets. Excellent execution and delivery!

Well done to Google. They truly get it (not that anyone waited for me to say this :-)

As I said in a previous post, I went to Whistler for a few days. It was absolutely wonderful. I had some much-needed downtime and enjoyed skiing. The first couple of days were fantastic. Sunday morning was really beautiful but in the afternoon it got cloudy and icy so I cut the day short.

Panorama 2

(photos taken using my HTC Touch Cruise, so poor quality, and stitched together using Windows Live Photo Gallery)

On Saturday, after having completed a steep off-piste run, I was merging into a cat track when I suddenly found my face in the snow and then did a couple of rolls. I felt as if my skis had hit something hard. Well, at the side of the cat track and unseen to the naked eye, there was a well-hidden, very sharp rock :-( The result? My brand new skis got scarred really badly (the photo doesn't really show how deep the "wounds" really are :-). I just left them at REI for repairs :-(

IMAG0013 

Lesson learnt? I am buying a helmet because that could easily have been my head! And who's going to bother Microsoft people about data networks then? :-)

Today Microsoft and Intel announced a joint endeavor with UC Berkeley and University of Illinois at Urbana-Champaign around parallel computing. I've been heavily involved in this, coordinating Microsoft's engagement on the technical side of things. Remember that photo with all the senior Microsoft and Intel technical folks back in the summer?

After a year of work and preparations, I am thrilled that this was finally announced. I am not going to go through the details since there is going to be plenty of coverage from the PR folks of all those involved. Let's hope that this is going to be the start of a new era in parallel computing.

Oh... and please don't ask me about the choice of name for the centers :-) It was my first exposure to the process of how big companies decide on names :-)

(Microsoft announcement)

Natural language processing-based search in Vista
17 Mar 2008, Updated: 24 Mar 2008

Here's a feature I didn't know about. Alex Wade, who was the technical Program Manager responsible for this functionality in Windows Vista, showed me how to enable it. It's a pity that such a great feature is disabled by default.

So... go to Folder Options (you'll find it in the control panel), select the "search" tab, and enable "use natural language search". Then, you can write queries into Windows Search like the following:

"Show me all email messages from Lee Dirks after Jan 24" or "show me all email messages from Lee Dirks with attachments".

The surprising thing is that the results are accurate :-)

Well done Alex! Very cool stuff!

Update:

After my initial post, I got some more information from Alex. The feature is not truly based on Natural Language Processing. It converts sentences to <property, value> pairs, which are used for the queries. This is why I shouldn't have typed "show me all" (not that I really wanted all Lee's messages... internal joke :-) Here are some more examples of queries:

  • email from lee sent last week about tci
  • documents by savas about repositories modified last month
  • feeds from jonas
  • music by david bowie
  • rock music rating *****
  • pictures about plants or flowers taken may 2006

Jonas Barklund has a great post with more details.
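To illustrate the idea of turning a sentence into <property, value> pairs, here is a toy sketch. It is emphatically not how Windows Search works internally: the class `QuerySketch`, the method `Parse`, the fallback property name "kind", and the keyword list are all assumptions picked only to cover the example queries above.

```csharp
using System;
using System.Collections.Generic;

public static class QuerySketch
{
    // Hypothetical property keywords, chosen to match the sample queries.
    private static readonly HashSet<string> PropertyNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "from", "by", "about", "sent", "modified", "taken", "rating" };

    // Splits a query into <property, value> pairs. Words that appear
    // before the first keyword are collected under the made-up name "kind".
    public static List<KeyValuePair<string, string>> Parse(string query)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        var words = new List<string>();
        string current = "kind";

        foreach (string token in query.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (PropertyNames.Contains(token))
            {
                // A keyword closes the previous pair and opens a new one.
                Flush(pairs, current, words);
                current = token.ToLowerInvariant();
            }
            else
            {
                words.Add(token);
            }
        }

        Flush(pairs, current, words);
        return pairs;
    }

    // Emits the pending pair, if it has a value, and resets the word buffer.
    private static void Flush(
        List<KeyValuePair<string, string>> pairs, string name, List<string> words)
    {
        if (words.Count > 0)
        {
            pairs.Add(new KeyValuePair<string, string>(name, string.Join(" ", words)));
            words.Clear();
        }
    }
}
```

Under these assumptions, "email from lee sent last week about tci" becomes <kind, email>, <from, lee>, <sent, last week>, <about, tci>. It also hints at why "show me all" gets in the way: in this sketch those words would simply end up inside the value of the first pair.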

This article is 4 years old but it's the first time I read it. Antanas Mockus is not the mayor of Bogota anymore (according to Wikipedia). Definitely inspirational... I wonder what the world would look like if policy-making followed Mockus' example!

(Thanks to my friend Onoufrios for forwarding)

As I mentioned last week, my group is about to make a big announcement. The WSJ article tells only half of the story. Stay tuned until tomorrow!

Whistler
13 Mar 2008

I was at Whistler last weekend and I am going there again in a couple of hours. Yes, I finally took some time off (today and tomorrow). Last weekend was fantastic but I didn't really enjoy the skiing. This week I will also spend some time with my friend Colette and will introduce Einar to Whistler. I do hope to finish one of those Web book chapters :-)

IMG_6492 IMG_6482
IMG_6490 IMG_6502

(weekend of March 8-9)

Finally, one of the big things I've been working on is coming to light... This is going to be a big announcement :-) More coverage and info next week.

This is very impressive.

From Engadget

(update: as Dan pointed out, the video has been removed).

The Age of Semantics
11 Mar 2008, Updated: 24 Mar 2008

In a recent presentation by Tony Hey on "eScience and Semantic Computing" (large PDF), which is heavily influenced by Evelyne Viegas's and my ideas, the "ages of semantic computing" were presented (Evelyne's idea) as a way to capture how our thinking has evolved over the years...

  • Enlightenment age: "It is not all about philosophy, it is about ARTIFICIAL INTELLIGENCE"
  • Romantic age: "Well, it is not about AI… but we can model the WORLD’s KNOWLEDGE. And we should!"
  • Psychedelic age: "Perhaps we cannot model the world’s knowledge but we can declare success if we universally agree to some common labels"
  • Renaissance age: "Nah… Ontology is overrated… Let’s use community-driven categories instead: (tags, folksonomies, microformats, etc.)"
  • Age of reason: "Modest, simple steps. Embrace the loosely-coupled, chaotic nature of the Web. Make use of the experience and expertise built throughout the previous ages. Machine learning, statistics, formats, ontologies… they all have a role to play."

The full presentation (apologies for the quality of the images in the PDF output) is available from the Practical Semantic Astronomy workshop website.

In the presentation, you'll notice the concept of "data networks", which is something I've been writing about for a few years now (remember "Web overlays: Some thoughts about machine-processable knowledge representation"?). In some of my Grid/Cloud computing talks I've also hinted at the concept, using the following diagram...

image

Well, Evelyne and I have been interacting with some very clever people inside Microsoft. We are all excited about the opportunities of "semantic computing" in the future of software and services. Andrea Westerinen is amongst those people. She is part of my previous team, the Connected Systems Division architecture team, and she's recently started blogging about modeling, semantics, etc. Very cool! Reading her first few entries reminded me of the "declarative distributed computing" ideas accompanying MEST in my blog entries a few years back (e.g. the "On description languages, REST, the Web, MEST, SSDL, and 'declarative distributed computing'" post from back in May 2005). I came to Microsoft to live that dream but it didn't really work out immediately through my initial gig. I am thrilled now to see ideas taking shape and I am excited about what the CSD folks are preparing for us (even though I don't have any inside information, I suspect that we'll be hearing a lot more about what's in store at the upcoming developer conferences). It goes without saying that I've subscribed to Andrea's blog and I am really looking forward to further interacting with her.

Evelyne and I are creating a community inside Microsoft around the concept of "semantic computing". We use this term in order to avoid the reaction that some have when they hear "Semantic Web", "RDF", "OWL", ontologies, etc. We really don't care about specific technologies or approaches. Give us the ability to reason about data, to infer information, to manage knowledge and call it anything you like :-)

Back to Data Networking... well, in the same presentation on "eScience and Semantic Computing", you'll notice that there are some slides about our "research output" repository platform. This is a project I've been working on for the last 3 months with a fantastic group of people. We have finally agreed to start talking about it in public so I am really looking forward to writing more. The platform implements some of the ideas around "data networks" (not very dissimilar to Tim Berners-Lee's "Giant Global Graph"). Stay tuned in the next couple of days.

I've been very lucky over the last few months to collaborate and interact with clever people like Evelyne Viegas, Jim Karkanias, Andrea Westerinen, and others who really "get it". It's been and continues to be a blast!

Mac vs PC vs Linux
1 Mar 2008, Updated: 2 Mar 2008

Jim just pointed me to this funny video. Very funny :-)

(seen in Michael Calore's blog)