Microsoft and "Research-Output" Repositories

What is Microsoft going to show at the Open Repositories 2008 conference in a few days? Why is the entire "scholarly communications" section of the Technical Computing team going there? 🙂 Lee Dirks, Alex Wade, Santosh Balasubramanian (an honorary member!), and I are going to be there to interact with the community and to showcase, for the first time externally, our "research-output" repository platform.

We are looking forward to sharing our efforts over the last few months. We have been working hard on a platform for building repository-related services and tools. Our goal is to abstract the use of underlying technologies and provide an easy-to-use development model, based on .NET and LINQ, for building repositories on top of robust technologies.

The platform has a "semantic computing" flavor. The concepts of "resource" and "relationship" are first-class citizens in our platform API. We do offer a number of "research-output"-related entities for those who want to use them (e.g. "technical report", "thesis", "book", "software download", "data", etc.), all of which inherit from "resource". However, new entities can be introduced into the system (even programmatically) while the existing ones can be further extended through the addition of properties.
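As a rough illustration of the entity model described above (not the platform's actual API), the sketch below uses hypothetical stand-in classes. The names Resource, TechnicalReport, Dataset, and ExtendedProperties are assumptions made for this example: a built-in style entity and a developer-defined entity both inherit from a common base, and extra properties are attached through a simple property bag.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the platform's base entity; not the real API.
public class Resource
{
    public string Uri { get; set; } = "";
    public string Title { get; set; } = "";

    // Assumption: properties added after the fact live in a name/value bag.
    public Dictionary<string, object> ExtendedProperties { get; } =
        new Dictionary<string, object>();
}

// A "research-output" style entity that inherits from Resource.
public class TechnicalReport : Resource
{
    public string ReportNumber { get; set; } = "";
}

// A new entity introduced by an application developer.
public class Dataset : Resource
{
    public long SizeInBytes { get; set; }
}

public static class Program
{
    // Works against the base type only, so it accepts any entity.
    public static string Describe(Resource r)
    {
        return $"{r.GetType().Name}: {r.Title} ({r.ExtendedProperties.Count} extended properties)";
    }

    public static void Main()
    {
        var report = new TechnicalReport
        {
            Uri = "urn:example:tr-1",
            Title = "Sample report",
            ReportNumber = "TR-1"
        };
        // Extend an existing entity with a property it was not designed with.
        report.ExtendedProperties["Funder"] = "Example Fund";

        var data = new Dataset
        {
            Uri = "urn:example:ds-1",
            Title = "Sample data",
            SizeInBytes = 1024
        };

        foreach (Resource r in new Resource[] { report, data })
        {
            Console.WriteLine(Describe(r));
        }
    }
}
```

Code written against the base type keeps working when new entity types appear, which is the main point of having every entity inherit from "resource".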

This means, obviously, that arbitrary relationships between resources can be established. Our platform comes with a number of "known" predicates (e.g. "added by", "authored by", "cites", etc.) but it is extensible to accommodate any new predicates developers want to introduce. Furthermore, we do not interpret the semantics of the relationships; we let applications define how to reason about them.
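To make the split between storing relationships and interpreting them concrete, here is a minimal in-memory sketch. All types are hypothetical stand-ins rather than the platform's classes, and the helper RelatedEitherWay is invented for this example. The store holds plain tuples, while the application decides, through a LINQ query, that a particular predicate should be read as symmetric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins; not the platform's real types.
public class Resource
{
    public string Uri { get; set; } = "";
}

public class Predicate
{
    public string Uri { get; set; } = "";
}

public class Relationship
{
    public Resource Subject { get; set; } = new Resource();
    public Predicate Predicate { get; set; } = new Predicate();
    public Resource Object { get; set; } = new Resource();
}

public static class Program
{
    // Application-level rule: read the given predicate as symmetric.
    // The store only holds tuples; this interpretation lives in the application.
    public static List<string> RelatedEitherWay(
        IEnumerable<Relationship> store, Resource who, Predicate predicate)
    {
        return store
            .Where(r => r.Predicate.Uri == predicate.Uri)
            .Where(r => r.Subject == who || r.Object == who)
            .Select(r => r.Subject == who ? r.Object.Uri : r.Subject.Uri)
            .Distinct()
            .OrderBy(uri => uri, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        var jim = new Resource { Uri = "urn:example:people:jim" };
        var savas = new Resource { Uri = "urn:example:people:savas" };
        var friends = new Predicate { Uri = "urn:relationships:isFriend" };

        // Only the tuple <jim, isFriend, savas> is stored.
        var store = new List<Relationship>
        {
            new Relationship { Subject = jim, Predicate = friends, Object = savas }
        };

        // The application still finds the link when starting from savas.
        foreach (string uri in RelatedEitherWay(store, savas, friends))
        {
            Console.WriteLine(uri);
        }
    }
}
```

A different application could treat the same predicate as one-directional without any change to the stored data.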

The concept of a "relationship" may make many think that we are building a triple-store, and perhaps even speculate that we are using one. While we do store <subject, predicate, object> tuples, we have opted for a hybrid approach between a full-blown relational schema and a triple-store. Our thesis is that by sitting in the middle of the "triple store <-> relational schema" spectrum, we will be able to stay flexible enough without impacting performance.


We've been interacting with the SQL Server group that was responsible for WinFS, both to make sure that all their technologies (e.g. Entity Framework, the FILESTREAM feature in SQL Server 2008, etc.) are incorporated and applied correctly and to learn from their experience. We believe that the approach to developing applications using our platform is intuitive, and we hope to facilitate the emergence of an ecosystem of tools, services, and (Web) user interfaces, just as WinFS endeavored to do.

The first milestone of the platform is almost ready (less than 3 weeks left to go). The initial release will be only available internally to Microsoft since we have focused on supporting Microsoft Research's efforts to build its own "research-output" repository. Our system will be the back-end of a future version of the Microsoft Research web site. After Milestone 1, we'll focus on an immediate public release, which is going to be free for download by the community. In fact, we are seriously thinking of even releasing the code to CodePlex for the community to take and extend.

Since this is a platform, here's a code snippet that illustrates the type of programming experience we are supporting.

    // Create a representation for Jim
    Person jim = new Person { FirstName = "Jim", LastName = "Webber" };

    // Create a representation for the lecture
    // "Presenters" is how the "presented by" known predicate surfaces through the API
    Lecture lecture = new Lecture { Title = "Does my Bus look big in this?" };
    lecture.Presenters.Add(jim);

    // "Authors" is how the "authored by" known predicate surfaces through the API
    Book book = new Book { Title = "Realising Service Oriented Architectures Using Web Services" };
    book.Authors.Add(jim);

    Person savas = new Person { FirstName = "Savas", LastName = "Parastatidis" };

    // Introduce a new predicate
    Predicate friends = new Predicate { Uri = "urn:relationships:isFriend" };

    // Associate Jim and Savas
    Relationship jimsavas = new Relationship { Predicate = friends, Object = savas };
    jim.RelationshipsAsSubject.Add(jimsavas);

    // It enumerates all relationships in which
    // Jim participates (including the known ones)
    foreach (Relationship r in jim.RelationshipsAsSubject)
    {
        Console.WriteLine("tuple: <{0}, {1}, {2}>",
            r.Subject.Uri,
            r.Predicate.Uri,
            r.Object.Uri);
    }

    // And here's something for the Web folks :-)
    // Upload a file
    File file = new File();
    context.UploadFileContent(file, "path to powerpoint 2007 presentation");
    // "Representation of" is one of the known predicates
    lecture.ResourceRepresentations.Add(file);

We are already well into the process of developing a collection of tools and interfaces on top of the platform as tangible examples of how to use it. We already have implementations of OAI-PMH, BibTeX import/export, a customized feed syndication service, and ASP.NET controls providing access to the repository, and we are working on Search and a simple Web UI. We are also working on WPF and Silverlight tools for visualizing the relationships between the resources within our repository. Here's a video of a prototype built to test the platform and the WPF control; it shows the relationships between randomly inserted resources. We are working on having different colors and labels for the edges in the graphs.
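To give a feel for what a BibTeX export involves, here is a small self-contained sketch. It is not the platform's exporter: the Person and Book classes are simplified stand-ins, and BibTexExporter and its citationKey parameter are names invented for this example. It emits only the author and title fields and does no escaping of special characters, so a real exporter would need to do more.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Simplified stand-ins for repository entities; not the platform's types.
public class Person
{
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
}

public class Book
{
    public string Title { get; set; } = "";
    public List<Person> Authors { get; } = new List<Person>();
}

// Hypothetical helper that renders a Book as a BibTeX @book entry.
public static class BibTexExporter
{
    public static string ToBibTex(Book book, string citationKey)
    {
        // BibTeX separates multiple authors with " and ".
        string authors = string.Join(
            " and ",
            book.Authors.Select(a => $"{a.LastName}, {a.FirstName}"));

        var sb = new StringBuilder();
        sb.Append("@book{").Append(citationKey).Append(",\n");
        sb.Append("  author = {").Append(authors).Append("},\n");
        sb.Append("  title = {").Append(book.Title).Append("}\n");
        sb.Append("}");
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        var book = new Book { Title = "Realising Service Oriented Architectures Using Web Services" };
        book.Authors.Add(new Person { FirstName = "Jim", LastName = "Webber" });

        Console.WriteLine(BibTexExporter.ToBibTex(book, "webber"));
    }
}
```

Publisher and year are left out because the post does not give them; a complete @book entry would normally carry both.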

Now, you may notice that the graph in the video resembles the one I've been using in my "data networking" talks and posts (e.g. the "Age of Semantics" post) 🙂

At the Open Repositories 2008 conference, we will formally unveil our work in advance of its official release and initiate interactions/exchanges with the DSpace, EPrints, Fedora, and other players in the repository community. This is crucial to us because, like every other project our group undertakes, we are intensely focused on interoperability.

I want to be very transparent here: our effort is intended to provide a repository option to those institutions/organizations that already license or have access to Microsoft software (including the free versions of the products, like SQL Server Express). Our platform is intended to sit on top of the existing Microsoft "stack". By providing this new research-output repository platform at no cost, we can offer added value for our existing (and future) customers in the academic and research space. It is critical to point out that we are making every effort to ensure our platform is optimized to make the best use of Microsoft technologies AND to also interoperate with all other existing systems and platforms in the repository ecosystem. We are actively seeking engagement and feedback from the community!

This is an initial effort. We have a long way to go before our platform, tools, and services match the features of those that have been around for years. However, we have to start somewhere 🙂

As you can tell, I am really excited to finally start speaking publicly about this project! And I look forward to your thoughts, comments, and ANY input on how we can improve our ongoing work in this space...

-

Many thanks to Lee Dirks for his contribution to the content of this post.

11 responses to “Microsoft and "Research-Output" Repositories”

  1. Here at ChemSpider we have been working with other people at Microsoft on integrating ChemSpider web services into their system (see for example: http://www.chemspider.com/blog/microsoft-advance-the-web-services-integration-with-infomesa.html)

    At ChemSpider we have built a free access database of chemical structures and related information and data. We’ve done it all on the Microsoft platform and are now working to produce WiChemPedia (http://www.chemspider.com/blog/wichempedia-is-now-on-its-way.html) on Microsoft Sharepoint (http://www.chemspider.com/blog/the-chemspider-team-chooses-our-future-platform-for-collaboration-microsoft-sharepoint.html).

    I wonder how this new capability you are developing can add new functionality to our already existing platform. I’d love to connect…

  2. “we are intensely focused on interoperability” – so can we look forward to Web standards-based linked data made available via RDF over HTTP, along with SPARQL endpoints? More fundamentally, can you confirm you’ll be giving entities and relationships dereferenceable URIs? (If not, I see no Web data interop).

  3. Hi Danny,

    Since we are building a platform, there is no reason why someone couldn’t create a tool to convert our data to RDF. In fact, we are discussing the possibility of doing this ourselves. As for exposing each entity through a URI, again this depends on the deployment and the specific tools/services that are going to be used. Each resource in our store has a “URI” property but the applications built on top of us can decide how to make use of it and how to expose it.

    In milestone 2, we are thinking of working on implementing OAI-ORE (RDF-based vocabulary) export functionality for our compound entities. SPARQL support may be slightly more difficult but it is definitely something we are discussing!

    Nothing has been finalized yet, so we are really keen to hear feedback from the community.

    Regards,

    .savas.

  4. Thanks.

    Interop is all I ask 🙂

  5. “Our platform is intended to sit on top of the existing Microsoft ‘stack’.”

    Congratulations on the development and planned open release of the repository. I think your system will hit a sweet-spot for Microsoft-oriented shops that are looking to get into trusted repositories.

  6. Many thanks to you, Alex, Santosh and Lee for following up with me one on one at Southampton. As a Microsoft shop, I am looking forward to finding ways to incorporate your platform into some of our applications once it becomes available. Regards,

    Juan

  7. Hmm, the platform seems to be nice, in fact great.

  8. It's really great.

  9. One of the best ways to take advantage of video conferencing is to have the right equipment. Before getting started with a video meeting, it’s important that you familiarize yourself with the components that you’ll need.

  10. Great!Great!Great!

    I think that’s one of microsoft’s biggest achievement.

    Micro ROX!!

  11. This is great news, if you put this on Codeplex I might have to put my own project to build a hybrid ontology store with SQL Server on hold! If you do any early Beta or CTP releases I would be keen to get access to them.