Microsoft and "Research-Output" Repositories

What is Microsoft going to show at the Open Repositories 2008 conference in a few days? Why is the entire “scholarly communications” section of the Technical Computing team going there? 🙂 Lee Dirks, Alex Wade, Santosh Balasubramanian (an honorary member!), and I are going to be there to interact with the community and to showcase, for the first time externally, our “research-output” repository platform.

We are looking forward to sharing our efforts over the last few months. We have been working hard on a platform for building repository-related services and tools. Our goal is to abstract the use of underlying technologies and provide an easy-to-use development model, based on .NET and LINQ, for building repositories on top of robust technologies.

The platform has a “semantic computing” flavor. The concepts of “resource” and “relationship” are first-class citizens in our platform API. We do offer a number of “research-output”-related entities for those who want to use them (e.g. “technical report”, “thesis”, “book”, “software download”, “data”, etc.), all of which inherit from “resource”. However, new entities can be introduced into the system (even programmatically) while the existing ones can be further extended through the addition of properties.

This means, obviously, that arbitrary relationships between resources can be established. Our platform comes with a number of “known” predicates (e.g. “added by”, “authored by”, “cites”, etc.) but it is extensible to accommodate any new predicates developers want to introduce. Furthermore, we do not interpret the semantics of the relationships; we let applications define how to reason about them.
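To make the "we let applications define how to reason" point concrete, here is a minimal, self-contained sketch. It does not use the platform API: `Triple` and `FriendshipReasoner` are hypothetical stand-ins invented for illustration, and the `urn:people:...` identifiers are made up. The idea is that the store only records tuples, while the application decides that the "isFriend" predicate should be read as symmetric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a stored <subject, predicate, object> tuple.
// This is NOT a type from the platform.
public sealed class Triple
{
    public Triple(string subject, string predicate, string obj)
    {
        Subject = subject;
        Predicate = predicate;
        Obj = obj;
    }

    public string Subject { get; }
    public string Predicate { get; }
    public string Obj { get; }
}

// Application-side reasoning: the store attaches no meaning to "isFriend";
// this class chooses to treat it as a symmetric relationship.
public static class FriendshipReasoner
{
    public const string IsFriend = "urn:relationships:isFriend";

    public static IEnumerable<string> FriendsOf(IEnumerable<Triple> triples, string person)
    {
        return triples
            .Where(t => t.Predicate == IsFriend)
            .SelectMany(t =>
                t.Subject == person ? new[] { t.Obj } :
                t.Obj == person ? new[] { t.Subject } :
                Array.Empty<string>())
            .Distinct();
    }
}
```

With a single stored tuple <urn:people:jim, isFriend, urn:people:savas>, calling `FriendsOf` for either person returns the other, even though only one direction was stored. A different application could just as well read the same tuple as one-directional.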

The concept of a “relationship” may make many think that we are building a triple store, or even speculate that we are using one. While we do store <subject, predicate, object> tuples, we have opted for a hybrid approach that sits between a full-blown relational schema and a triple store. Our thesis is that by sitting in the middle of the “triple store <-> relational schema” spectrum, we can stay flexible without sacrificing performance.
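One way to picture such a middle ground is sketched below. This is only an illustration under my own assumptions, not our actual schema: `ResourceRow`, `RelationshipRow`, `HybridQueries`, and the example predicate URI are all hypothetical names. Resources live in typed rows with fixed columns (the relational side), while every relationship goes into one generic tuple table (the triple side), and a LINQ join brings the two together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical "relational" side: each resource is a row with fixed, typed columns.
public sealed class ResourceRow
{
    public int Id { get; set; }
    public string Kind { get; set; }   // e.g. "Book", "Person"
    public string Title { get; set; }
}

// Hypothetical "triple" side: one generic table holds every relationship.
public sealed class RelationshipRow
{
    public int SubjectId { get; set; }
    public string PredicateUri { get; set; }
    public int ObjectId { get; set; }
}

public static class HybridQueries
{
    // Returns the titles of all resources that appear as the object of
    // <subjectId, predicateUri, ?> by joining the tuple table to the typed rows.
    public static List<string> ObjectTitles(
        IEnumerable<ResourceRow> resources,
        IEnumerable<RelationshipRow> relationships,
        int subjectId,
        string predicateUri)
    {
        var query =
            from rel in relationships
            where rel.SubjectId == subjectId && rel.PredicateUri == predicateUri
            join res in resources on rel.ObjectId equals res.Id
            select res.Title;

        return query.ToList();
    }
}
```

In this sketch, adding a new predicate needs no schema change (it is just a new `PredicateUri` value), while lookups of a resource's own properties stay ordinary column reads.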

We’ve been interacting with the SQL Server group that was responsible for WinFS, both to make sure that their technologies (e.g. Entity Framework and the FILESTREAM feature in SQL Server 2008) are applied correctly and to learn from their experience. We believe that the approach to developing applications using our platform is intuitive, and we hope to facilitate the emergence of an ecosystem of tools, services, and (Web) user interfaces, just as WinFS endeavored to do.

The first milestone of the platform is almost ready (less than 3 weeks left to go). The initial release will be available only internally at Microsoft, since we have focused on supporting Microsoft Research’s efforts to build its own “research-output” repository. Our system will be the back-end of a future version of the Microsoft Research web site. After Milestone 1, we’ll focus on an immediate public release, which is going to be free for the community to download. In fact, we are seriously thinking of even releasing the code on CodePlex for the community to take and extend.

Since this is a platform, here’s a code snippet that illustrates the type of programming experience we are supporting.

```csharp
// Create a representation for Jim
Person jim = new Person { FirstName = "Jim", LastName = "Webber" };

// Create a representation for the lecture
// "Presenters" is how the "presented by" known predicate surfaces through the API
Lecture lecture = new Lecture { Title = "Does my Bus look big in this?" };
lecture.Presenters.Add(jim);

// "Authors" is how the "authored by" known predicate surfaces through the API
Book book = new Book { Title = "Realising Service Oriented Architectures Using Web Services" };
book.Authors.Add(jim);

Person savas = new Person { FirstName = "Savas", LastName = "Parastatidis" };

// Introduce a new predicate
Predicate friends = new Predicate { Uri = "urn:relationships:isFriend" };

// Associate Jim and Savas
Relationship jimsavas = new Relationship { Predicate = friends, Object = savas };
jim.RelationshipsAsSubject.Add(jimsavas);

// Enumerate all relationships in which
// Jim participates (including the known ones)
foreach (Relationship r in jim.RelationshipsAsSubject)
{
    Console.WriteLine("tuple: <{0}, {1}, {2}>",
        r.Subject.Uri,
        r.Predicate.Uri,
        r.Object.Uri);
}

// And here's something for the Web folks :-)
// Upload a file ("context" is the platform context object, created elsewhere and not shown here)
File file = new File();
context.UploadFileContent(file, "path to powerpoint 2007 presentation");

// "Representation of" is one of the known predicates
lecture.ResourceRepresentations.Add(file);
```

We are already well into the process of developing a collection of tools and interfaces on top of the platform as tangible examples of how to use it. We already have implementations of OAI-PMH, BibTeX import/export, a customized feed syndication service, and ASP.NET controls providing access to the repository, and we are working on search and a simple Web UI. We are also working on WPF and Silverlight tools for visualizing the relationships between the resources within our repository. Here’s a video of a prototype built to test the platform, with the WPF control showing a graph of randomly inserted resources and relationships. We are working on having different colors and labels for the edges in the graphs.

Now, you may notice that the graph in the video resembles the one I’ve been using in my “data networking” talks and posts (e.g. the “Age of Semantics” post) 🙂

At the Open Repositories 2008 conference, we will formally unveil our work in advance of its official release and initiate interactions/exchanges with the DSpace, EPrints, Fedora, and other players in the repository community. This is crucial to us because, like every other project our group undertakes, we are intensely focused on interoperability.

I want to be very transparent here: our effort is intended to provide a repository option to those institutions/organizations that already license or have access to Microsoft software (including the free versions of the products, like SQL Server Express). Our platform is intended to sit on top of the existing Microsoft “stack”. By providing this new research-output repository platform at no cost, we can offer added value for our existing (and future) customers in the academic and research space. It is critical to point out that we are making every effort to ensure our platform is optimized to make the best use of Microsoft technologies AND to also interoperate with all other existing systems and platforms in the repository ecosystem. We are actively seeking engagement and feedback from the community!

This is an initial effort. We have a long way to go before our platform, tools, and services match the features of those that have been around for years. However, we have to start somewhere 🙂

As you can tell, I am really excited to finally start speaking publicly about this project! And I look forward to your thoughts, comments, and ANY input on how we can improve our ongoing work in this space…

–

Many thanks to Lee Dirks for his contribution to the content of this post.