Jim and I have been discussing graph store/database design for a long time now. As it is always the case, we argue :-) Whether it’s about graph store design, indexing, or caching and partitioning, we will always find a topic on which we can have fun (most of the time over beers :-)
I started working on my first graph store, Zentity, back in 2006-2007, during my Microsoft Research years. Even back in those days, I talked about relationships with properties and relationships on relationships as key design choices for a graph store. I started thinking about the role of First Order Predicate Logic with Probabilities and ways to implement large, scalable graph stores that would efficiently provide access to the world’s information and knowledge. Zentity was the beginning of my exploration with many graph-related ideas. That exploration hasn’t stopped yet :-)
The Neo4j team made an early design decision to support properties on graphs and relationships but to not support relationships as targets of or sources for other relationships. They coined the term “property graph” for the resulting model, a term that is catching on in the graph store community. I don’t know if the early implementations of neo4j supported properties on relationships but recent versions do.
Perhaps it’s the influence that Scheme and First Order Predicate Logic have had in my way of thinking but I’d like the property graph model to take that extra step in how relationships are treated. When creating a graph-based data model, I’d like to avoid any mental gymnastics in the way I represent information. Consider the following:
Bob = Person( FirstName("Bob") LastName("Smith") YearOfBirth(1973) )
Alice = Person( FirstName("Alice") LastName("Parker") YearOfBirth(1975) )
BobMarriedAlice = Married( Bob Alice Date(Apr 1, 2001) )
It’s easy to figure out what the above predicates represent. “Bob” and “Alice” represent two people while “BobMarriedAlice” represents the fact that they got married in 2001. They are all predicates. In fact, “FirstName”, “LastName”, “YearOfBirth”, and “Date” are also predicates. This is not dissimilar to the RDF and property graph models. The transformation from one representation model to the other is straightforward. With Neo4j, we can define “Bob” and “Alice” as graph nodes. Then, we can create the “Married” relationship between them. The relationship is given an identifier (“rel123”). Using that identifier we can add the “Date” property to the relationship.
Now, let’s say that we want to represent the following information:
Carole = Person( FirstName("Carole") LastName("Walter") YearOfBirth(1973) )
CaroleWitnessedMarriage = Witnessed( Carole BobMarriedAlice )
The above says that “Carole witnessed the fact that Bob married Alice on Apr 1st, 2001”. I don’t know if it’s a limitation of the graph property model or just a design choice for the neo4j implementation but, unfortunately, I cannot represent the above information as one might have intuitively thought.
Even though Neo4j assigns identifiers to relationships, those identifiers cannot be used as the target or source of another relationship. We have to reify the relationship in order to represent the information above. We now have to create a new graph node to represent the fact that “Bob married Alice” as a graph node.
Please note that we now have to introduce two additional “Participated” edges that only add complexity to our representation.* On the other hand, if we take that additional step in the model and allow relationships to participate in other relationships, then we won’t have to do the above. The general model can be defined as follows:
- Graph: set of nodes;
- Node: a dictionary of key-value properties with id as the mandatory key;
- Relationship: a dictionary of key-value properties with id, subject, predicate, and object as mandatory keys.
With the above model, we no longer need the introduction of additional edges in the graph.
Please note that this is not a hypergraph, even though I have been thinking about hypergraphs as a way to represent knowledge. Unlike hypergraphs, the concept of a single directed edge still exists. It’s just that the relationships, as represented by a directed edge, can be associated with other incoming/outgoing relationships.
The choice of the above model has implications on indexing, navigation, optimization, query expressions, and so on. For example, if we wanted to search for all marriages, in the neo4j case we would have to search for all nodes with “type” property set to “marriage”. In the above case, we would have to search for all edges with the “predicate” property set to “married”.
That was a long way to say that the model I was describing to Jim was not a hypergraph :-)
For the record, I think that the property graph approach implemented by Neo4j is great. The team is doing an awesome job all around and I am a huge admirer of their work. It’s not a surprise that they are the number one graph database in the industry today.
At the end of the day, the two approaches are very similar. Neo4j already represents an awesome implementation of one of the models so if you want to play with graphs with the best system out there, the choice is obvious. It’d be great if the fine folks at Neo Technology considered my proposed addition to their model. Since their relationships already have identifiers, they are almost there :-)
I’d very much like see the property graph approach (perhaps with the addition above) being adopted as the way for representing knowledge graph on the Web. Stay tuned! :-)
* BTW, the RDF graph data model suffers even more since we need to reify a relationship just to add properties to it.