XLDB 2011 conference – Observations

Trip report – XLDB Conference

Venue: SLAC (Stanford Linear Accelerator Center)
Conference: XLDB 2011

Overview

I attended the XLDB 2011 conference earlier in the week. XLDB seems to be an emerging conference. There were 280-300 people in attendance, and more had to be turned away. In fact, some people who hadn't secured a place turned up anyway, hoping there would be cancellations.

My understanding is that XLDB emerged from the need to discuss technology, issues, and requirements for Big Data in the scientific community, especially when using supercomputing facilities. However, it is clear to everyone that Big Data is no longer specific to those doing supercomputing. Even though we are moving toward the exascale supercomputing era (the fastest supercomputer today can do >8 PFlops), having to deal with GBs, PBs, or even EBs of data is not a problem exclusive to those who have access to supercomputing facilities. Hence, it was not a big surprise to find that the conference attracted the interest of industry.

A number of industry leaders had a presence at the conference: Google, Microsoft, eBay, Netflix, LinkedIn, Facebook, Amazon (even though they decided at the last minute not to present), and IBM (to name a few). There were talks by scientists as well.

Most of the talks were great and I enjoyed them a lot. I chose to highlight the Netflix, Metamarkets, and Novartis talks as the driving examples for my observations. The conference organizers have promised to publish the slides and videos of the presentations.

The value of data

In my mind, the Big Data space is not a niche any more. It's not a space that any company offering enabling technologies, solutions, and services to its customers can afford to ignore. Many customers already have real problems, they already take advantage of Big Data processing infrastructures, and their competitiveness is based on their ability to extract value and insights from the data they collect.

Take Netflix, for example. Their VP of Data Science and Engineering (highlighting Data Science in the title!!!) gave an excellent talk on how Netflix won the DVD-shipping game and became competitive. It was all because of the data they collected and then analyzed. They heavily instrumented their DVD-handling equipment. Every single aspect of a DVD's route was recorded and sent to Netflix's data warehouse. Decisions about how to best route each disc had to be made within milliseconds. Data was collected and then processed in order to optimize all aspects of their business. They became so good at it that their only bottleneck was the post office. They had reached such a level of data-based business intelligence that they even went to the post office and started helping them optimize their operations.

The VP of Data Science and Engineering at Netflix was a happy person until Netflix decided to get into the streaming business. Their data collection and analysis requirements skyrocketed!

Here comes the cloud

Netflix needed to expand its ability to process data and make business decisions. They really wanted to move away from the business of managing infrastructure. They didn't want to have to deal with operations, data centers, machines, and so on. They went through a migration period of progressively moving their entire data collection and processing infrastructure into Amazon's cloud.

Granted, Netflix had to build their own pipeline based on open source technologies. They used the right tool for the job: a NoSQL solution for reliably gathering and recording their data at scale, and an RDBMS where it made sense.
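
To make the "right tool for the job" idea concrete, here is a minimal sketch of that kind of split. This is my own illustration, not Netflix's actual pipeline: raw, append-heavy instrumentation events go to a NoSQL-style store (mocked here as a simple key-to-events map), while summarized, relational data lands in an RDBMS (sqlite3 stands in for it).

    # Illustrative sketch only -- not Netflix's pipeline.
    import json
    import sqlite3
    from collections import defaultdict

    event_log = defaultdict(list)  # stand-in for a wide-column / key-value store

    def record_event(device_id, event):
        """Append one raw instrumentation event; this write pattern is what NoSQL stores absorb at scale."""
        event_log[device_id].append(json.dumps(event))

    record_event("sorter-42", {"disc": "abc123", "ts": 1318000000, "stage": "outbound"})

    # Relational side: aggregates that analysts query with plain SQL.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE daily_shipments (day TEXT, hub TEXT, discs INTEGER)")
    db.execute("INSERT INTO daily_shipments VALUES ('2011-10-18', 'san-jose', 120000)")
    print(db.execute("SELECT hub, SUM(discs) FROM daily_shipments GROUP BY hub").fetchall())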

Netflix is a big company. They can build their own data processing infrastructure from the various pieces. However, what about all those smaller companies that want to collect and process data that is critical to their growth, competitiveness, and survival? Wouldn't they benefit from cloud solutions that are scalable, reliable, and not managed by them?

Take Metamarkets as an example. They are doing predictive analytics that help advertisers around the world. Apparently, the advertising game is following that of the financial markets: advertisers need to be able to make decisions within a few seconds, and they need to analyze large amounts of data (billions of microtransactions per day) very fast.
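
Some rough arithmetic shows what that rate means per second (the 2 billion figure below is my own illustrative assumption; the talk only said "billions"):

    # Back-of-the-envelope rate for "billions of microtransactions per day".
    events_per_day = 2_000_000_000          # illustrative assumption
    per_second = events_per_day / (24 * 60 * 60)
    print(round(per_second))                # ~23,000 events/second sustained; peaks are higher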

Their needs for a very fast engine for doing almost real time analytics was not addressed by any existing solution. Metamarkets was born in the cloud and continues to operate in the cloud. They didnt have to transition to it like Netflix did. Nevertheless, they still had to build their own distributed, in-memory database (Druid) because none of the solutions they tried could meet their requirements. Given their domain of focus, thats effort that could have been avoided. Rather than focusing on infrastructure, they could have diverted their investments in offering better services to their customers. As it turned out, they managed to build a very good infrastructure that serves them well today.
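
As an aside, the core trick behind near-real-time analytics engines of this kind is pre-aggregation: roll incoming events up by time bucket and dimension so queries hit small summaries instead of raw events. The sketch below is my own toy illustration of that idea; it is not Druid's actual design or API.

    # Toy in-memory rollup by (minute, campaign) -- pre-aggregation, not Druid's implementation.
    from collections import Counter

    rollup = Counter()

    def ingest(event):
        # event: {"ts": unix_seconds, "campaign": str, "spend": float}
        minute = event["ts"] // 60
        rollup[(minute, event["campaign"])] += event["spend"]

    ingest({"ts": 1318000000, "campaign": "acme-q4", "spend": 0.002})
    ingest({"ts": 1318000010, "campaign": "acme-q4", "spend": 0.004})
    print(rollup)  # both events collapse into one per-minute summary row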

The data analytics ecosystem

Companies like Vertica provide solutions for companies like Metamarkets. The value proposition is obvious: if you want to build a service or a product that is based on or depends upon processing data at scale, you don't have to build the infrastructure yourself.

This is not about deploying a database management system. This is not about just deploying Hadoop or a NoSQL store. This is about getting a complete solution for your big data analytics needs, tailored to your specific requirements (e.g. close-to-real-time processing, batch processing, scale, cloud, etc.).

Novartis happens to concentrate on providing solutions for the genomics/life sciences community. They utilize SciDB, an array-oriented parallel database. There are many companies like Novartis out there addressing different domains. We've all heard about them and are already monitoring them. The point is that such companies are offering solutions for real customer needs today. They reuse open source technologies in order to build an ecosystem of tools and services for their customers.
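
Why an array-oriented database for genomics? Because the data is naturally a dense array (samples x variant positions) and the analyses are slices and aggregations along those dimensions. Here is a toy numpy illustration of that shape of workload; SciDB provides this style of operation as a distributed, disk-backed system with its own query languages, so this is not its API, just the idea.

    # Toy illustration (numpy, not SciDB): genotype calls as a samples x variants array.
    import numpy as np

    rng = np.random.default_rng(0)
    genotypes = rng.integers(0, 3, size=(1000, 50_000))  # 1000 samples x 50,000 variants (0/1/2 copies)

    allele_freq = genotypes.mean(axis=0) / 2              # per-variant alternate-allele frequency
    common = np.count_nonzero(allele_freq > 0.05)
    print(f"{common} of {genotypes.shape[1]} variants are common (>5% frequency)")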

In my mind, a great opportunity resides in democratizing the data analytics ecosystem by offering scalable solutions at scale; that is, solutions that meet the compute- and data-processing scalability requirements of customers while doing so for hundreds of millions of customers at the same time. An ecosystem that addresses all aspects of the Big Data space: data collection, management, processing, visualization, analysis, data mining, machine-based reasoning, and many more!

Isn't it a great time to be in the cloud + big data space? 🙂

XLDB was a great conference.