savas parastatidis

Data formats – The adventures of reading a Flow Cytometry data file

2008-05-11

I try to spend a few hours a week at the Armbrust Lab, trying to understand what the scientists there are trying to do and how technology-based automation could help their work, especially around data management. The week before, I sat in on their weekly meeting, where Francois Ribalet talked about the problem he was facing in trying to process a number of Flow Cytometry data files rather than just one. I thought… well, that’s a very easy problem to solve, right? That could be an easy-to-build solution… read the data, move it to Excel or SQL Server (or MySQL or Oracle… it doesn’t matter), and then allow Francois to ask any type of question over the entire set of data.

Right… on Friday I started the coding exercise. I found the Flow Cytometry Standard and opened Visual Studio. Oh my!!!

I had completely forgotten that we were exchanging data using such formats not so many years ago. Binary format, byte offsets, bit order significance, custom encodings for matrices, etc. etc. etc.
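For a flavour of what "byte offsets" means here, the following is a minimal Python sketch (not the C# code from this exercise) of reading the fixed-width HEADER segment. It assumes the FCS 2.0/3.0 layout as I understand it: a 6-byte version string, 4 spaces, then six right-justified 8-byte ASCII integers giving the start and end offsets of the TEXT, DATA, and ANALYSIS segments. The function name `parse_fcs_header` and the dictionary keys are hypothetical.

```python
def parse_fcs_header(header: bytes) -> dict:
    """Parse the first 58 bytes of an FCS file into a version and six offsets.

    Assumed layout: bytes 0-5 version, 6-9 spaces, then six 8-byte
    ASCII integer fields (right-justified, space-padded).
    """
    if len(header) < 58:
        raise ValueError("FCS HEADER segment needs at least 58 bytes")

    result = {"version": header[0:6].decode("ascii")}

    names = [
        "text_start", "text_end",
        "data_start", "data_end",
        "analysis_start", "analysis_end",
    ]
    for i, name in enumerate(names):
        start = 10 + 8 * i
        field = header[start:start + 8].decode("ascii").strip()
        # Assumption: a blank field is treated as 0 (segment absent).
        result[name] = int(field) if field else 0
    return result


# Hypothetical header built in memory, purely for illustration.
sample = b"FCS3.0" + b"    " + b"".join(
    f"{n:>8}".encode("ascii") for n in (58, 1023, 1024, 5119, 0, 0)
)
info = parse_fcs_header(sample)
```

With a real file, the same function could be fed the result of `open(path, "rb").read(58)`; the offsets then say where to seek for the TEXT and DATA segments.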

It was not long after I started writing the code to process the sample file I was given that I realized the huge reduction in keystrokes that XML must have brought to the world. Not because it’s text-based. The common data model and the common data processing rules have enabled an ecosystem of tools that make it trivial to process XML data files without having to write custom code. Applications only have to worry about the interpretation of data rather than reading/writing/navigating it, with semantic computing technologies trying to automate that part as well.

The code is now finished. Lots of custom, non-reusable C# code (following good class design, writing good comments, etc.) in order to process the header segment. I’ve even had to come up with a regular expression in order to tokenize the header segment of the standard (I love trying to come up with regular expressions, even though I am probably not good at it). Here it is… (just in case someone wants to process a Flow Cytometry data file 🙂)

/\$((?<variable>[\w-[\d]]+)|(?<variable1>[\w-[\d]]+)(?<parameter>\d+)(?<variable2>[\w-[\d]]*))/(?<value>[^/]+)
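As a rough illustration of how the expression can be used, here is a minimal Python sketch rather than the original C# code. Two things in it are assumptions made for the port: .NET's character-class subtraction `[\w-[\d]]` is rewritten as `[^\W\d]` because Python's `re` module has no subtraction syntax, and `(?<name>…)` becomes `(?P<name>…)`. The `tokenize` helper and the sample string are hypothetical, and like the original pattern the sketch assumes `/` is the delimiter and never appears inside a value.

```python
import re

# Python port of the .NET pattern: [^\W\d] means "word character, not a digit".
FCS_KEYWORD = re.compile(
    r"/\$((?P<variable>[^\W\d]+)"
    r"|(?P<variable1>[^\W\d]+)(?P<parameter>\d+)(?P<variable2>[^\W\d]*))"
    r"/(?P<value>[^/]+)"
)


def tokenize(text):
    """Return (keyword, parameter_index_or_None, value) tuples."""
    tokens = []
    for m in FCS_KEYWORD.finditer(text):
        if m.group("variable") is not None:
            # Plain keyword such as $BEGINDATA
            tokens.append((m.group("variable"), None, m.group("value")))
        else:
            # Numbered keyword such as $P1N: report it as PnN with index 1
            name = m.group("variable1") + "n" + m.group("variable2")
            tokens.append((name, int(m.group("parameter")), m.group("value")))
    return tokens


# Hypothetical fragment of keyword/value text, purely for illustration.
sample_text = "/$BEGINDATA/1024/$P1N/FSC-H/$P2R/1024/"
tokens = tokenize(sample_text)
```

Numbered keywords such as `$P1N` come back as the generic name `PnN` plus the parameter index, which makes it easier to group all the per-parameter settings together afterwards.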

I loved the process but surely we shouldn’t have to do this for every data file. No matter whether you love or hate XML, we all have to appreciate the benefits in productivity that it has brought us.

Now, on to reading the actual data and building the solution for Francois.