Data formats – The adventures of reading a Flow Cytometry data file

I try to spend a few hours a week at the Armbrust Lab, trying to understand what the scientists there are trying to do and how their work could be helped by technology-based automation solutions, especially in the area of data management. The week before, I sat in on their weekly meeting, where Francois Ribalet talked about the problem he was facing in trying to process a number of Flow Cytometry data files rather than just one. I thought… well, that’s a very easy problem to solve, right? That could be an easy-to-build solution… read the data, move it to Excel or SQL Server (or MySQL or Oracle… it doesn’t matter), and then allow Francois to ask any type of question over the entire set of data.

Right… on Friday I started the coding exercise. I found the Flow Cytometry Standard and opened Visual Studio. Oh my!!!

I had completely forgotten that we were exchanging data using such formats not so many years ago. Binary format, byte offsets, bit order significance, custom encodings for matrices, etc. etc. etc.
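To give a flavour of what "byte offsets" means here, below is a minimal Python sketch (not the C# code from this exercise) of reading the fixed-width header of an FCS file. It assumes the FCS 3.0 layout as I understand it: a 6-byte version string, 4 spaces, then six 8-byte, right-justified ASCII numbers giving the start and end offsets of the TEXT, DATA and ANALYSIS segments. The function name `parse_fcs_header` and the dictionary keys are made up for illustration.

```python
# Hypothetical helper: field positions are an assumption based on the
# FCS 3.0 header layout (58 bytes in total).
HEADER_LENGTH = 58
OFFSET_NAMES = [
    "text_start", "text_end",
    "data_start", "data_end",
    "analysis_start", "analysis_end",
]


def parse_fcs_header(header: bytes) -> dict:
    """Split the first 58 bytes of an FCS file into version and offsets."""
    if len(header) < HEADER_LENGTH:
        raise ValueError("header must be at least 58 bytes long")

    result = {"version": header[0:6].decode("ascii")}

    # Each offset is an 8-character ASCII field starting at byte 10.
    for index, name in enumerate(OFFSET_NAMES):
        start = 10 + 8 * index
        field = header[start:start + 8].decode("ascii").strip()
        # A blank field is treated as 0 (segment not present).
        result[name] = int(field) if field else 0

    return result


# Build a synthetic header instead of reading a real file.
sample_header = b"FCS3.0    " + "".join(
    f"{n:>8}" for n in (58, 1023, 1024, 9999, 0, 0)
).encode("ascii")
print(parse_fcs_header(sample_header))
```

With a real file you would pass in the first 58 bytes, for example `open(path, "rb").read(58)`, and then seek to the returned offsets to read each segment.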

Not long after I started writing the code to process the sample file I was given, I realized what a huge reduction in keystrokes XML must have brought to the world. Not because it’s text-based. The common data model and the common data processing rules have enabled an ecosystem of tools that make it trivial to process XML data files without having to write custom code. Applications only have to worry about interpreting the data rather than reading/writing/navigating it, with semantic computing technologies trying to automate that part as well.

The code is now finished. Lots of custom, non-reusable C# code (following good class design, writing good comments, etc.) in order to process the header segment. I’ve even had to come up with a regular expression in order to tokenize the header segment of the standard (I love trying to come up with regular expressions, even though I am probably not good at it). Here it is… (just in case someone wants to process a Flow Cytometry data file 🙂)

/\$((?<variable>[\w-[\d]]+)|(?<variable1>[\w-[\d]]+)(?<parameter>\d+)(?<variable2>[\w-[\d]]*))/(?<value>[^/]+)
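The expression above uses .NET-only syntax (character class subtraction, `[\w-[\d]]`). For anyone not working in C#, here is a rough Python port as a sketch. It assumes, as the original pattern does, that `/` is the delimiter and that values never contain a `/`. The `tokenize` function and the shape of its output are my own illustration, not part of the original code.

```python
import re

# Python port of the .NET pattern: [\w-[\d]] (word characters minus digits)
# is written as [^\W\d], and (?<name>...) becomes (?P<name>...).
KEYWORD_RE = re.compile(
    r"/\$(?:(?P<variable>[^\W\d]+)"
    r"|(?P<variable1>[^\W\d]+)(?P<parameter>\d+)(?P<variable2>[^\W\d]*))"
    r"/(?P<value>[^/]+)"
)


def tokenize(text):
    """Return a list of (keyword, parameter_number_or_None, value) tuples."""
    tokens = []
    for match in KEYWORD_RE.finditer(text):
        if match.group("variable") is not None:
            # Plain keyword such as $TOT: no parameter number.
            tokens.append((match.group("variable"), None, match.group("value")))
        else:
            # Numbered keyword such as $P1N: keep the number separately too.
            keyword = (
                match.group("variable1")
                + match.group("parameter")
                + match.group("variable2")
            )
            tokens.append(
                (keyword, int(match.group("parameter")), match.group("value"))
            )
    return tokens


# Made-up sample text, not taken from a real data file.
sample = "/$TOT/10000/$PAR/2/$P1N/FSC/$P1B/16/"
for token in tokenize(sample):
    print(token)
```

Keeping the parameter number as its own field makes it easy to group all the `$Pn…` keywords that describe the same parameter.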

I loved the process but surely we shouldn’t have to do this for every data file. No matter whether you love or hate XML, we all have to appreciate the benefits in productivity that it has brought us.

Now, on to reading the actual data and building the solution for Francois.
