Astronomical Data Processing Challenges
This past week I was at the Wolfram Data Summit 2010, a gathering of “data processing folks”. Wolfram, of course, is responsible for both Mathematica (of which version 8 is coming very soon) and the WolframAlpha “computational engine”. One thing that struck me was that most of the presenters have statistics representative of large populations – yet their data ends up being a small number of actual data points. One presentation, however, stood out as being exactly the opposite. And, ironically, it was on astronomical data!
Alberto Conti, from the Space Telescope Science Institute, offered a colorful presentation of some of the problems facing astronomers today. Unlike many disciplines, he asserted, “astronomical data has no financial value”, leaving most businesses with little interest in tackling its problems. And those problems are big! (“As big as the universe!”)
In particular, a modest-resolution color image of “space” would require nearly 100TB of storage. Per image! And, as you can imagine, the amount of data the STSCI has is enormous. Much of his talk highlighted the Hadoop model of “moving the computation to the data”, since moving the data around is itself the challenge. That is, let researchers send in their formulas and have a computational cluster run them against the archived data.
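To make the idea concrete, here is a toy sketch of “moving the computation to the data” – all names (`submit`, the shards) are hypothetical, standing in for whatever job-submission hook such a cluster would expose. The researcher’s function runs next to each data shard, and only the tiny partial results travel over the wire, never the 100TB images:

```python
# Toy sketch of the "move the computation to the data" model.
# DATA_SHARDS and submit() are hypothetical stand-ins for a real cluster.

DATA_SHARDS = {
    "node_a": [2.1, 3.5, 4.0],  # pretend each node holds part of the archive
    "node_b": [1.9, 2.2],
}

def submit(func):
    """Run the researcher's function on each node, beside its local data,
    and ship back only the small per-node results."""
    return {node: func(shard) for node, shard in DATA_SHARDS.items()}

def per_shard_stats(shard):
    # The "formula" a researcher sends in: here, sum and count of pixel values.
    return (sum(shard), len(shard))

partials = submit(per_shard_stats)
# Combine the tiny partial results locally into a global mean.
total, count = map(sum, zip(*partials.values()))
overall_mean = total / count
```

The point of the pattern is visible in the last two lines: what crosses the network is a pair of numbers per node, not the raw data.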
I had the pleasure of sitting down at lunch to get a better feel for the solutions to these immense processing problems. One was a “SETI”-like distributed “freeware” application. Some problems, such as image processing, might work well there. However, the “time differential” problems (comparing the same section of sky over time) could be more difficult. The issue here is that the image data is often taken at different resolutions and different scales, making it challenging simply to match up each frame – and one might have to keep going back for more of those “big data chunks” to continue.
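A minimal sketch of why matching frames is the hard part (this is purely illustrative, not STSCI’s actual pipeline): before two exposures of the same sky region can be differenced, they must be resampled onto a common pixel grid. Nearest-neighbour resampling is the simplest possible stand-in for the real resolution- and scale-matching machinery:

```python
# Illustrative only: nearest-neighbour resampling to a common grid,
# then a pixel-wise difference of two exposures of the same region.

def resample(image, out_h, out_w):
    """Resample a 2-D list-of-lists image to (out_h, out_w) pixels."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def difference(frame_a, frame_b, size=(4, 4)):
    """Bring both frames onto one grid, then subtract pixel-wise."""
    h, w = size
    a, b = resample(frame_a, h, w), resample(frame_b, h, w)
    return [[a[r][c] - b[r][c] for c in range(w)] for r in range(h)]

# Two "exposures" of the same patch, taken at different resolutions.
coarse = [[1, 2], [3, 4]]                    # 2x2 pixels
fine = [[1, 1, 2, 2], [1, 1, 2, 2],
        [3, 3, 4, 4], [3, 3, 4, 4]]          # 4x4 pixels, same scene
diff = difference(coarse, fine)              # all zeros: nothing changed
```

Even this toy version shows the cost structure: every comparison requires pulling both full frames to a common representation first, which is exactly the “keep going back for more big data chunks” problem.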
Another area was the STSCI’s data interface. A quick visit to the STSCI’s Data Archive reveals a clunky “circa 2000” web interface with no programmable APIs or visualization tools. High on the list of things to improve is this interface, to increase overall data availability. The thought here is simple: improved visibility and access to data will let researchers operate more efficiently. And, as the web has shown time and time again, open access breeds tinkerers who come up with innovative solutions – at no development cost to the API owner. I certainly look forward to seeing the next generation of exposed tools here; it should be fun to tinker with!