Experiments at the European XFEL will produce a vast amount of data, all of which must be stored and made available for analysis.
Before they can use their experiments at the European XFEL to generate new insights, scientists will have to dig through an enormous amount of data. Take one of the two-dimensional pixel detectors. Each detector will deliver 10 to 40 gigabytes of data every second, enough at the top rate to fill more than seven DVDs.
Operating all six instruments will, according to current estimates, produce 10 million gigabytes (10 petabytes) of data per year, increasing to over 50 million gigabytes per year as a result of improved detector resolution. Storing 50 million gigabytes of data would require 10 million DVDs, which, if stacked on top of one another, would be 12 kilometres high. In comparison, the four experiments at the Large Hadron Collider produce about 13 million gigabytes per year.
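The DVD comparisons above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original estimates: the disc capacity (4.7 GB) and disc thickness (1.2 mm) are assumed nominal values for a single-layer DVD, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the DVD comparison.
# Assumptions (not from the article): a single-layer DVD holds 4.7 GB
# and a disc is 1.2 mm thick.
DVD_CAPACITY_GB = 4.7
DVD_THICKNESS_MM = 1.2
MM_PER_KM = 1_000_000


def dvds_needed(data_gb, capacity_gb=DVD_CAPACITY_GB):
    """Number of discs (possibly fractional) needed to hold data_gb."""
    return data_gb / capacity_gb


def stack_height_km(n_discs, thickness_mm=DVD_THICKNESS_MM):
    """Height in kilometres of n_discs stacked flat on one another."""
    return n_discs * thickness_mm / MM_PER_KM


if __name__ == "__main__":
    per_second = dvds_needed(40)        # top detector rate, 40 GB per second
    per_year = dvds_needed(50_000_000)  # 50 million GB per year
    print(f"DVDs per second at 40 GB/s: {per_second:.1f}")
    print(f"DVDs for 50 million GB: {per_year / 1e6:.1f} million")
    print(f"Stack height: {stack_height_km(per_year):.1f} km")
```

With these assumed values the results land slightly above the quoted figures. Passing a rounded capacity of 5 GB, as in `dvds_needed(50_000_000, 5)`, gives 10 million discs and a 12-kilometre stack, which suggests the comparison uses rounded numbers.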
The extremely large data volumes generated at X-ray free-electron lasers require a new way of thinking about how data is managed and analysed. At conventional X-ray labs, scientists are able to bring their own hard disk drives, copy their data onto the drives, and then do their analyses at home. At the European XFEL, the sheer amount of data will make this approach impossible.
At the European XFEL, data will be stored securely in a large disk system, exploiting technologies similar to those used by companies such as Google. Data processing services will be provided as well. Overall, the computing infrastructure will help scientists do their jobs—everything from moving samples like nanocrystals, to storing, mining, and analysing data, to visualising the results.
Some features of the envisaged data handling system at the European XFEL:
- The initial size of the storage system will be 10 million gigabytes, increasing over time to 50 million gigabytes or more.
- Lossless data compression will be applied on the fly whenever possible. For single small biological molecules, the data can be compressed to five percent of its original size. Solids, liquids, and gases do not allow such extreme compression ratios.
- Disks will be used to store raw data as well as results from scientific analysis for about one year. After that, all raw data will be moved to a tape archive for long-term storage.
- Computing clusters close to the data archive will be used to analyse the data. Estimates indicate that 2 000 processor cores per petabyte of stored data will be needed to perform scientific analysis. For 10 million gigabytes (10 petabytes), this amounts to about 20 000 cores, corresponding to about 2 000 desktop or 200 large server machines.
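The sizing figures in the list above can be worked through as follows. This is a rough sketch under stated assumptions: the function names are hypothetical, and the cores-per-machine values are not given in the estimates but are simply what the quoted machine counts imply.

```python
# Rough sizing arithmetic for the storage and compute estimates.
GB_PER_PB = 1_000_000  # 1 petabyte = 1 million gigabytes
CORES_PER_PB = 2_000   # estimate quoted in the list above


def cores_needed(storage_gb):
    """Processor cores needed for a given stored volume in gigabytes."""
    return storage_gb * CORES_PER_PB // GB_PER_PB


def cores_per_machine(total_cores, n_machines):
    """Cores each machine must provide to reach total_cores."""
    return total_cores / n_machines


def compressed_size_gb(raw_gb, percent_of_original):
    """Size after compression to a given percentage of the original."""
    return raw_gb * percent_of_original / 100


if __name__ == "__main__":
    total = cores_needed(10_000_000)  # initial 10 million GB
    print(f"Cores for 10 PB: {total}")
    # Implied by the quoted machine counts, not stated in the estimates:
    print(f"Cores per desktop: {cores_per_machine(total, 2_000):.0f}")
    print(f"Cores per server:  {cores_per_machine(total, 200):.0f}")
    # Best case quoted for single small biological molecules (5 percent),
    # applied to the top detector rate of 40 GB per second.
    print(f"Compressed rate: {compressed_size_gb(40, 5):.0f} GB/s")
```

The five-percent figure applies only to the best case named above; other sample types would compress less, so the last line is an optimistic bound rather than a typical rate.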