Data degradation in life sciences
Authored by Daniel Hickmore, VP of Health & Life Sciences @ Arkivum
The final economic fallacy is that long-term data management is temporally dynamic and path dependent. I would argue that data is certainly temporally dependent, but not path dependent. Data is constantly degrading in every temporal dimension: the past, today and the future. We are constantly trying to save and find old data, protect data today and preserve data into the future. There are a number of variables to consider, which are discussed below.
We all think about our pension, but not all the time. Most of us deal with our pension intermittently. The time and attention we give it changes as we get older, and the portfolio changes in a similar fashion. The problem is, as I discussed in previous articles, that data is not durable; it is not like a house. In research and pharma, for R&D and regulated documentation, data often appreciates in value over time. It is more like a fragile oil painting that needs constant and careful attention. To give just one example, when the BBC were digitising some of their recorded holdings by migrating onto video tapes, they found that the costs went up 5x if an original tape didn’t play back first time. The longer they waited, the higher the costs, because more tapes degraded in the meantime. Worse still, treating the casualties first, i.e. the degraded tapes, means that the ‘good stuff’ goes to the back of the queue, and the extra time spent waiting in line creates further casualties.
Unlike paper in a warehouse, no action causes digital data to become derelict
Long-term data management and preservation do not react well to benign neglect. Unlike paper in a warehouse, digital data becomes derelict when no action is taken. Traditionally, data is only considered at obvious points of vulnerability, such as a technology refresh, an application or piece of infrastructure reaching end of life, or a change in the data lifecycle. The problem is that this does not take into account slow degradation: data becomes increasingly difficult to read as file formats change, and storage media degrade over time. An approach of intermittently worrying about the data will not work. Losing half a spreadsheet is as bad as losing the whole; losing the metadata which proves integrity is as bad as losing everything else.
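One common form of the integrity metadata mentioned above is a recorded checksum per file, re-verified on a schedule rather than only at a technology refresh. The following is a minimal sketch of such a check, not a description of any particular product. The function names and the manifest layout (relative path mapped to a SHA-256 hex digest) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_failures(manifest: dict[str, str], root: Path) -> list[str]:
    """Return relative paths that are missing or no longer match the manifest.

    `manifest` is a hypothetical mapping of relative path -> recorded digest.
    """
    failures = []
    for relative_path, recorded in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != recorded:
            failures.append(relative_path)
    return failures
```

Note that the manifest itself is data that needs preserving: if it is lost, the files may survive but their integrity can no longer be demonstrated.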
Loss or degradation of the data that supports research in life sciences can greatly undermine the authenticity of that research, and the data can also be disproportionately expensive to replace. The cost often lies not in the data itself, but in the processes, equipment, resources and materials needed to replicate it. If the data is time dependent, such as clinical data on a patient, then it is irreplaceable.
Data preservation and long-term management are not linear. They form a dynamic process that needs to be managed by systems that are similarly dynamic. The data management process has to be person independent, and neutral with respect to application, cloud vendor and infrastructure, to ensure there is no lock-in. This approach means critical vulnerabilities such as organisational hand-offs, technology refreshes and other similar events can be handled with a minimum of risk.
The approach to this challenge is to incorporate these factors into a data risk register. Data can then be managed through the register, allowing for prioritisation and resource allocation.
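As a rough illustration of how a register can drive prioritisation, the sketch below scores each entry as likelihood multiplied by impact and sorts the register by that score. The scoring scheme, the field names and the example entries are all assumptions made for illustration; a real register would use whatever risk model the organisation has agreed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRisk:
    """One hypothetical register entry; both scales run from 1 (low) to 5 (high)."""
    dataset: str
    threat: str
    likelihood: int
    impact: int

    @property
    def score(self) -> int:
        # Simple likelihood x impact scoring, assumed here for illustration.
        return self.likelihood * self.impact


def prioritise(register: list[DataRisk]) -> list[DataRisk]:
    """Return entries ordered from highest to lowest risk score."""
    return sorted(register, key=lambda risk: risk.score, reverse=True)


# Illustrative entries only, not real assessments.
register = [
    DataRisk("instrument logs", "technology refresh", likelihood=2, impact=2),
    DataRisk("clinical trial records", "storage media degradation", likelihood=3, impact=5),
    DataRisk("assay spreadsheets", "obsolete file format", likelihood=4, impact=3),
]

for risk in prioritise(register):
    print(f"{risk.score:>2}  {risk.dataset}: {risk.threat}")
```

Irreplaceable data, such as time-dependent clinical data, would carry the highest impact rating in a scheme like this, so it rises to the top even when the likelihood of loss is moderate.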
In the final article we will discuss how to bring this all together and look at your long-term data management and preservation as a whole.