
One Server Broke. They Lost Everything.
Kevin Fang
434,322 views • 1 month ago
Video Summary
In October 2016, a hardware issue in the Strand Data Center at King's College London led to a catastrophic data loss event. The primary storage system, an HP 3PAR, experienced a failure in one of its controller nodes. After technicians replaced the node, the entire system imploded due to a firmware incompatibility, resulting in a complete loss of data. This incident impacted IT systems, staff files, student projects, and crucial research data.
The university's backup strategies proved woefully inadequate. Volume-level snapshots were stored on the same machine, a secondary 3PAR system held only limited replicated data, and the backup software covered only a select few applications. The tape backups, intended for long-term storage, were plagued by capacity constraints, unnecessary junk data, automation failures reported as successes, and a lack of business risk awareness.
The aftermath involved a grueling month-long restoration process, including forensic recovery and manual reassembly of data structures. While most critical services and shared drives were eventually restored, some data was irretrievably lost. The incident highlighted systemic issues within the IT department, including an overwhelming number of initiatives, infrequent senior IT meetings, and an inability to conduct essential disaster recovery tests due to the single production stack, leading to a significant impact on academic and administrative functions.
Short Highlights
- A hardware failure in the Strand Data Center's HP 3PAR system, compounded by a firmware incompatibility during a node replacement, led to complete data loss.
- The university's existing backup systems were insufficient, with snapshots stored on the same machine, limited replication on a secondary system, and backup software covering only select applications.
- Tape backups were severely flawed due to capacity issues, automation failures, and a lack of proper oversight and risk assessment.
- The incident resulted in the loss of IT systems, staff files, student projects, and vital research data, impacting multiple departments, including the Institute of Psychiatry, Psychology & Neuroscience (IoPPN).
- A complex, month-long restoration process was required, involving forensic recovery and manual data reassembly, with some data irretrievably lost.
Key Details
Hardware Issue and System Implosion [04:31]
- In October 2016, a fault was detected in the HP 3PAR system in King's College London's Strand Data Center.
- The system, comprising four controller nodes, lost one node, leaving it with three, a level it was designed to tolerate.
- Technicians replaced the faulty node, but after this replacement, the entire machine "imploded," leading to a complete loss of data.
- The cause was an incompatibility between the new firmware on the replacement node and the old firmware on the existing hardware, which caused many of the storage disks to fail.
- This failure exceeded the system's RAID protection levels, resulting in complete data loss.
- An earlier firmware update from HP could have prevented this issue, but it had not been applied by the IT team.
This section details the initial hardware failure and the subsequent catastrophic event triggered by a firmware incompatibility during a node replacement, ultimately leading to total data loss.
But it turns out the technicians weren't so friendly after all, because for some reason, after the faulty node was replaced, the entire machine imploded on itself, leading to a complete loss of data.
Backup Strategy Flaws [00:39]
- The university's primary storage system was a single HP 3PAR used for IT systems, staff files, student projects, and research data.
- While there were redundant systems and backups, the implementation was flawed.
- Volume-level snapshots were taken but saved in a different provisioning group on the same machine, offering no protection against hardware failure.
- Backup software called VH was used for some systems, but not all data was backed up to it.
- Tape backups were used for long-term storage, but with slow read/write times and uncertainty about what data was actually on them due to capacity constraints.
- There was no evidence of disaster recovery tests being conducted.
This section outlines the questionable backup practices in place, highlighting how seemingly robust systems and procedures were inadequate for true data protection.
The backup was still on the same machine, which was the IT equivalent of your grandmother securing her files by saving a copy in a different folder somewhere else on her desktop.
Understanding Parity (PAR) [06:03]
- Parity, explained through XOR operations, is a method for data redundancy.
- XOR (exclusive OR) is a Boolean function whose output is true when exactly one of its two inputs is true.
- Key properties of XOR: identity (any value XORed with zero equals itself) and self-inverse (any value XORed with itself equals zero).
- In a parity system, a parity chunk (P) is calculated by XORing multiple data chunks (e.g., A, B, C).
- If a data chunk is lost (e.g., B), it can be reconstructed by XORing the remaining data chunks (A, C) with the parity chunk (P).
- This is similar to how RAID systems work, but if failures exceed the protection level (e.g., both A and B fail), data loss occurs.
This section provides a technical explanation of how parity works as a data protection mechanism, likening it to RAID functionality and illustrating its potential for data recovery or loss.
Chunk B is equal to the XOR of the other chunks used to calculate the parity and the parity itself.
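To make the XOR arithmetic above concrete, here is a minimal Python sketch (not from the video; the helper name xor_bytes and the chunk contents are illustrative) that builds a parity chunk from three data chunks, rebuilds a lost chunk from the survivors, and shows why a second simultaneous loss exceeds what a single parity chunk can recover.

```python
# Minimal sketch of single-parity (XOR) protection, assuming three equal-length
# data chunks A, B, C and one parity chunk P.

def xor_bytes(*chunks: bytes) -> bytes:
    """XOR any number of equal-length byte strings together, byte by byte."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

A = b"chunk-A!"
B = b"chunk-B!"
C = b"chunk-C!"

# Parity chunk: P = A XOR B XOR C.
P = xor_bytes(A, B, C)

# Lose B, then rebuild it from the survivors plus the parity.
# B = A XOR C XOR P, because XOR is self-inverse (x ^ x = 0)
# and zero is the identity (x ^ 0 = x).
assert xor_bytes(A, C, P) == B

# Lose A *and* B: one parity equation is no longer enough. XORing the
# survivors gives only A XOR B, with no way to separate the two values.
assert xor_bytes(C, P) == xor_bytes(A, B)
```

This mirrors the RAID behaviour described above: one failure per parity group is recoverable, but a second simultaneous failure leaves the data unrecoverable.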
Data Restoration and Lingering Issues [08:53]
- Following the data loss, critical services like timetabling, payroll, student information, and library systems were restored within days.
- Restoration of shared drives took longer, with J, U, and R drives being brought back online by November 11th. The fate of the "N" drive remains unclear.
- Even after the restoration onto new infrastructure with improved backup technologies, some faculty, having lost faith, began storing work on personal devices.
- The incident revealed deep-seated issues at the management level, including an overloaded IT team and infrequent senior IT meetings.
- There was a lack of willingness to conduct full-scale disaster recovery drills due to the risk of downtime caused by the single production stack.
- Despite improvements to the backup system, a full recovery test had not been executed as of the time of the report.
This section details the extensive efforts to recover lost data and restore services, while also highlighting the lingering distrust and systemic problems that persisted within the institution's IT management.
By the 11th, the R drive was restored, or at least whatever was possible.