Data centre outage cost $1.6M in equipment, lost productivity

A massive failure at the Marysville Data Centre in June cost the provincial government an estimated $1.6 million in recovery and lost productivity, according to a new government report.

A new report from Ernst & Young describes a 'perfect storm' of failures

A new government report says recovery from the Marysville Data Centre outage in June cost the provincial government $1.1 million. (Courtesy mynewbrunswick.ca)

The failure of the Marysville Data Centre in June cost the province an estimated $1.6 million in lost productivity and recovery, according to a new government report.

The report points out that managers were alerted to the age and condition of critical backup power equipment, and the risks that posed, before the failure.

It also states that the New Brunswick Internal Services Agency (NBISA), the province’s IT agency, doesn’t understand the “criticality of the systems hosted in the data centre or the system recovery priorities during an event of this nature.”

Ernst & Young estimates $1.1 million was spent on recovery and $500,000 went to lost productivity after critical government computer systems crashed during a power outage.

The data centre’s backup power systems failed to protect the government’s IT centre, which houses programs for departments like health, justice, finance, public safety and government services. That caused various program outages in multiple departments.

Premier Brian Gallant said on Wednesday the report was a positive step, but should have come sooner.

Premier Brian Gallant says the report recommendations are under consideration. (Jacques Poitras/CBC)

“Obviously we would have liked to have that work done before any event would have caused the damage that it did, but at the very least we do have a strong report.”

“Unfortunately, we see that all too often in any organization. Government is no different. We could see that there might be a potential problem, but we don’t act proactively enough to make sure that we thwart that even before it began,” he added.

The final report, submitted on Oct. 29, indicated several problems with the backup power equipment.

“Following the initial electrical service interruption, the electrical system Automatic Transfer Switch (ATS), Uninterruptable Power Supply unit (UPS), on-site standby generator and the subsequently procured temporary portable generator all failed; a ‘perfect storm’ of coincidental equipment failures.”

The report said the equipment was being maintained, but the age and degree of use of certain components may have contributed to the failures.

The power outage that preceded the IT outages was caused by an osprey building a nest on a transmission line tower.

Government didn't have a recovery plan

The report repeatedly points out that the Internal Services Agency had no disaster recovery plan or overall major incident management process.

NBISA does “not have a clear understanding of the criticality of the systems hosted in the data centre or the system recovery priorities during an event of this nature,” it stated.

“NBISA does not have a disaster recovery plan for the data centre and hence had no option for restoring service other than to resolve the problems at the Marysville data centre.”

The report also states that old equipment and a lack of redundancies posed failure risks: “[The] facility was supported with a single UPS (no back-up) that was over 22 years old and approaching EOL [end of life]; notifications of this situation and its risks were made to management in advance of the incident.”

CBC News has requested an interview with NBISA or the office of the chief information officer to try to learn more about why the risks weren’t addressed when raised and about the contents of the report generally.

Serious issues faced by departments

The report points to some serious issues experienced by departments whose IT programs were housed at the data centre.

It is not clear from the report how exactly these issues may have impacted New Brunswickers directly. 

The report states that no “public safety, property, or life-threatening impacts were reported.”

  • Social workers and after-hours emergency social services staff were unable to receive alerts about unsafe conditions for children in Child Protection Services.
  • In Public Safety, the report states: “The report listing the clients due to be released from custody could not be generated.” Staff used manual files to determine who should be released from custody.
  • Neither court-ordered probation reports, nor victim impact statements, could be generated while systems were down.
  • NB Liquor, which generates “an average daily revenue of $1M in liquor sales,” couldn’t process debit or credit transactions for about two hours on the first day of the outage. It could still process cash. The review states it could not determine the lost revenue amount. NB Liquor has said it does not think the loss was significant.
  • Health “also reported that there were compliance issues with the Medicare Program related to the use of public folders (i.e. some clients did not receive feedback or follow-up within the guaranteed timeframe),” states the report.

CBC News has requested more information on the compliance issues from the Department of Health but did not receive a response. 

Disaster management plan recommended

The report recommends government develop an IT disaster management plan and says the lack of redundancy for critical systems and equipment should be addressed.

It also notes that old equipment is still in use and recommends that it be assessed and potentially replaced.

The report further calls for the creation of a data centre facilities strategy, with a goal of “improving IT services performance and reliability and reducing long-term costs for data centre services.”

The review was demanded by former premier David Alward after repeated outages and persisting IT issues across government departments, such as inaccessible files and corrupted data.

Its release follows a CBC News investigation in October that revealed more information about the scope of the problems caused by the outage. The investigation also showed that government communications officers deliberately tried to limit the information released to the public about the outage and recovery efforts.