Backup power equipment failures 'brought GNB down'
Government silent on cost of IT infrastructure failure repairs, recovery
Premier Brian Gallant says he will "get to the bottom" of what happened in June when government servers experienced two hard crashes in one morning after backup power equipment failed at the Marysville Data Centre (MDC).
The MDC houses IT systems used by the departments of Justice, Health, Public Safety, Social Development, Finance and more.
The repercussions were far-reaching, took days to repair and even prompted the involvement of then-premier David Alward.
"Our government takes the responsibility of protecting New Brunswickers’ privacy and personal data seriously," Gallant said in a statement on Tuesday, in response to a CBC News investigation about the unplanned power outage.
The government will "take steps to ensure it doesn't happen again," Gallant said.
An external review initiated by the former government is already underway, he said. "It is our intention to release findings and recommendations in as timely a way as possible."
The events began the morning of June 9. An osprey was building a nest on an NB Power transmission line.
The bird, one of hundreds that do the same thing across the province’s power grid every spring, somehow caused a short while building. Customers around the greater Fredericton area lost power, including the Marysville Data Centre.
Data centre outage impacts
- Some computer services went down for the justice system, leaving court "decisions in limbo because of documents they can't access . . . [and] people are possibly sitting in holding [cells] that could be, well, not in holding." A court decision involving some shale gas protesters was delayed because the judge couldn't access computer files.
- Major systems in the health and motor vehicle departments went down. Concern was raised that, during one of the planned outages for data centre repairs, health-care providers in the Health Department's mobile crisis unit for mental health and addictions would not have a client's history at their fingertips in critical cases.
- Service New Brunswick turned away people wanting to renew their driver's licences. Its website was also offline. Automobile dealers were unable to complete online paperwork on the Service New Brunswick system.
- Some civil servants were unable to work for hours, or even days. Overtime was needed in some cases to clear backlogs.
- NB Liquor was unable to process credit card or debit card transactions at its stores, leaving its retail operation "basically dead in the water" for hours.
The first warning came at 9:27 a.m.: “We're running on battery power for now,” wrote Trevor McDonald with the Internal Services Agency (NBISA).
“If power doesn’t return in the next 15-20 minutes, there will be a hard shutdown of systems.”
There was a problem. An Automatic Transfer Switch (ATS), which is supposed to connect the data centre to a diesel generator when street power is lost, didn’t work.
While technicians rushed to do the switch manually, the data centre was running on a giant battery pack called an Uninterruptible Power Supply (UPS). But the batteries ran out before the backup diesel generator could power the data centre.
“The government of NB mainframe was gracefully shut down before the UPS batteries hit their threshold to support the load, all other systems went down hard,” wrote a Bell Aliant operations manager at the data centre.
The switch to diesel power was made, but two hours later the data centre crashed hard again while technicians were trying to repair the ATS. This time the mainframe, too, went down hard.
'Yes. No. F--k it is bad'
IT staff worked all day and into the early morning hours of the next day to restore critical systems and corporate services, and to bring back public-facing websites once the backup power was restored.
Internal Services employees working on the restore characterized the situation with colourful language in their correspondence. “Yes. No. F--k it is bad,” responded one employee when asked if there was an ETA on recovery.
Another wrote, “Yeah … whole data centre basically puking. Power issues,” when someone inquired about their lost connection to various programs.
Among the public, too, commentary was flying. Melanie Morris tweeted, “My one day off this week I go to renew my license and Service New Brunswick’s systems are down. Hopefully I won’t get pulled over!”
She told CBC News that Service New Brunswick employees politely told her they couldn’t renew her licence because all of their systems were down.
“To get there and find out nothing is working, turn back around, come out, I was a little frustrated,” Morris said.
She added she returned the following Saturday to renew her licence.
“There were a lot of things that ended up being adjourned,” said criminal defence lawyer Alison Ménard.
On June 19, Ménard’s clients, Germain Junior Breau of Upper Rexton, N.B., and Aaron Francis of Eskasoni, N.S., were in custody after having previously pleaded guilty to some charges related to the violent fracking protests on Oct. 17, 2013.
The pair were awaiting a decision on other charges they had contested in relation to the violence, but that decision was delayed because Justice Leslie Jackson couldn’t access critical files in the case.
“It definitely leads to frustration for people who are incarcerated,” said Ménard.
On June 11, a technician was assigned specifically to help with issues hindering the work of the justice system. One email remarks:
“Yeah, and it gets worse. They have decisions in limbo because of documents they can’t access … so people are possibly sitting in holding that could be, well, not in holding. I wonder if they thought out who’s network drives should have been restored first.”
The hard crashes had a ripple effect. Data corruption in storage systems caused problems that took days to recover from in a number of departments.
A June 10 email from Christian Couturier, the province's chief information officer, states: “The bigger issues are with data potentially corrupted on the storage array. This is translating into major systems being down (in health and motor vehicle). The shared services agency is working to resolve (identify and restore data). Rough day here.”
A June 17 email from Angie Milbury, an NBISA director, states, “The purpose of this message is to explain to IT directors the approach we are taking to ensure all GNB File Server data affected by the outage is recovered.”
“In order to effectively deal with the large number of restores resulting from the outage we are finalizing a plan to do a restore of file server data from the June 6th backup…” it continued.
'You’re playing with fire'
On the morning of June 9, NB Power restored power within an hour, but in the course of the first crash, the UPS was “fried.” An email with the subject line “fuse” and an image attached says, “This is what brought GNB down.”
Records show it was soon learned that more than fuses were fried in the 25-year-old unit. Without the UPS, the MDC could not be switched back to street power. It continued to operate on diesel power for two weeks.
Emails state the rate of fuel consumption ranged from 50 to 85 litres an hour.
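For a rough sense of scale, that range can be turned into a total with simple arithmetic. The sketch below assumes the generator ran continuously for a full 14 days at a steady rate, which the records do not confirm; the function name and parameters are illustrative only.

```python
HOURS_PER_DAY = 24

def fuel_range_litres(days, low_rate_lph=50, high_rate_lph=85):
    """Return (low, high) total litres burned over `days` of continuous running.

    The default rates are the 50 and 85 litres an hour cited in the emails;
    continuous operation at a constant rate is an assumption.
    """
    hours = days * HOURS_PER_DAY
    return hours * low_rate_lph, hours * high_rate_lph

low, high = fuel_range_litres(days=14)
print(f"{low:,} to {high:,} litres")  # 16,800 to 28,560 litres
```

Under those assumptions, two weeks on diesel would have consumed somewhere between roughly 16,800 and 28,560 litres of fuel.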
Further complicating matters on the morning of June 9, after the effort it took to connect the diesel power to the MDC, the generator started having problems maintaining frequency, prompting a scramble to find a portable generator to replace the usual redundancy.
It is not clear from the records exactly what caused the backup power system to fail. Regular maintenance and testing were performed on the system. In fact, maintenance was done on the UPS the day before the outage.
“To have equipment that is 25 years old, that raises a flag right away,” said Stéphane Bertini, president of Montreal-based Zonesa.
“... if it’s only one generator [as a back up], then you’re not redundant. So basically you’re playing with fire,” said Bertini.
“Nothing is 100 per cent fool-proof, but there are ways to maximize uptime. And it’s basic ways. If you know what you’re doing it is pretty easy to guarantee uptime. But again, it’s money too. If you don’t have the money to have twice the equipment then you’re stuck. This could be a question of money.”
Bertini isn’t alone in wondering what was behind the failure.
'Significant financial and productivity impact'
Ten days after the crash, then-premier David Alward wrote to the province’s top bureaucrat, Marc Léger, the clerk of the executive council and secretary to cabinet.
Alward asked Léger to head up a committee to examine the technology failure and to figure out how to avoid anything similar in the future.
“As you are aware, the failure at Marysville Place that started June 9, 2014, had a significant financial and productivity impact on the Government of New Brunswick. I am aware that it negatively impacted services to citizens and businesses, and disrupted offices across the network,” stated Alward.
In addition to the clerk’s committee, Auditor General Kim McPherson is also examining the circumstances surrounding the outage to determine whether it should be a subject in her next report.
CBC News also learned that Ernst & Young was commissioned to study the outage and was due to deliver its draft report to the clerk’s committee Monday, but government would neither confirm nor deny that was the case.
CBC News requested both research interviews and on-camera interviews, and sent dozens of questions to seven government departments. The questions sought detail on what the outage's impacts were, how long they lasted and how they were resolved, as well as how much the outage cost taxpayers in replacement parts, overtime and lost productivity.
None of the questions were answered and all of the interviews were refused.
In a statement, Government Services spokesperson Sarah Ketcheson wrote, “At this time, an incident review of the June 9 outage is being undertaken. This is a standard procedure when an event occurs which has an impact on our IT system. Once the review is complete and government has had the appropriate amount of time to study the final report, we will be in a better position to respond to your questions.”
'Smoke and stuff'
On June 20, two days after the premier’s letter and almost two weeks after the initial outage, the diesel generator powering the MDC failed.
An 8:18 a.m. email from an NBISA employee states most services are back up again after “the portable generator went down hard this morning (smoke and stuff apparently).”
Two days later, a temporary UPS, housed in a tractor-trailer and rented from a Pennsylvania company for about $52,000 a month, was connected, allowing the MDC to be reconnected to street power.