Thanks to a public disclosure request and some helpful folks at SPU’s public information office, I got copies of the monthly reports of Tim Almond, the QA consultant for the troubled NCIS project. NCIS is the new billing and customer-service IT system for Seattle City Light and Seattle Public Utilities that has been under development for the last two and a half years, and is now a year late and $43 million over its original budget. Almond’s monthly reports paint a harrowing picture of a project that went wrong early, often, and predictably.
Four key mistakes made early on set the stage for a large, dramatic failure:
Governance. For most of its life, the project was run by a six-person Executive Steering Committee:
- Jeff Bishop, SCL CFO
- Dirk Mahling, SCL IT Director
- Kelly Enright, SCL Customer Care Director
- Melina Thung, SPU Deputy Director Finance and Administration
- Tom Nolan, SPU IT Director (now with Seattle IT)
- Susan Sanchez, SPU Deputy Director, Customer Service
The project was launched in late 2013; it finally had a charter in April of 2014, and a governance model for decision-making in June. But the larger point is that the project didn’t have a leader; it had six — many of whom had little or no experience managing software projects.
Under a long-standing agreement between SCL and SPU, the two utilities share a call center and a billing system: SPU runs the call center, and SCL runs the billing system. So the SCL folks are the only ones with any significant history with the current billing system, and Dirk Mahling was the only one who combined that with any IT experience. Almond’s monthly reports confirm that Mahling’s expertise was key to decision-making, right up until he resigned in March of this year (he’s now Vice President of Technology for Alliant Energy).
Another steering committee member, Jeff Bishop, resigned from SCL the previous month (he is now Senior VP, CFO and Treasurer at GridLiance).
Mahling and Bishop had excellent timing, given that the news of NCIS’s problems became public just after they left. They left behind a steering committee with only one person from SCL. In April, four new members were added to the Executive Steering Committee:
- Mike Haynes, SCL General Operations and Engineering Officer;
- Paula Laschober, SCL Interim CFO;
- Daniel Key, SCL Interim IT Director;
- Ray Hoffman, SPU Director.
It’s particularly notable that Hoffman decided to join the steering committee and take a more direct oversight role over what has become a huge headache for him, especially since SCL, not SPU, is supposed to be managing the billing system. But the point remains: you can’t run a project like this with an eight-person committee, especially when few of its members understand IT project management. And the mistakes they made demonstrate exactly why you don’t do this.
Ambition. The customer service and billing system is the centerpiece of their operation, and it needs to exchange information with 40 other applications, both internal and external. It supports account management for more than 400,000 customers, is used by more than 600 city employees, and produces 15,000 bills daily. It’s no exaggeration to say that this is an absolutely critical piece of IT infrastructure, and their largest and most complex. And they decided to replace it all at once. This is the kind of project that would give seasoned IT managers nightmares, even at companies like Microsoft or Amazon. The challenge of updating interfaces to 40 legacy applications alone is reason enough to look for ways to scale the project back, or at least divide it into manageable pieces that can be taken in stages, because risk goes up exponentially with size and complexity. Replacing it all at once should have been the “last resort” option.
They wrote the budget and the schedule before they wrote the specification. When they defined the initial $66 million budget, they had only an initial list of business requirements, which was revised significantly over the following year. It wasn’t until February of 2015 that the project scope was finally agreed to, and design specification documents were still being written for months after that.
The only way to meet a budget and schedule that were defined before you know what you need to build is by committing to build the wrong product. The budget and schedule overruns were inevitable from the beginning.
They understaffed the team. It was decided early on that project design and implementation would be split between SCL’s internal IT resources and an outside contractor (PricewaterhouseCoopers, aka PwC). But the PwC resources were limited by budget, and the SCL IT resources weren’t enough to cover everything required of them, including reviewing PwC’s work, writing specifications, and testing. As the months went by and holes showed up, they asked SCL IT staff to do additional work, and contracted with PwC to do extra work (budget be damned). Of course, since they weren’t adding SCL IT staff, just shifting them to different assignments, that didn’t actually give them more capacity to get work done; it just put them officially further behind and put more pressure on the staff.
But getting the project off to a bad start wasn’t enough; they made several ongoing mistakes:
They bet on technologies that their staff had no experience in. The first was to adopt PwC’s Transformation Design process for revising business processes; the SCL IT staff were unfamiliar with it, and some of them struggled to get on board. The second was a prolonged debate about which underlying software package to use for printing bills and other documents. Almond identified this as an issue as early as January 2014, since bill printing is a critical, highly visible and complex component that needs to be right on day one, and the old system used a component that was no longer supported. The Executive Steering Committee decided in April of 2014 to have the SCL IT staff build bill and document printing on top of Oracle’s document production system, then in May reversed themselves, choosing to stick with the old, unsupported component and try to negotiate an extension of support for it. As of September 2014, bill print still had no scope specified and no schedule; that finally got fixed in October. By December it was already weeks behind schedule; in January 2015 they started cutting features. Things got better for a while: in February they printed their first bill, and over the next few months they made significant progress. But by July they had discovered significant defects in the underlying print component, and by August they had a list of over 30 defects, with the time to get a fix for each of them averaging over 30 days. In October it was 42 defects, significant enough that the team had still not managed to do any end-to-end testing of the billing system. In February of 2016 the print component still had 24 open defects; as of the end of April there were 10, and three weeks ago there were still 5. And to be clear, those aren’t bugs in the bills; those are bugs in the underlying component that the bills are built on top of. The bills themselves still can’t be completely tested until the underlying component is fixed.
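To make the layering point concrete, here’s a minimal sketch in Python, using invented names (this is not the actual NCIS code or Oracle’s API), of why a defect in the underlying print component blocks testing of the bills built on top of it: the bill layout can be entirely correct, yet it can’t be exercised end to end while the component underneath it fails to render.

```python
# Hypothetical illustration only: a bill layer sitting on top of a vendor
# rendering component. Names and behavior are invented for this sketch.

class PrintComponentError(Exception):
    """Stands in for an open defect in the underlying document-production layer."""

def render_document(text: str) -> bytes:
    # Pretend this is the vendor's rendering engine. While it has open
    # defects, it fails on some inputs through no fault of the bill layout.
    if not text.isascii():  # an invented defect trigger, for illustration
        raise PrintComponentError("renderer defect: cannot handle this input")
    return text.encode("utf-8")  # placeholder for real PDF bytes

def produce_bill(account_id: str, amount_cents: int) -> bytes:
    # The utilities' own layer: the content and layout of the customer bill.
    body = f"Account {account_id}\nAmount due: ${amount_cents / 100:.2f}\n"
    # Any test of this function is really also a test of render_document();
    # while that lower layer is broken, the bill above it can't be fully verified.
    return render_document(body)

if __name__ == "__main__":
    print(len(produce_bill("0012345678", 12500)), "bytes rendered")
```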
But that’s not all: management made a decision to re-architect the interfaces between the NCIS system and those 40 other applications to a state-of-the-art design known as “Service Oriented Architecture,” or SOA. That’s a perfectly good decision to make, except for two things: it added significantly more work and complexity to a system that was already too big, too much work, and too complex; and the SCL IT staff had no experience with it. From Almond’s August 2014 report:
This technology is new to SCL and SPU and as such there are few resources that are conversant with either the establishment of the technology itself… or implementation of specific interfaces needed to support the NCIS project. Resources are in short supply (in the general market) and are expensive. In addition, the establishment of the framework itself can be expensive.
Over time, several pieces of technical work involving SOA that were originally assigned to SCL IT staff needed to be reassigned to PwC staff, because PwC’s people were the only ones with the technical knowledge and skills to do the work. I should clarify that this is in no way a criticism of the SCL IT staff; new technologies get adopted all the time, and IT staff are constantly required to learn them. But you build into the schedule the time for the staff to train up, and in this case management didn’t do that. So in August 2014 Almond was flagging as an issue that SCL IT staff weren’t capable of doing the work assigned to them, and by September management was scrambling to outsource some of it to PwC. More time, more money, more fumbling around.
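To make the SOA concept concrete, here’s a minimal sketch in Python of the idea behind a service-oriented interface. The service name, fields, and values are hypothetical, not taken from the NCIS design documents; the point is simply that consuming applications code against a published contract instead of reaching directly into the billing system’s internals.

```python
# Hypothetical sketch of a service-oriented interface: the 40 consuming
# applications depend on a published contract, not on the billing database.
# All names and values here are invented for illustration.
from dataclasses import dataclass
from datetime import date

@dataclass
class AccountBalance:
    account_id: str
    balance_cents: int
    due_date: date

class BillingService:
    """The published contract that consumers (outage management, GIS,
    the customer web portal, etc.) would code against."""

    def get_account_balance(self, account_id: str) -> AccountBalance:
        # In a real SOA deployment this would be a SOAP/REST call routed
        # through a service bus; here it is stubbed with fixed data.
        return AccountBalance(account_id=account_id,
                              balance_cents=12_500,
                              due_date=date(2016, 9, 30))

# A consumer depends only on the contract, so the billing system's internals
# can change without breaking the downstream applications.
balance = BillingService().get_account_balance("0012345678")
print(f"${balance.balance_cents / 100:.2f} due {balance.due_date}")
```

The payoff is decoupling; the cost, as Almond points out, is that someone has to design, build, and operate that service layer, which is expensive and is exactly the expertise that was in short supply.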
They didn’t have top-down discipline on managing external dependencies. Beyond the issues with the bill print system, each of those 40 applications NCIS needed to work with was an external dependency. Some are owned and maintained by the SPU and SCL staff working on NCIS; some by other SPU and SCL staff; some by other city departments; and some by outside companies such as Waste Management, the city’s garbage and recycling contractor. The NCIS team needed to have constant communication with all of those teams: to exchange technical information and data; to sync up on project schedules; to cooperatively debug and fix defects; and to plan joint testing activities such as their “dry run” exercises. But the communication between teams has been inconsistent. That came to a head during the “day in the life” (DITL) and “dry run” (DR) tests, where the NCIS team tried to stage the full system with all parts running, and several external teams either declined to participate or were not ready to. From Almond’s December 2015 report on Dry Run 2 (DR2):
The assessment for DR3 in late January was only slightly better (from Almond’s February 2016 report):
It’s clear from reading Almond’s reports that the relationships between the NCIS team and some of the other teams are not strong. But it’s even more clear that the SCL and SPU executives have not communicated that NCIS is a make-or-break top priority for the organizations, the partner departments, and the contractors, and that everyone needs to be contributing to its success. As of three weeks ago, the interfaces with the Outage Management System, the GIS system, and Waste Management were still high risk.
They didn’t have discipline around project management of their schedule. IT projects are broken down into milestones: requirements, specifications, implementation, and increasingly rigorous forms of testing. The end of a milestone is critically important, because if planned correctly it brings the effort to a point of stability; the next milestone then begins cleanly, and it’s much easier to tell how much work still needs to be done. But because the team was understaffed and the schedule was set before the specifications were written, there was tremendous pressure to keep moving forward into the next milestone even if the last one wasn’t complete. The team succumbed to this pressure with nearly every milestone: they began implementation before the specs were done, their first implementation milestone ran on into the second one, and their testing phases all merged together. In May of 2015 it was clear they would not make their original October “go live” date, but they couldn’t predict the new one. In June the project management team suggested a new date, and the Executive Steering Committee rejected it and told them to look at an earlier one (a red flag that should have had people running for the exits). In July they agreed on April 2016 as their new target go-live date, but it was no better than a guess, because without ever stabilizing at the end of a milestone they had no clear idea of how much work truly remained. That became obvious in the months that followed: while they were officially in the “integration testing” phase, several components were still being implemented. Integration Testing phase 1 (IT1) rolled into phase 2 (IT2), and IT2 rolled into “Day in the Life” (DITL) testing, though it was hardly that; Almond noted that it could best be described as “IT2 phase 2.”
The chart of test results makes it clear how dire the situation was.
The lumpiness is typical of testing phases: just after a bunch of components are added, none of the tests work because the system is broken; then, as the integration problems start to get fixed, the tests can be run but many of them fail; then, as bugs get fixed, the fails (in red) change to passes (in blue). But what this graph shows is that major functionality is still being integrated throughout the fall and winter, and the system never stabilizes. In fact, it completely breaks several times, leading to complaints that the test team has no work to do because none of their tests will execute.
Here’s how the test results chart looked by the middle of March:
There’s a lot of good news here: the core system had stabilized, and tests were running and largely passing. But huge problems still remained for the project: there were still 165 serious defects to be fixed, and a large number of tests had still not been executed even once, making it unclear how many bugs were yet to be found. Plus, integration of some of the external applications still was not complete. And there was one area where implementation was still happening: the “Customer Self Service” (CSS) portal, i.e. the web site where customers check their own bills, make payments, and take care of other account management tasks. From Almond’s February 2016 report:
Translated into English: the CSS implementation is so far behind (work on it didn’t start until September 2014) that they had to break it up into three parts. The first, delivered on “go live” day, will only work for a small number of customers, and the rest of us will have to continue using the old interface on a “snapshot” of data that may not be correct; the second, a month later, will give basic functionality to everyone but will still be less functional than what SCL and SPU users have today; and the third, in “early 2017,” will allow customers to log in once to see both SCL and SPU bills (as they can do today).
As of last month (May), some things had continued to get better, including stability of the core part of the system. But the peripheral components continue to be trouble — and this is where their lack of discipline on external dependencies comes back to bite them again. From Almond’s May report:
He goes on to describe the state of testing:
And more:
Do they have a chance of making their September 2016 “go live” date? Yes, they do. But if they do make it, what goes live will be ugly from the customers’ point of view. The CSS portal won’t be live yet, bills may not print, some of the 40 external applications may not work correctly, and call volumes at the Call Center will be high. Speaking of the Call Center:
They made a mess of the plan to train staff on the new applications. In December of 2015, the team began training sessions for the 600 staff who would use the new system, including Call Center staff. To facilitate that, they set up a training environment so staff could practice. But since at that time almost nothing worked, the training system was largely unusable. They decided not to update it with newer builds, so that it would remain “stable and predictable.” By February 2016 management was hearing extremely negative feedback on the training sessions, including that the training was focused on teaching people about NCIS rather than about how to do their jobs; when you redesign business processes as part of an IT system overhaul, as they have done extensively with NCIS, you need to re-teach people how their jobs are done. They had also forgotten to plan for keeping the call center running while they pulled staff out for training: in December only 16% of calls were answered within one minute (their target is 80%). By February the call wait times were in excess of 10 minutes, and the call abandonment rate was at 33% (their target is 5%). In March, the abandonment rate grew to 44%. In April they finally started to get the call center back under control, by slowing down the training schedule and putting in place a plan to hire an additional 7 customer service representatives; the abandonment rate dropped back down to 9%. In May they expanded the plan to add 13 temporary positions and 5-6 more customer service representatives, and rolled out revised training materials that are more “procedural.”
They don’t have a plan for what happens post-launch. After NCIS goes live, things will break and there will need to be technical support available to fix them. There will also be requests to add new features, and the deferred work on CSS will continue until at least December. And of course there will need to be a governance model that oversees all of this, including the allocation of technical resources to do the work, schedule, budget, training, and rollout. As of the end of May, none of that is in place.
The ESC has a “stop-gap” plan for technical support using SCL staff, but with so much work having been done by PwC over the course of the project, it’s clear that the SCL staff don’t have the technical expertise to fix everything that might break. They have issued an RFP for an outside contractor to provide technical support services and have received responses, but as of the end of May had not made a decision on whether to move forward.
What complicates matters further is the creation of the new, centralized Seattle IT department under Michael Mattmiller. The Mayor has indicated that he expects nearly all IT staff across the city departments to move into Seattle IT. That raises the question of which department becomes responsible for NCIS. Since Seattle IT will have the personnel, it is the most likely candidate to own it going forward, but that blows up the long-standing agreement between SCL and SPU under which SPU runs the call center and SCL runs the billing system. The three organizations have yet to announce a decision as to who will be responsible for which pieces: systems, personnel, and budget.
They haven’t learned how to be honest and transparent about the project. On April 1st of this year, SCL General Manager Larry Weis, SPU General Manager Ray Hoffman, and Seattle IT head Michael Mattmiller issued a joint letter to the City Council (and the Mayor put out a concurrent press release) saying that the NCIS rollout was being delayed “while an exhaustive final check is conducted to make sure the system functions properly.” On May 10, several members of the Executive Steering Committee appeared before the City Council and explained that in their third “dress rehearsal” at the end of January they had some problems with the conversion process and were still struggling with some of their testing, and that was why they decided to delay the launch. They were a little more honest in their slides, though with a decidedly positive spin and at such a high level as to obscure the fundamental issues:
They also blamed the first schedule delay, in July of 2015, on the “cyber security landscape,” “new workforce management software,” and “lessons learned from other implementations.”
But it’s clear that isn’t at all what happened; they missed their schedule, twice, for all the reasons outlined above. This wasn’t a problem with SCL and SPU staff, who have worked very hard for two and a half years on NCIS and have been denied vacation time for much of that time in order to try to keep the project on track. This is a problem with SCL and SPU management.
So what are the lessons for the City Council from this? The Council is the legislative branch, not the executive; they won’t ever directly manage a project like this — and they shouldn’t — but they hold the purse strings and can use their budgetary control and legislative power to force a more mature process on these sorts of major projects, or pull the plug on them when the executive is unable or unwilling to comply. Specifically, I think there are four important principles that the Council should learn from the NCIS debacle:
- Curb your ambitions. Doing a big, unwieldy project like this one should be the option of last resort, after all other alternatives that would break it up into smaller, more manageable pieces have been exhausted.
- Budget and build in phases. You can’t budget and plan accurately when you don’t know what you’re building. The best practice, much as with construction projects, is to budget and execute the design phase first, then let that inform the budget, staffing level, and plan for implementation, delivery, and ongoing support.
- Insist that one person is in charge, and that that person understands how to design and build large IT projects. It’s good to have a committee of stakeholders informing the team’s work; CFOs and customer service directors have important contributions to make in ensuring that big IT investments meet the right needs. But you can’t run a large software project by committee. At the end of the day, you need one person with experience in IT project management who has overall responsibility for the project and the authority to make the difficult decisions.
- Reward transparency and adoption of best practices. Lots of things went wrong on the project because the team, and the steering committee, felt pressure to keep pushing forward when the right thing to do was to stop and regroup. The result was a set of self-inflicted wounds that accumulated over time and made the project more difficult to manage. In truth, the NCIS project was probably always destined to take this long and cost this much; there was simply too much work and too much complexity involved to do it faster or cheaper. Hiding the evidence of that truth accomplished nothing, and cost SCL and SPU their reputations for being able to competently manage a project like this.
And to that end, the creation of the new Seattle IT department might be a blessing, particularly if it becomes the driver of mature IT practices across the city’s departments. Seattle IT is too new for there to be any evidence as to whether that will be the case, but the final lesson from NCIS is that the Council should be pressing the new IT organization to reform the city’s ways, including establishing a new, rigorous oversight regime for major projects. This may be the best opportunity the Council will get to prevent the next NCIS.