A lot has been written in the past week about the problems experienced by Research In Motion (RIM) and their Blackberry platform. Some of it highlights significant failings in the way we approach Business Continuity.
For those of you who missed the story, here is the very short version. The outage was first experienced in the UK on the morning of Tuesday, 11th October. The impact spread across Europe and Africa soon after; by Thursday (13th) it was being felt in the Americas.
The Blackberry service has more than 70 million subscribers world-wide, most of whom found themselves without access to email and some internet services over a period of three days.
Continuity Central ran an article arguing that the outage was caused by a failure of business continuity. I'm not sure I would agree with that exactly, but it does reflect poorly on BCM practices. Certainly it is poor customer service that such an outage occurred, but I think it is even sadder that many of the people responsible for contingency planning in customer organisations apparently had no idea about the limitations of the platform their companies rely on.
RIM reported that the outage began when a switch failed in their UK Data Centre; there was a backup device, but “the failover did not function as previously tested.”
The observation is made in the article that successful tests do not guarantee that things will work on the day. I want to know who said the test was successful! Two questions I would ask, which I have not seen asked yet:
- How was it tested?
  - If it didn't involve pulling the power plug out under load, then ask yourself what you are really testing.
  - Sure, you need to build up to this, but a stage-managed exercise without load is not adequate testing for a platform like this.
- How frequently?
  - This outage seems to be related to volume, and volume is increasing over time.
  - It is also an issue of concentration risk.
  - Have they tested with recent volumes and traffic patterns?
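The point about testing under load can be illustrated with a toy sketch (entirely hypothetical numbers and names, not RIM's architecture): a failover drill with no traffic flowing will always "pass", because nothing queues up during the switchover window. Run the same drill at production-like volume and the backlog the failover creates becomes visible.

```python
def failover_backlog(arrival_rate, drain_rate, failover_delay, horizon):
    """Toy model of a failover drill.

    arrival_rate   -- incoming messages/sec during the test (the 'load')
    drain_rate     -- messages/sec the backup can process once live
    failover_delay -- seconds of no processing while failover completes
    horizon        -- total seconds simulated after the primary fails
    Returns the queued backlog at each second after the failure.
    """
    backlog = 0
    history = []
    for t in range(horizon):
        backlog += arrival_rate              # traffic keeps arriving
        if t >= failover_delay:              # backup is now live
            backlog = max(0, backlog - drain_rate)
        history.append(backlog)
    return history

# A stage-managed test with no load "succeeds" -- zero backlog:
print(failover_backlog(arrival_rate=0, drain_rate=100,
                       failover_delay=10, horizon=30)[-1])   # 0

# The same failover under load tells a very different story:
print(failover_backlog(arrival_rate=80, drain_rate=100,
                       failover_delay=10, horizon=30)[-1])   # 400
```

The second run shows that even a failover which technically "works" leaves hundreds of messages still queued half a minute later, which is exactly the kind of outcome a no-load exercise can never surface.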
I would suggest that the first BCM lesson from this outage should be:
Practice does not make perfect, perfect practice makes perfect.
No arguments from me with the next point made in the Continuity Central piece, that ‘Holistic BC’ needs to make provisions for the failure of contingency strategies. But we need to make the point clearer that this is achieved by having early detection processes, adaptive capacity and a capability for immediate response – not more detailed plans.
That's why I think they have it the wrong way around at the end.
Business Continuity is not a substitute for High Availability.
Actually they are talking about recovery strategies, not true continuous operation. RIM need continuous operation; implementing a Disaster Recovery option (defined as a solution where no service is delivered for a period of time) is exactly what RIM have done – and look at the damage it caused. Indications are that the traffic backlogs created by having to invoke a DR strategy (and the associated message-processing backlogs) contributed to the geographic spread and impact of the outage.
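Why backlogs matter so much is worth a moment's arithmetic (a simplified queueing sketch with illustrative numbers, not RIM's actual figures): during an outage, every message that would have been delivered queues up instead, and once service resumes the backlog only shrinks at the rate of *spare* capacity, not total capacity.

```python
def backlog_clear_time(outage_sec, arrival_rate, drain_rate):
    """Seconds needed to work off the backlog accumulated during an
    outage, assuming new messages keep arriving at arrival_rate while
    the restored service drains at drain_rate (> arrival_rate)."""
    if drain_rate <= arrival_rate:
        return float("inf")                  # backlog never clears
    backlog = outage_sec * arrival_rate      # messages queued during the outage
    return backlog / (drain_rate - arrival_rate)

# e.g. a 1-hour outage at 900 msg/s, drained afterwards at 1000 msg/s,
# takes 9 hours to clear -- spare capacity, not raw capacity, decides it:
print(backlog_clear_time(3600, 900, 1000) / 3600)   # 9.0
```

A system running close to its capacity ceiling can therefore turn a short DR window into a much longer degraded period, which is consistent with the reports of the backlog spreading the impact geographically.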
Some point out that this UK data centre appears to be a single point of failure, which looks to be true. One would assume that this is acceptable to Blackberry users. Until the most recent version of the Blackberry Enterprise Server (the piece that sits in the client's network and talks to the in-house email system), it did not support clustering and High Availability. On that basis, a single point of failure has been acceptable to corporate Blackberry users (and their contingency planners) for many years. I have seen nothing about the effectiveness of the workarounds they must have had in place.
Interesting that much of the driver for Blackberry in enterprises (especially Government) has been the IT Security folks. This is because RIM has a simple, closed network that enables quick delivery of messages with the controlled platform that appeals to Security. Blackberry may not have been implemented (originally) to deliver the best end-user experience.
The biggest sin of RIM seems to have been their crisis communications, and perhaps a tinge of arrogance (possibly due to many years of commercial success).
Perhaps RIM have not understood the different perspective they need to take as they change from a technology vendor to a service provider. Here then is another lesson to be learned from this incident:
Make sure you understand the true nature of the relationships in your value chain.
RIM deliver their service via numerous local service providers – many of them telecommunications and mobile phone companies. They did not seem able to operate collectively to manage the message through this incident.
It is unlikely that customers buy their Blackberry service direct from RIM, yet it is the Blackberry brand, rather than their actual service provider, that is seen to be impacted here.
The service providers may experience a loss of Blackberry revenue, but that may be offset by their customers dropping Blackberry service and moving to alternatives such as iPhone.
RIM needed a crisis communication strategy that reflected these relationships. Instead, their Canadian-based Twitter support came online on the 11th, several hours into the European outage, and seemed totally unaware of the incident.
Ultimately the real test of BCM failure in this incident will be whether RIM stay in business. A similar outage at National Australia Bank did not seem to impact its profitability.
Again I ask the question – do we overvalue reputation impacts from these kinds of outages? RIM will suffer if corporates and large Government clients leave.
If you have Blackberry in your organisation, you need to understand how that service is delivered, inside and outside your company – no excuses.
There is rarely an effective workaround for the delivery of online, electronic services. High Availability and Continuous Operations are your continuity strategies – it is essential to understand the real impacts of having to resort to DR strategies, and especially your capability to handle backlogs.