All site blogs

Things worth reading: 6th October 2011

October 6, 2011 by skinnercm   Comments (0)

Things we're reading today include ...


Resilience and Business Continuity

October 6, 2011 by CausalCapital   Comments (0)

Resilience is perhaps the most important aspect of a solid business continuity program. When it comes to best-practice operational continuity, the businesses with the most stable risk systems also seem to be the companies that are most resilient.

The question we are asking today is, which industry sectors and which companies are most resilient?


Measures of resilience

The word resilience is derived from the Latin resiliens, the present participle of resilire, meaning to spring back or rebound. It denotes the power or ability to return to the original form or to recover from adversity. In practical terms, it might be the ability of something to remain continuous or to revert to its normal state of operation. If we were to take this literally, we would be most interested in measuring how long it takes for an operation that is suffering a disruption to resume.

In operational risk there are actually two measures we concern ourselves with: the first is how often something fails, and the second is how long it takes to repair that failure. To be technical:

The first is the Mean Time Between Failures (MTBF), which over a statistically predictable time horizon should tell us how many failures we can expect. Say, in a thousand trials or maybe a million, how often did the service fail?

Now, things that go bump in the night, as they say, aren't so bad if they are fixed quickly. Given this, we also need to measure the Mean Time to Repair (MTTR), or estimated time to repair, as the measure is often referred to.

Figure 1 : Up State and Down State
In figure 1 you should be able to clearly see the up state represented as red bars. You can likewise tally up how often the system is down over its event horizon and, if you cared to, accumulate the number of seconds (or milliseconds, if applicable) it takes to bring the service back into operation after each outage.
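The tallying described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Outage` record, the `summarize` function and the sample numbers are hypothetical, and MTBF is taken here as total up time divided by the number of failures, which is one common convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical record of a single down state."""
    start_s: float   # seconds into the observation window when the service failed
    repair_s: float  # seconds taken to bring the service back into operation


def summarize(outages, horizon_s):
    """Return (mtbf, mttr, availability) for one observation window.

    Assumes MTBF = total up time / number of failures and
    MTTR = total down time / number of failures.
    """
    if not outages:
        return float("inf"), 0.0, 1.0
    downtime = sum(o.repair_s for o in outages)
    uptime = horizon_s - downtime
    mtbf = uptime / len(outages)
    mttr = downtime / len(outages)
    availability = mtbf / (mtbf + mttr)  # equals uptime / horizon_s
    return mtbf, mttr, availability


# Made-up sample: three outages observed over one day (86,400 seconds).
outages = [Outage(3_600, 120), Outage(50_000, 300), Outage(80_000, 180)]
mtbf, mttr, availability = summarize(outages, horizon_s=86_400)
print(f"MTBF {mtbf:.0f}s, MTTR {mttr:.0f}s, availability {availability:.4%}")
```

With both measures in hand, availability falls out as the share of the window spent in the up state.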

Availability is normally expressed as up time over a one-year horizon. So if I were to tell you that a service is 99.885845% available, you might think that is a great statistic, but in reality you can expect about 10 hours of accumulated outage in a single year of operation.

Is that good?

Well, let's hope it is a cumulative measure over 365 days, especially if you depend on the service, as you do on a jet engine in an aeroplane. Fret not, those of you who, like me, dislike flying: the majority of turbines being manufactured today are running "6 Nines", and most aircraft are able to fly on a single engine once they are cruising.
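The availability arithmetic in this section is easy to check. The sketch below is a minimal illustration, assuming a 365-day year of 8,760 hours and reading "6 Nines" as 99.9999% availability; the function name `annual_downtime_hours` is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_pct):
    """Expected accumulated outage per year for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for label, pct in [("figure quoted above", 99.885845), ("six nines", 99.9999)]:
    hours = annual_downtime_hours(pct)
    print(f"{label}: {pct}% -> {hours:.4f} hours ({hours * 3600:.1f} seconds) per year")
```

On these assumptions the quoted figure works out to roughly 10 hours of outage a year, while six nines leaves only about half a minute.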


Building resilience
There are several key tips for building resilience into operations, and some industry sectors seem to have been better at adopting them than others. Six specific points to consider for continuous operation are listed below:

[1] Failure is normal operation
If you want to improve operational effectiveness, then design your system for a failed environment, an environment where volatility is the norm. Systems that are built to operate in such places seem to be more robust in normal states of operation.

[2] Adding Redundancy

Adding redundancy is perhaps the easiest way to increase the mean time between failures, although it may not be the most cost effective. Nonetheless, for mission-critical applications, resilient firms employ this tactic.
Figure 2 : Designing with failure in mind

In the simple example above, a system that has a 50% chance of operating in the next hour will increase its chances of operating to 75% if two such systems, (A) and (B), are made available in parallel, because there is only a 25% chance that both fail at once. Serially arranged processes are much more problematic: they are a weaker design and usually increase the failure rate in firms.
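The difference between the two arrangements can be sketched as follows. This is an illustration only: it assumes the systems fail independently of one another, which real components sharing power, staff or software often do not, and the function names are hypothetical.

```python
from math import prod


def parallel(probs):
    """Chance that at least one redundant system operates (independence assumed)."""
    return 1 - prod(1 - p for p in probs)


def series(probs):
    """Chance that every step in a serial chain operates (independence assumed)."""
    return prod(probs)


systems = [0.5, 0.5]  # systems (A) and (B), each with a 50% chance of operating
print(f"parallel: {parallel(systems):.0%}")
print(f"series:   {series(systems):.0%}")
```

Under the same assumption, chaining the two systems in series leaves only a 25% chance that the whole process operates, which is why serial designs are the weaker choice.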

[3] Reducing Tail Events

Tail events can often be avoided in a natural manner by setting up firewalls between specific risk factors and drivers. For example, many explosions require three elements (oxygen, fuel and an ignition source), so separating the elements will reduce the likelihood of an event.

[4] Operational Simplification

Businesses that are able to simplify processes, authority lines and machine elements seem to increase their reliability by reducing the mean time to repair. Additionally, overly complex services or systems with many moving parts seem prone to a higher chance of compound error.

[5] Outsourcing

Outsourcing generally builds dependence rather than resilience. It may lower unit costs of production, but it can also interfere with internal risk and control, so it poses a substantial threat to an operating entity. Outsourced functions need careful review and often the establishment of redundancy programs, which may sadly reduce the benefits of such ventures in the first place.

[6] Creative Industries

Businesses that embrace resilience seem to be those which are generally more creative: they reject bureaucracy and consequently can adapt to change more rapidly. These business models also tend to make decisions swiftly because authority is decentralized, and they seem to thrive on diversity rather than stumble over it.


Resilience in practice
Looking across all industry sectors (mining, construction, energy production, military applications, hospitality, logistics, air traffic, finance, manufacturing, information technology), some seem more prone to failure than others. This is partly due to the environmental nature of their businesses or a lack of standards, but some sectors simply seem to have appalling resilience benchmarks.

Military Applications

Two industries which are worth looking at from a resilience perspective are the military and the airlines. 

Why?

Well, both of these industries operate in relatively volatile environments, which results in equipment being designed for failure. In reality, no one would want to go into battle with guns that jam or communication devices that become faulty when dropped or wet. The US in particular has extremely high standards for both defining and testing operation, set out in MIL-STD-810. This standard addresses an incredible array of potential catastrophe factors, from shock to fungus, and is in itself at the leading edge of resilience design.
Airlines
The airline industry is also heavily regulated by standards that ensure equipment is able to operate in the most hostile environments, that failures result in shutdowns rather than explosive responses, and that continuous operation is over five times what is normally expected.
Figure 3 : Sample FAR Standards

Figure 3 highlights a sample of the tests which are carried out on turbines, and you can check out General Electric's test bed for a successful Engine Icing Stall Test by following this <link>. Personally, I am impressed, and a lot can be learned from the quality control General Electric seems to employ.

Financial Services
You guessed it: finance was going to appear in this blog post, seeing that it is where I spend most of my time working, and I left the finance bunch for the end of this article. In my opinion, financial services would rate pretty poorly in our resilience ranking, and if some of these banks were to run airlines, many of us would never want to fly again.

On the resilience factor, manufacturing experience is probably more widely dispersed from a global perspective, and in some parts of the world construction would fall into that bucket as well. The banking sector, however, has been showing a deteriorating standard of resilience, and in 2011 many jurisdictions have been unable to run through a single event horizon without living through a critical failure. In short, the banks need to do more work.


Thought Experiment On Randomness

October 6, 2011 by jshore   Comments (0)

Was discussing pseudo-random number generators with a colleague, around desirable attributes of the distribution, periodicity, etc — all fun and important stuff.
