Manufacturers must have computer systems that work. When talking about making systems that keep working, we often talk about “hardening” the system, but that can be misleading. Some things are hard but brittle, in that they can resist light pressure but shatter under a hard shock. What you want are systems that are resilient—able to absorb shocks and bounce back without breaking. That sounds good, but how do you translate “resilient” into an operational computer system?
In most circumstances, continuous operation means fail-over: a redundant component waiting to take over instantly if something goes wrong with the primary component. Depending on your organization’s needs and budget, this can range from an equipment rack with duplicate hardware sitting on the other side of the data center to a complete mirror data center in a remote location.
In either case, the key to redundancy is a controller that mirrors transactions to the fail-over system and switches to the redundant system if it detects specific conditions on the primary device. Many executives think that buying redundant hardware is an unnecessary cost. Compared to the cost of catastrophic failure when a critical system goes offline, however, redundancy can be amazingly cost-effective.
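To make the controller's two jobs concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Node and FailoverController classes, the "healthy" flag standing in for a real health probe, and the rule of switching after a set number of consecutive failed checks. Real products define their own trigger conditions and mirroring mechanisms.

```python
class Node:
    """Hypothetical stand-in for a primary or standby system."""

    def __init__(self, name):
        self.name = name
        self.healthy = True   # a real probe would test the device itself
        self.log = []         # transactions this node has applied

    def apply(self, txn):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.log.append(txn)


class FailoverController:
    """Mirrors each transaction and promotes the standby on repeated failures."""

    def __init__(self, primary, standby, max_failed_checks=3):
        self.active = primary
        self.standby = standby
        self.max_failed_checks = max_failed_checks  # assumed trigger condition
        self.failed_checks = 0

    def write(self, txn):
        # Apply to the active node, then mirror to the standby if one remains.
        self.active.apply(txn)
        if self.standby is not None:
            self.standby.apply(txn)

    def check(self):
        # Count consecutive failed health checks; a healthy check resets the count.
        if self.active.healthy:
            self.failed_checks = 0
            return
        self.failed_checks += 1
        if self.failed_checks >= self.max_failed_checks and self.standby is not None:
            # Promote the standby; no further redundancy is left after this.
            self.active, self.standby = self.standby, None
            self.failed_checks = 0


primary, standby = Node("primary"), Node("standby")
ctl = FailoverController(primary, standby, max_failed_checks=2)

ctl.write("order-1001")      # lands on both nodes
primary.healthy = False      # simulate a failure of the primary
ctl.check()
ctl.check()                  # second failed check triggers the switch
ctl.write("order-1002")      # now handled by the former standby

print(ctl.active.name, ctl.active.log)
```

Because every transaction was mirrored before the failure, the promoted node already holds the earlier order and picks up where the primary left off. Requiring more than one failed check before switching is one way to avoid failing over on a momentary glitch.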
Next on the resilience list is defense against criminal attacks. You’ve heard all the stories, but there are still far too many IT executives in manufacturing who feel that no one would want to attack their operations. Of those who do think about security, some still see hiding resources (“security through obscurity”) as an effective strategy. They’re wrong. If your systems are going to be there every time the company needs them, you need active security that targets not only nuisance attacks such as port probes but also advanced persistent threats and social engineering attacks.
Planning for Failure
Resilience means planning for and being ready for failure.
Planning begins with an effective backup and recovery strategy that includes practicing restore operations on a regular basis. Too many companies have placed their faith in a backup and restore regime that met technical requirements on paper but failed when it was actually needed, because small issues were never caught and fixed.
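A restore drill can be partly automated. The sketch below, using only the Python standard library, records a checksum manifest when the backup is made, restores the archive into a scratch directory, and reports any file that is missing or differs. The function names, the zip format, and the sample file names are assumptions chosen for illustration, not a description of any particular backup product.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source, archive):
    """Zip every file under source and return the manifest taken at backup time."""
    with zipfile.ZipFile(archive, "w") as zf:
        for p in sorted(source.rglob("*")):
            if p.is_file():
                zf.write(p, p.relative_to(source).as_posix())
    return checksums(source)


def verify_restore(archive, manifest):
    """Restore into a scratch directory; return paths that are missing or differ."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        restored = checksums(Path(scratch))
    paths = manifest.keys() | restored.keys()
    return sorted(p for p in paths if manifest.get(p) != restored.get(p))


with tempfile.TemporaryDirectory() as work:
    source = Path(work) / "data"
    source.mkdir()
    (source / "inventory.csv").write_text("part,qty\nA-100,42\n")
    (source / "payroll.csv").write_text("employee,amount\n1001,2500\n")

    archive = Path(work) / "backup.zip"
    manifest = make_backup(source, archive)
    clean = verify_restore(archive, manifest)

    # Simulate a small, unnoticed problem: an archive that silently omits a file.
    incomplete = Path(work) / "incomplete.zip"
    with zipfile.ZipFile(incomplete, "w") as zf:
        zf.write(source / "inventory.csv", "inventory.csv")
    missing = verify_restore(incomplete, manifest)

print("good backup problems:", clean)
print("incomplete backup problems:", missing)
```

Run on a schedule, a check along these lines surfaces exactly the kind of small issue that otherwise goes unnoticed until the day the backup is needed.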
Resilience can include off-site storage of backup media and an off-site data center that can be brought online from that media. The time to get back online might be measured in hours rather than seconds. However, for many functions a few hours is an adequate timeframe, and far better than having critical functions, such as inventory, accounts receivable, or payroll, offline for days, as can happen after a natural disaster.
Hardening a system might mean trying to ensure that no external event can have an impact on it. Resilience recognizes that the outside world exists and will have an impact, and it designs systems that can absorb the shock and continue operating. Design for resilience and you have a much better chance of maintaining your manufacturing operations, regardless of what the outside world might bring to bear on your IT systems.