Most computer breakdowns are caused by human error. From forgetting to check a UPS battery charge, to accidentally pushing the Emergency Power Off button, a simple mistake could cause a facility to stop all activity, and even lose data. In smaller systems, failing to install voltage transient protection is a major cause of equipment breakdown.
Along with the obvious causes come more subtle reasons why computers fail. Here is a summary of the most likely culprits.
- Voltage spike – This is a momentary change in the supply of electric power. It’s important to remember that the duration of the transient can range from just milliseconds to longer periods of time. Even a small power glitch can damage a computer and corrupt data. If the screen goes blank during a thunderstorm, it’s likely that a voltage spike is responsible.
- Circuit board fails with no visible indication of damage – This is the typical scenario: A computer goes down. A defective circuit board is suspected as the cause, but an inspection reveals no signs of physical damage. A replacement board is installed and operation resumes.
- Failure to load software upgrades – Computer software often comes with various bugs, or vulnerabilities. These are simply programming mistakes, but they can make the software susceptible to viruses and glitches. When an error is discovered, the software designer sends out a “patch” that requires downloading and rebooting. Failing to upgrade the software by running the patch can lead to corruption of data.
- Failure to keep virus protection up to date – There are several malware protection packages available for computer rooms or data centers. Once installed, the subscription must be maintained. One telltale sign of a malware-infected computer is slow data processing.
- Inadequate cooling – Computers produce heat, and the more data they process, the more heat must be eliminated. Servers, the “brains” of a computer room, fail under excess heat, so adequate cooling is essential.
- Environment – If a computer room has a dedicated cooling system, there may be an opportunity to closely control the heat. If the cooling is shared with other equipment or the space contains offices, controlling the heat is often more challenging, because people have comfort needs different from those of computers.
- Loss of Power, PDU fails – Power Distribution Units (PDUs) are similar to power bars in homes and offices. They distribute power to different components in a computer room, and are usually located next to the server racks. PDUs suffer a range of problems. Receptacles wear out and can be a source of bad contacts, and moisture can lead to corrosion.
- UPS batteries – Uninterruptible Power Supply (UPS) batteries are found in computer rooms. Most often, they are used to provide continuous power for a limited time during an outage. Depending on how a battery is used and how it is designed, its life span can be as short as 4 to 5 years, even for what is considered a “10-year life” battery. Elevated room temperatures will shorten the life of any battery. Even a small facility such as a dentist’s office uses UPS batteries.
- Transfer switch battery defective – This can be an indication that the site lacks an effective battery maintenance program. It can also be a sign that the transfer switch is “frozen” in position. Preventive maintenance could help avoid a breakdown.
- Failure of standby power to synchronize (come on line) – When electric power from the grid disappears, standby power takes over in one of two ways. In a small computer room, the focus is on turning computers off in a controlled manner. A fully charged UPS will provide power for 10 to 15 minutes to allow a computer to save its data and turn off. In a larger facility, it’s a matter of keeping cooling and power going long enough to give the standby generator an opportunity to “synchronize,” that is, to get up to a speed sufficient to power the infrastructure and continue data processing.
- Diesel fuel contaminated – Most medium to large facilities have generators powered by diesel fuel, which can become contaminated, leaving it useless when a power outage occurs. Water is usually the culprit. Regular fuel testing is necessary to catch contamination before the generator is needed.
- Storage defective – Data is stored on magnetic disks or tape. Both can be damaged during normal operation, so it’s good practice to back up stored data on a regular basis. Backing up means creating a spare copy of the data. Failure of a tape or disk drive is often, but not always, preceded by noise. Opening the unit to attempt data restoration is better left to a professional.
- Sprinkler system turns on above the servers – Where a data room is protected by a water sprinkler system, even a small spill can damage the electronic components. This is particularly true in cases where sprinkler heads are located directly above the servers.
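For a small computer room, the controlled shutdown described under standby power can be pictured as a simple rule: once the UPS is running on battery, start shutting down before the remaining runtime gets too short to finish saving data. The Python sketch below is only an illustration of that idea. The `UpsStatus` record, the `should_shut_down` function, and the 5-minute and 2-minute figures are all hypothetical assumptions, not values from any particular UPS product or from this article.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of a UPS, as a monitoring tool might report it."""
    on_battery: bool        # True when grid power has been lost
    runtime_minutes: float  # estimated minutes of battery power remaining


def should_shut_down(status: UpsStatus,
                     shutdown_minutes: float = 5.0,
                     safety_margin_minutes: float = 2.0) -> bool:
    """Return True when a controlled shutdown should begin now.

    shutdown_minutes is the assumed time a computer needs to save its
    data and turn off; safety_margin_minutes is extra headroom. Both
    defaults are illustrative and should be measured for a real site.
    """
    if not status.on_battery:
        # Grid power is present, so there is nothing to do.
        return False
    return status.runtime_minutes <= shutdown_minutes + safety_margin_minutes


if __name__ == "__main__":
    samples = (UpsStatus(False, 15.0), UpsStatus(True, 12.0), UpsStatus(True, 6.5))
    for status in samples:
        action = "shut down" if should_shut_down(status) else "keep running"
        print(status, "->", action)
```

A rule like this only helps if the runtime estimate can be trusted, which is one more reason to keep UPS batteries charged, tested, and replaced on schedule.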
Computers are crucial to business today, but there are so many things that can cause them to fail. By addressing the above common culprits, you could help reduce the overall risk of your computer systems failing.
© 2015 The Hartford Steam Boiler Inspection and Insurance Company. All rights reserved. This article is intended for information purposes only. All recommendations are general guidelines and are not intended to be exhaustive or complete, nor are they designed to replace information or instructions from the manufacturer of your equipment. Contact your equipment service representative or manufacturer with specific questions.