Author: Steven Brown, Schneider Electric Blog
The last thing anybody wants in their data center is an unplanned outage. Roughly one-third of all reported data center outages end up costing more than $250,000 annually, with many more exceeding $1 million, according to the Uptime Institute. Power failures are the number one cause of data center outages, accounting for 36 percent of the biggest global public service outages since 2016. So, a plan to prevent data center power outages is mission-critical.
The challenge for data center managers is that IT and physical infrastructure data have traditionally been monitored and managed separately. This has limited the ability of data center managers to conduct event correlation to anticipate and avoid potential threats that could impact their business.
These issues have been further exacerbated by the recent macro events forcing companies to collaborate differently, and now many data center staff have transitioned from working in the office to working from home. Because different IT work groups are supporting different systems (compute, storage, networking), IT managers are required to do more with less and maximize available resources.
Visibility into critical IT infrastructure is essential to understanding how power events will impact the business. Today, data center managers do a good job of monitoring and managing alarms for IT equipment (compute, networking, storage) but lack visibility into their critical IT physical infrastructure. Monitoring power is sometimes treated as an afterthought, which can adversely affect the business. Power is essential, and if it’s not working, IT fails. Data center managers need a better understanding of how to correlate data between these two systems so they can implement mitigation steps that minimize downtime. Capturing critical alarm types and correlating that data will help identify an issue before it actually happens.
To stay ahead of these outages, data center staff need to ensure they are monitoring alarms and understand where and when to take corrective action. Leveraging a data center infrastructure management (DCIM) tool that supports a single pane of glass view into their physical IT infrastructure can provide more efficiencies than having two disparate systems.
Monitor These Six Data Center Alerts to Increase Uptime
Below are six of the most common alarm types being reported by critical IT physical infrastructure devices:
- Input voltage or frequency cannot support bypass: This alarm is triggered when an automatic transfer switch is not able to switch over to backup when there is a failure.
- Maximum value temperature threshold: When a three-phase UPS or PDU exceeds its temperature setting, the alarm is set off.
- Environmental appliance alarm: This alarm could be triggered by a variety of physical, environmental or human threats, including temperature, humidity, door contact, leak, vibration or smoke.
- Lost connection between data center and remote monitoring system: In this scenario, the gateway is down and has lost network connectivity.
- Intelligence module fault: The network management card on a UPS is reporting a problem, such as a failure of the intelligence module, which would then need to be replaced.
- Communication status threshold: The alert may be related to the threshold level being exceeded by a power device configuration.
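To make the list above concrete, here is a minimal Python sketch of how these six alarm types might be represented, together with a simple check for the temperature threshold alarm. The `AlarmType` names, the `Reading` class, and the `check_temperature` function are hypothetical and exist only for illustration; they are not part of any Schneider Electric or Lenovo API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AlarmType(Enum):
    """The six alarm types from the list above (identifiers are illustrative)."""
    BYPASS_NOT_SUPPORTED = "Input voltage or frequency cannot support bypass"
    MAX_TEMPERATURE = "Maximum value temperature threshold"
    ENVIRONMENTAL = "Environmental appliance alarm"
    LOST_CONNECTION = "Lost connection to remote monitoring system"
    INTELLIGENCE_MODULE_FAULT = "Intelligence module fault"
    COMMUNICATION_STATUS = "Communication status threshold"


@dataclass(frozen=True)
class Reading:
    """A hypothetical temperature sample from a UPS or PDU."""
    device_id: str
    temperature_c: float


def check_temperature(reading: Reading, max_temperature_c: float) -> Optional[AlarmType]:
    """Return MAX_TEMPERATURE when the reading exceeds the configured limit."""
    if reading.temperature_c > max_temperature_c:
        return AlarmType.MAX_TEMPERATURE
    return None  # at or below the limit: no alarm
```

For example, `check_temperature(Reading("ups-1", 41.0), max_temperature_c=40.0)` raises the alarm, while a reading exactly at the limit does not. Whether a real device alarms at or only above its setting is an assumption here.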
Increasing Visibility to Common Alarms
Capturing these alarm types, correlating the data, and taking corrective action is critical to the success of a data center manager. For example, if there’s an alert signaling an impending outage at a specific PDU port that affects one server, maybe that virtualized workload can be quickly shifted to another server in order to avoid any interruption of business processes. Or, even better, what if preventive maintenance could enable data center managers to replace aging UPS batteries before a failure occurs?
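The PDU example above can be sketched in a few lines of Python. This is only an illustration of the correlation idea, under the assumption that a mapping from PDU outlets to servers and an inventory of workloads per server are available; the names `PowerAlarm`, `OUTLET_TO_SERVER`, `SERVER_WORKLOADS`, and `plan_migration` are hypothetical, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerAlarm:
    """A hypothetical alarm raised for one outlet on one PDU."""
    pdu_id: str
    outlet: int
    message: str


# Hypothetical mapping: (PDU id, outlet number) -> server powered by that outlet.
OUTLET_TO_SERVER = {
    ("pdu-a", 1): "server-01",
    ("pdu-a", 2): "server-02",
    ("pdu-b", 1): "server-03",
}

# Hypothetical inventory: server -> virtualized workloads it currently hosts.
SERVER_WORKLOADS = {
    "server-01": ["vm-web", "vm-db"],
    "server-02": [],
    "server-03": ["vm-batch"],
}


def plan_migration(alarm: PowerAlarm) -> dict[str, str]:
    """Assign each workload on the affected server to the least-loaded other server."""
    affected = OUTLET_TO_SERVER.get((alarm.pdu_id, alarm.outlet))
    if affected is None:
        return {}  # the alarm does not map to any known server

    # Current workload count on every server except the affected one.
    load = {s: len(w) for s, w in SERVER_WORKLOADS.items() if s != affected}
    if not load:
        return {}  # nowhere to move the workloads

    plan = {}
    for workload in SERVER_WORKLOADS[affected]:
        # Pick the server with the fewest workloads; break ties by name.
        target = min(load, key=lambda s: (load[s], s))
        plan[workload] = target
        load[target] += 1
    return plan
```

Calling `plan_migration(PowerAlarm("pdu-a", 1, "impending outage"))` returns a plan that moves the workloads off `server-01`. A production system would also weigh capacity, redundancy, and whether the target servers share the same failing power path, none of which is modeled here.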
Improve Data Center Monitoring
That’s the ultimate goal of a new collaboration between Lenovo and Schneider Electric that integrates Schneider Electric’s EcoStruxure™ IT Expert with Lenovo’s XClarity Orchestrator. With this integration, organizations that use Lenovo’s XClarity Orchestrator to monitor and manage servers, storage and networking equipment in a data center can now also monitor UPSs and PDUs through the same centralized management console. The console can capture power alarm events and take corrective actions, which helps reduce complexity, improve response times, and minimize downtime. Additionally, Lenovo customers may be able to monitor infrastructure equipment metrics, such as power consumption and temperature.
As new versions of the XClarity/EcoStruxure IT Expert integration roll out over time, the intent is for Lenovo customers to be alerted to which data center infrastructure issues can affect specific IT equipment and to be given recommended actions to take.
Eventually, this system could take on automation capabilities to include AI and ML to proactively identify and automatically remediate critical infrastructure failures before they happen. This will free up data center managers to focus more time on strategic activities. To learn more about Lenovo XClarity Orchestrator, visit this web page.
About the Authors
Jeff Van Heuklon
Jeff Van Heuklon is the chief architect for Lenovo’s XClarity systems management products. These products provide health monitoring, detailed inventory views, provisioning, serviceability and analytics functions. In addition, Jeff leads the activities to integrate these products with software from other companies. With 38 years of industry experience with Lenovo, and previously IBM, he has a wide range of knowledge of datacenter software.