Efficient Management of the Data Center’s Engineering Infrastructure
Two control loops can be distinguished in the engineering infrastructure of a data center. One handles cooling and power distribution at the rack level, while the other works at the level of the entire facility and covers not only power and air conditioning but also various auxiliary subsystems (fire suppression, access control, and others). Often, these loops and even their individual components are independent of each other and are run by different teams.
Companies are not always willing to purchase integrated solutions for engineering infrastructure management. Commercial data centers usually have no choice but to do so. In the corporate segment, however, it is not uncommon for management to try to save money and to budget only for fragmented systems that cover air conditioners and UPS units. The lack of communication between control loops, the differing levels of automation across data center subsystems, and a fleet of equipment from different vendors complicate the coordinated work of all parts of the facility, making optimization impossible.
In the worst-case, small-scale scenario, data center subsystems are controlled manually, and the installation and movement of equipment are documented in Microsoft Excel. The paperwork is often a mess, which is natural enough, since keeping an accurate database in spreadsheets is very hard. Once the number of racks reaches the tens, problems with manual accounting are inevitable. Equipment in such a data center is replaced only after it fails, which increases overhead costs and downtime in the event of an accident.
If downtime is critical for the data center owners, a reactive control model is used. In this case, the troubleshooting procedure is regulated, and the accompanying paperwork is maintained. However, the process relies on the experience of employees and their knowledge of a particular data center. When an accident occurs, the problem is eliminated quickly enough, but prevention is difficult because there is no way to analyze the causes of a malfunction comprehensively. And when only a couple of experts understand all the facility management processes, the departure of even one of them creates new problems.
A more advanced management model is service-oriented. It assumes complete paperwork covering all subsystems of the facility. It clearly defines the rules and procedures for equipment replacement and preventive maintenance and keeps a thorough record of equipment installation and movement, while operational services prepare reports on the parameters of engineering systems, on accidents, and on the actions personnel took to eliminate them.
The main feature of the service-oriented approach to data center management is proactivity. This model makes it possible not only to analyze the causes of errors but also to anticipate problems before they occur, and to establish workarounds that quickly restore the availability of services. Of course, such an approach is impossible without a single automated monitoring and dispatching system for all critical data center subsystems. Practice shows that failures in these subsystems are often caused by employee actions. There is always a shortage of highly skilled experts, but if the dispatching center is automated and all the facility maintenance rules and regulations are formalized, most of the personnel need only basic knowledge.
Monitoring and Dispatching
About ten years ago, DCIM (Data Center Infrastructure Management) solutions, which combine all engineering subsystems into a single logical structure, appeared on the market. The first versions of DCIM could draw up diagrams and floor plans of facilities and maintain paperwork, but their functionality has since changed significantly. Modern solutions can interact with the monitoring tools built into equipment from different manufacturers and can connect additional sensors, controllers, signal converters, and data collection systems. The data collected most often covers energy consumption at every level down to the rack; temperature and humidity in racks, in cooling systems, and inside ducts; and fluid leaks. This is the minimum a DCIM system needs to do its job.
Once DCIM is in place, the customer receives an integrated monitoring and control environment that will include all the critical subsystems and even IT equipment in some cases. Its main task is to combine the data streams coming from the maximum number of available sources. The information is collected and processed in real time, which gives the service personnel a complete picture of the functioning of all subsystems of the data center, including, if necessary, its computing power. This is where we see yet another advantage of DCIM, which is reducing the impact of human factors on the performance of data center subsystems.
The Problem of Choice
DCIM can be introduced in different ways, but it is best to plan for it at the design stage of the facility. It is also possible to integrate existing stand-alone subsystems built on equipment from different manufacturers. Choosing a solution at the design stage of the data center does not cause any problems; the choice is usually made with a system integrator, who helps pick the necessary hardware and software.
The situation with an existing data center is much more complicated. In this case, a working group must be assembled that includes representatives of all interested departments. The group needs to list all the parameters and nodes of the infrastructure that will be monitored and arrange them in descending order of importance. Next, it needs to audit the protocols and means of communication supported by the infrastructure equipment and to work out what additional sensors and controllers will have to be installed.
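The inventory and audit steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MonitoredPoint` record, the numeric importance scores, the example equipment, and the protocol names are all hypothetical and are not taken from any particular DCIM product.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoredPoint:
    """One parameter or node the working group wants to monitor (hypothetical model)."""
    name: str
    importance: int                      # assumed scale: higher means more critical
    protocols: set = field(default_factory=set)  # protocols the equipment already speaks


def plan_audit(points, dcim_protocols):
    """Rank points by importance and flag those that need extra hardware.

    A point needs an additional sensor, controller, or converter when it shares
    no protocol with the DCIM system under consideration.
    """
    ranked = sorted(points, key=lambda p: p.importance, reverse=True)
    needs_hardware = [p.name for p in ranked if not (p.protocols & dcim_protocols)]
    return ranked, needs_hardware


# Illustrative inventory; names, scores, and protocols are made up for the example.
inventory = [
    MonitoredPoint("rack inlet temperature", 8),
    MonitoredPoint("UPS output power", 10, {"snmp"}),
    MonitoredPoint("chiller status", 9, {"modbus"}),
    MonitoredPoint("leak detection", 7),
]

ranked, needs_hardware = plan_audit(inventory, {"snmp", "modbus"})
for point in ranked:
    print(point.importance, point.name)
print("Needs additional hardware:", needs_hardware)
```

The output of such a script is roughly what the working group needs before choosing software: a prioritized list of monitored points and a shortlist of places where new sensors or converters must be budgeted.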
With all this information, you can select the necessary software solutions, list additional equipment, and calculate the project budget. It is an excellent idea to outsource the introduction of DCIM into an existing facility completely, because errors at the design stage will cost more than the services of a system integrator. Initially, DCIM systems were installed locally, but many vendors now offer them as a service (SaaS). This approach allows for a significant reduction in capital expenditures.
The main item in the data center's operating expenditures is the cost of electricity. The operation of IT equipment and cooling systems racks up high electric bills, so optimizing energy consumption is a priority. Consumption depends on a vast number of external and internal factors. For example, climate and weather conditions, including seasonal changes, directly affect cooling systems. Add to that the peaks and drops in the load on computing and telecommunications equipment and dozens of other nuances. It is impossible to take them all into account manually, but a DCIM system lets you accumulate real operating statistics and analyze them, identifying problem areas in the infrastructure of the facility.
One of the most critical indicators for a data center is the Power Usage Effectiveness (PUE) coefficient, which shows how much power is spent on the IT load and how much goes to auxiliary needs, such as cooling, UPS operation, and losses in the distribution system. It is calculated by dividing total energy consumption by the consumption of IT equipment. Until recently, a PUE of 1.6 to 2.0 was considered acceptable. Now, however, the market demands more efficient data centers, and operators are competing for values of 1.1 to 1.2. Most often, consumption is measured at the output of the UPS, at the output of the power distribution unit, and at the IT equipment itself.
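As a worked example of the PUE ratio, here is a minimal Python sketch. The function names and the sample figures are assumptions chosen for illustration; the only formula taken from the text is total energy consumption divided by IT equipment consumption.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by IT energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment consumption must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total consumption cannot be below IT consumption")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of total energy that goes to non-IT needs (cooling, UPS, losses)."""
    if pue_value < 1:
        raise ValueError("PUE cannot be below 1")
    return 1 - 1 / pue_value


# Illustrative figures: 1600 kWh drawn by the facility, 1000 kWh of it by IT equipment.
value = pue(1600.0, 1000.0)
print(f"PUE = {value:.2f}")                          # prints PUE = 1.60
print(f"Overhead share = {overhead_share(value):.1%}")
```

With these sample numbers, a PUE of 1.6 means that a little over a third of all energy is consumed by something other than the IT load, which is why the ratio is a convenient first indicator of efficiency.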
Based on the data obtained, it is possible to determine the energy efficiency of the data center quite accurately. However, the PUE does not reflect all the nuances of energy efficiency. The factor is still important, but it does not, for example, account for server downtime or identify problematic hot spots. In addition, pushing the PUE to values close to one often comes at the expense of data center reliability: accidents and a shorter service life of equipment can negate the effect of energy savings.
Modern control systems collect energy consumption data from servers, racks, and distribution equipment. It is even possible to monitor each socket. Statistics on the consumption of critical resources can be visualized in an easy-to-understand form, which makes it easier to find the most energy-intensive areas and to optimize costs. Periods of reduced load can also be identified so that maintenance can be scheduled during them. Analysis of consumption peaks makes it possible to keep the power reserve in the range of 10%-15% instead of the 30%-40% typical of manual control, which is a significant saving in itself.
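The two analyses mentioned above, sizing the power reserve from observed peaks and finding a low-load period for maintenance, can be sketched as follows. This is a simplified illustration, not how any particular DCIM product works: the function names and sample data are hypothetical, and the reserve is assumed to mean the share of provisioned capacity left unused at the observed peak.

```python
from typing import Sequence


def reserve_fraction(samples_kw: Sequence[float], capacity_kw: float) -> float:
    """Share of provisioned capacity left unused at the observed consumption peak."""
    if not samples_kw:
        raise ValueError("at least one power sample is required")
    if capacity_kw <= 0:
        raise ValueError("capacity must be positive")
    return (capacity_kw - max(samples_kw)) / capacity_kw


def quietest_window(hourly_kw: Sequence[float], window_hours: int) -> int:
    """Start index of the contiguous window with the lowest total load."""
    if window_hours <= 0 or window_hours > len(hourly_kw):
        raise ValueError("window must fit inside the series")
    current = sum(hourly_kw[:window_hours])
    best_sum, best_start = current, 0
    # Slide the window one hour at a time, updating the running sum.
    for start in range(1, len(hourly_kw) - window_hours + 1):
        current += hourly_kw[start + window_hours - 1] - hourly_kw[start - 1]
        if current < best_sum:
            best_sum, best_start = current, start
    return best_start


# Illustrative data: peak samples against 100 kW of capacity, and hourly averages.
reserve = reserve_fraction([70, 88, 81], 100)
print(f"Reserve at peak: {reserve:.0%}")             # prints Reserve at peak: 12%
start = quietest_window([50, 48, 30, 28, 29, 45, 60, 70], 3)
print("Quietest 3-hour window starts at hour index", start)
```

In practice the input series would come from the statistics the control system has accumulated over weeks or months, so that seasonal and weekly peaks are included before the reserve is reduced.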
DCIM-class solutions also monitor other engineering subsystems. For example, they help map the airflow to identify problematic areas of the air conditioning and climate control system, which is second only to IT equipment in terms of electricity consumption in the data center. They also support detecting faults before serious problems occur, carrying out preventive maintenance, and eliminating issues quickly, all of which increases the reliability of the infrastructure and reduces costs. Manual control is feasible only in a small server room; when there are dozens or even hundreds of racks, the introduction of DCIM becomes a necessity.
So far, we have only talked about the engineering infrastructure, because IT infrastructure management is considered a separate task. It is usually handled by systems that are not related to DCIM. In commercial data centers, the operation of IT equipment is the responsibility of the customers. However, the development of virtualization and of converged and hyperconverged architectures is gradually changing the situation. Today, software developers are building solutions that enable real-time monitoring of the status of virtual servers on individual physical devices, and IT vendors are embedding a vast number of sensors in their products to monitor power consumption and temperature.
Effective load planning in virtual environments must cover all levels: operating systems and applications, servers, storage systems, telecommunications equipment and communication channels, and, of course, physical resources such as power, cooling, and humidification. DCIM solutions are no longer a "thing in itself" in large corporate data centers. Their close integration with virtualization platforms and IT infrastructure management systems can be expected in the near future.