Readers like you help support MUO. When you make a purchase using links on our site, we may earn an affiliate commission. Read More.

Most organizations rely heavily on their IT infrastructure to run their operations. Unplanned system failures or performance degradation can lead to disruptions, financial losses, and damage to reputation.

Automated system health checks are crucial to ensure that IT infrastructure remains stable and reliable. By monitoring critical metrics and promptly detecting anomalies, you can minimize downtime.

Defining Health Checks

It is essential to define what health checks you want to perform on your system. You should establish clear criteria for what you'll monitor and why. Begin by identifying the primary goals of your system. What functions or services does it provide?

Then, set performance benchmarks based on historical data and ensure your health checks assess the efficient use of system resources. Finally, define the thresholds that indicate a problem. What percentage of resource usage do you consider high or low? At what point should the system trigger an alert?

Choosing Libraries and Setting Up Your Environment

To automate the system monitoring process in Python, you will need the following libraries to help you gather system metrics and then schedule the checks.

  • psutil: This is a cross-platform library that provides an interface for retrieving information on system utilization (CPU, memory, disks, network, sensors).
  • schedule: This library provides a simple way to schedule tasks to run at specific intervals.
  • time: A Python built-in library that you will use for time-related operations.
  • logging: Another built-in library that you will use for creating logs of the system health checks.

Start setting things up by creating a new Python virtual environment. This will prevent any potential version library conflicts. Then run the following terminal command to install the required libraries with Pip:

 pip install psutil schedule

Once the libraries are installed on your system, your environment is ready.

The full source code is available in a GitHub repository.

Importing the Required Libraries

Create a new script, monitoring.py, and begin it by importing the required libraries:

 import psutil
import schedule
import time
import logging

Importing the libraries will allow you to use the functionality they offer in your code.

Logging and Reporting

You need a way to log the results of your health checks. Logging serves as a vital tool for capturing and preserving a historical record of events and debugging problems in your code. It also plays a critical role in performance analysis.

Use the built-in logging library to create your logs for this project. You can save the log messages to a file named system_monitor.log.

 # Function to log messages
def log_message(message):
    # Configure logging
    logging.basicConfig(filename='system_monitor.log', level=logging.INFO,
                       format='%(asctime)s - %(message)s')
    logging.info(message)

For reporting, print an alert message on the console to serve as immediate notification about any issues that require attention.

 # Function to print alerts to the console
def print_alert(message):
    print(f"ALERT: {message}")

The health check functions will use these functions to log and report their relevant findings.

Creating Health Check Functions

For each health check, define a function that will encapsulate a specific test that evaluates a critical aspect of your infrastructure.

CPU Usage Monitoring

Start by defining a function that will monitor CPU usage. This will serve as a critical indicator of a system's overall performance and resource utilization. Excessive CPU usage leads to system slowdowns, unresponsiveness, and even crashes, severely disrupting essential services.

By regularly checking the CPU usage and setting appropriate thresholds, system administrators can identify performance bottlenecks, resource-intensive processes, or potential hardware issues.

 # Health check functions
def check_cpu_usage(threshold=50):
    cpu_usage = psutil.cpu_percent(interval=1)

    if cpu_usage > threshold:
        message = f"High CPU usage detected: {cpu_usage}%"
        log_message(message)
        print_alert(message)

The function checks the current CPU usage of the system. If the CPU usage exceeds the threshold in percentage, it logs a message indicating high CPU usage and prints an alert message.

Memory Usage Monitoring

Define another function that will monitor the memory usage. By regularly tracking memory utilization, you can detect memory leaks, resource-hungry processes, and potential bottlenecks. This method prevents system slowdowns, crashes, and outages.

 def check_memory_usage(threshold=80):
    memory_usage = psutil.virtual_memory().percent

    if memory_usage > threshold:
        message = f"High memory usage detected: {memory_usage}%"
        log_message(message)
        print_alert(message)

Similar to the CPU usage check, you set a threshold for high memory usage. If memory usage surpasses the threshold, it logs and prints an alert.

Disk Space Monitoring

Define a function that will monitor the disk space. By continuously monitoring the availability of disk space, you can address potential issues stemming from resource depletion. Running out of disk space can result in system crashes, data corruption, and service interruptions. Disk space checks help ensure that there is sufficient storage capacity.

 def check_disk_space(path='/', threshold=75):
    disk_usage = psutil.disk_usage(path).percent

    if disk_usage > threshold:
        message = f"Low disk space detected: {disk_usage}%"
        log_message(message)
        print_alert(message)

This function examines the disk space usage of a specified path. The default path is the root directory /. If disk space falls below the threshold, it logs and prints an alert.

Network Traffic Monitoring

Define a final function that will monitor your system's data flow. It will help in the early detection of unexpected spikes in network traffic, which could be indicative of security breaches or infrastructure issues.

 def check_network_traffic(threshold=100 * 1024 * 1024):
    network_traffic = psutil.net_io_counters().bytes_recv +\
                      psutil.net_io_counters().bytes_sent

    if network_traffic > threshold:
        message = f"High network traffic detected: {network_traffic:.2f} MB"
        log_message(message)
        print_alert(message)

The function monitors network traffic by summing the bytes sent and received. The threshold is in bytes. If network traffic exceeds the threshold, it logs and prints an alert.

Implementing Monitoring Logic

Now that you have the health check functions, simply call each one in turn from a controller function. You can print output and log a message each time this overall check runs:

 # Function to run health checks
def run_health_checks():
    print("Monitoring the system...")
    log_message("Running system health checks...")

    check_cpu_usage()
    check_memory_usage()
    check_disk_space()
    check_network_traffic()

    log_message("Health checks completed.")

This function runs all health checks, providing a unified view of your system's health status.

Scheduling Automated Checks and Running the Program

To automate the monitoring at specific intervals, you will use the schedule library. You can adjust the interval as needed.

 # Schedule health checks to run every minute 
schedule.every(1).minutes.do(run_health_checks)

Now run the system monitoring process in a continuous loop.

 # Main loop to run scheduled tasks
while True:
    schedule.run_pending()
    time.sleep(1)

This loop continuously checks for scheduled tasks and executes them when their time comes. When you run the program the output is as follows:

Output of a system monitoring program in an IDE

The program records the monitoring logs on the system_monitor.log file and displays an alert on the terminal.

Advancing the System Monitoring Program

These monitoring checks are not the only ones that psutil supports. You can add more monitoring functions, using a similar approach, to suit your requirements.

You can also improve the reporting function to use email rather than outputting a simple message on the console.