Cluster monitoring with Grafana#

Overview#

A Grafana dashboard lets you visualize, query, and explore your system’s metrics, and gives you access to your system logs and traces. Cerebras provides two Grafana dashboards tailored to Cerebras clusters: the Cluster Management Dashboard and the WsJob Dashboard.

Cluster Management Dashboard#

The Cluster Management Dashboard shows the overall state of the cluster. It includes the following:

  • CS-2 systems

    • Overall CS-2 systems status and errors

  • Nodes

    • Kubernetes node warnings and errors

    • Space usage health

  • Network

    • Hardware NIC errors

    • Kubernetes CNI errors

  • Cluster Management

    • Errors for Cerebras Cluster Management services

    • Kubernetes system services

  • Alerts

    • Current alerts on the cluster

The following figure displays Cerebras’s Cluster Management dashboard:

../../_images/cluster-mgmt-dashboard.png

WsJob Dashboard#

The WsJob Dashboard displays job-related metrics and is useful when checking a job’s resource usage and any software or hardware errors relevant to this job.

There are six panes in this dashboard:

  • Job overview

    • Displays an overview of memory, CPU, and network bandwidth numbers for all replicas of the selected job

  • Job associated software errors

    • Displays job runtime errors (currently only shows OOMKilled status)

  • Job associated hardware errors

    • Displays any NIC, CS-2, or physical node that is assigned to this job and reports errors during job execution

  • Replica view

    • Displays memory, CPU, and network bandwidth numbers for each replica_id of the selected replica_type in each chart. replica_type represents a type of service process for a given job. It can be one of these types: weight, command, activation, broadcastreduce, chief, worker, coordinator. replica_id identifies a specific replica of a given job and replica type

  • Assigned nodes

    • Displays the status of the physical nodes assigned to the chosen replica_type and replica_id

  • MemX performance

    • Shows iteration-based performance, iteration time, cross-iteration time, and backward iteration time

The following figure displays Cerebras’s WsJob dashboard:

../../_images/perf_dashboard_full_view.png

On the left you can find options to search for particular metrics and view metric details.

The following filters are also available:

  • wsjob

    Indicates the ID of the weight-streaming run, which is used to select between different runs on a particular system

  • replica_type

    Selects between activation server metrics, weight server metrics, or metrics for all servers

  • assigned_systems

    Indicates the system name being shown in the logs

Other useful fields are model, job_type, and replica_id.
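As a rough illustration of how these filters might be preselected, the sketch below builds a dashboard URL with query parameters. It assumes that the filters are implemented as Grafana template variables, which Grafana reads from var-<name>=<value> query parameters; this page does not confirm that. The helper name wsjob_dashboard_url, the job ID wsjob-example, and the <dashboard-uid>/<dashboard-slug> path segments are placeholders, not values taken from a real cluster.

```shell
# Hypothetical helper: build a WsJob dashboard URL that preselects two filters.
# Assumption: "wsjob" and "replica_type" are Grafana template variables, which
# Grafana accepts as var-<name>=<value> query parameters.
# Arguments: <dashboard base URL> <wsjob ID> <replica_type>
wsjob_dashboard_url() {
    base="$1"
    wsjob="$2"
    replica_type="$3"
    printf '%s?var-wsjob=%s&var-replica_type=%s\n' "$base" "$wsjob" "$replica_type"
}

# Placeholder values; copy the real base URL from your browser's address bar.
wsjob_dashboard_url \
    'https://grafana.mycluster.example.com/d/<dashboard-uid>/<dashboard-slug>' \
    wsjob-example weight
```

If the URL has no effect on the filters, the variable names on your cluster likely differ from the ones assumed here.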

Prerequisites#

You need access to the user node in the Cerebras Wafer-Scale Cluster. Contact your system administrator if you run into any issues with the system configuration.

You can run a port-forwarding SSH session through the user node from your machine with this command:

$ ssh -L 8443:grafana.<cluster-name>.<domain>.com:443 myUser@usernode

Note

This command uses the local port 8443 to forward the traffic. You can choose any unoccupied port on your machine.
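As a small sketch of the note above, the following helper assembles the same port-forwarding command for a local port of your choice. The function name grafana_tunnel_cmd and the example values (9443, mycluster, example) are placeholders for illustration, not part of any Cerebras tooling. The helper only prints the command, so you can review it before running it.

```shell
# Hypothetical helper: print the ssh port-forwarding command for a chosen local port.
# Arguments: <local port> <cluster-name> <domain> <user>@<usernode>
grafana_tunnel_cmd() {
    local_port="$1"
    cluster="$2"
    domain="$3"
    login="$4"
    printf 'ssh -L %s:grafana.%s.%s.com:443 %s\n' \
        "$local_port" "$cluster" "$domain" "$login"
}

# Example with placeholder values; substitute your own cluster details.
grafana_tunnel_cmd 9443 mycluster example myUser@usernode
# prints: ssh -L 9443:grafana.mycluster.example.com:443 myUser@usernode
```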

How to get access?#

Links are accessible from the General/Cerebras tab. The following figure displays a Cerebras dashboard:

../../_images/dashboards-intro.png

Steps to get access#

1. Ask your system administrator to set up the Grafana database. URLs have the format grafana.CLUSTER-NAME.DOMAIN.com, for example: grafana.mb-systemf102.cerebras.com

2. Get authentication credentials for Grafana (username and password) from your system administrator.

3. Add the Grafana TLS certificate to your browser keychain. The certificate is located at /opt/cerebras/certs/grafana_tls.crt on the user node, where it is copied during the user node installation process. Download it to your local machine and add it to your browser keychain.

In the Chrome browser on macOS:

  1. Go to Preferences -> Privacy and Security -> Security -> Manage Certificates

  2. Add the downloaded certificate (grafana_tls.crt) to the System keychain certificates. Make sure to set Always Trust for this certificate

  3. Next, edit your local machine’s /etc/hosts file so that the Grafana hostname resolves to the IP address of the user node: <USERNODE_IP> grafana.<cluster-name>.<domain>.com

  4. Finally, navigate in your browser to https://grafana.<cluster-name>.<domain>.com to access the Grafana dashboards
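The certificate and hosts-file steps above can also be sketched from a terminal. The scp and openssl commands are shown as comments because they need access to your user node. The helper hosts_entry and the values 192.0.2.10, mycluster, and example are placeholders for illustration only.

```shell
# Copy the certificate from the user node (path as given in step 3):
#   scp myUser@usernode:/opt/cerebras/certs/grafana_tls.crt .
# Optionally inspect it before trusting it (standard openssl x509 options):
#   openssl x509 -in grafana_tls.crt -noout -subject -enddate

# Hypothetical helper: format the /etc/hosts line that maps the Grafana
# hostname to the user node.
# Arguments: <user node IP> <cluster-name> <domain>
hosts_entry() {
    printf '%s grafana.%s.%s.com\n' "$1" "$2" "$3"
}

# Placeholder values; substitute your user node IP and cluster details.
hosts_entry 192.0.2.10 mycluster example
# prints: 192.0.2.10 grafana.mycluster.example.com

# To append the line (requires administrator rights):
#   hosts_entry 192.0.2.10 mycluster example | sudo tee -a /etc/hosts
```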

Viewing performance metrics using the WsJob dashboard#

You can view cluster iteration-performance metrics by tracking update times across the weight servers.

The current dashboard implementation shows iteration time, forward-iteration time, backward-iteration time, cross-iteration time, and input starvation.

  • Iteration time

    Indicates the time from the end of iteration i-1 on the weight servers to the end of iteration i on the weight servers.

  • Forward-iteration time

    Indicates the time spent in iteration i during the forward pass.

  • Backward-iteration time

    Indicates the time spent in iteration i during the backward pass.

  • Cross-iteration time

    Indicates the time from the last gradient receive of an iteration to the first weight send of the next. A high value indicates an optimizer performance bottleneck.

  • Input starvation

    Indicates the time spent waiting on the framework to receive activations.

These statistics are shown in the following image and can be used to identify performance bottlenecks in the training process:

../../_images/perf_dashboard_perf.png
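To make the iteration-time definition concrete, the following sketch computes it from a list of end-of-iteration timestamps: each value is the difference between the end of iteration i and the end of iteration i-1. The helper name iteration_times and the timestamps are made up for illustration; the dashboard computes these values for you, and this is not how it is implemented.

```shell
# Hypothetical helper: read one end-of-iteration timestamp (in seconds) per
# line and print the iteration time for each consecutive pair.
iteration_times() {
    LC_ALL=C awk 'NR > 1 { printf "%.3f\n", $1 - prev } { prev = $1 }'
}

# Made-up end-of-iteration timestamps for four consecutive iterations.
printf '%s\n' 10.0 12.5 15.5 18.0 | iteration_times
# prints:
# 2.500
# 3.000
# 2.500
```

In this made-up trace the second interval is longer than its neighbors; on the dashboard, you would then compare the forward, backward, and cross-iteration times for the same iteration to see where the extra time went.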

Viewing utilization metrics using the WsJob Dashboard#

The Replica view displays memory, CPU, and network bandwidth numbers for each replica_id of the selected replica_type in each chart. replica_type represents a type of service process for a given job. It can be one of these types: weight, command, activation, broadcastreduce, chief, worker, and coordinator.

1. Transmit bandwidth indicates the maximum and mean network egress speeds for each activation server. Comparing the transmission speed of a lagging node against the others can help you determine whether a job is network-bound.

The following figure shows that most weight servers achieve a network transmit speed of ~420 MB/s:

../../_images/perf_dashboard_transmit_bw.png

2. Receive bandwidth denotes the ingress speeds for each supporting server. In this example, the weight servers have an average ingress speed of around 220 MB/s.

The following figure shows the receive bandwidth metric:

../../_images/perf_dashboard_receive_bw.png

3. CPU usage shows the CPU utilization percentage for each weight server. In this case, the CPUs are only 2-3% utilized.

The following figure shows the CPU usage metric:

../../_images/perf_dashboard_cpu.png

4. Memory usage indicates the maximum and mean amounts of memory each weight server uses over time. This can help you determine whether the weight servers are memory-bound. For more information on memory requirements, visit Resource requirements for parallel training and compilation.

The following figure shows the memory usage metric:

../../_images/perf_dashboard_memory.png

5. You can use the Grafana interface to show individual metrics for each node. For example, these are the views for CPU and memory usage per node:

The following figure shows the CPU usage per node metric:

../../_images/perf_dashboard_cpu_node.png

The following figure shows the memory usage per node metric:

../../_images/perf_dashboard_memory_node.png
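When judging whether a job is network-bound, it can help to express a bandwidth reading from the transmit or receive charts as a share of the link capacity. The sketch below converts a value in MB/s to Gbit/s and compares it with a capacity you supply. The helper name link_utilization is hypothetical, the 100 Gbit/s capacity is an assumed figure rather than a property of any particular cluster, and MB is assumed to mean 10^6 bytes.

```shell
# Hypothetical helper: convert a throughput in MB/s to Gbit/s and report it
# as a share of an assumed link capacity.
# Arguments: <throughput in MB/s> <link capacity in Gbit/s>
link_utilization() {
    LC_ALL=C awk -v mbps="$1" -v cap="$2" 'BEGIN {
        gbit = mbps * 8 / 1000   # 1 MB = 8 Mbit, 1000 Mbit = 1 Gbit
        printf "%.2f Gbit/s (%.1f%% of %s Gbit/s)\n", gbit, 100 * gbit / cap, cap
    }'
}

# The ~420 MB/s transmit speed from the example above, against an ASSUMED
# 100 Gbit/s link; replace 100 with the capacity of your own NICs.
link_utilization 420 100
# prints: 3.36 Gbit/s (3.4% of 100 Gbit/s)
```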