Backend#
This page covers how to set up a backend or device for the
Trainer class. A cerebras.pytorch.backend
holds the configuration of the device and the other settings used during a run.
The device is the hardware that the workflow runs on.
Prerequisites#
Before you begin, read Trainer Overview and Trainer Configuration Overview, which explain the basics of running Model Zoo models. This document uses the tools and configurations outlined in those pages.
Configure the device#
To configure the device used by the Trainer, specify
one of "CSX", "CPU", or "GPU", either in the YAML configuration or in Python.
trainer:
  init:
    device: "CSX"
    ...
  ...
from cerebras.modelzoo import Trainer
trainer = Trainer(
    device="CSX",
    ...,
)
...
Note
Setting device still creates a cerebras.pytorch.backend instance,
just with default settings. To configure any other backend setting, you must
specify those parameters via the backend key instead.
Limitations#
Once a device is set, every other Trainer instance
must use the same device type. You cannot mix device types. For
example, a configuration like this:
# THIS CONFIGURATION IS INVALID
trainer:
- trainer:
    init:
      device: "CSX"
      ...
    ...
- trainer:
    init:
      device: "CPU"
      ...
    ...
from cerebras.modelzoo import Trainer
# THIS CONFIGURATION IS INVALID
trainer1 = Trainer(
    device="CSX",
    ...,
)
trainer2 = Trainer(
    device="CPU",
    ...,
)
...
will result in the following error:
RuntimeError: Cannot instantiate multiple backends. A backend with type CSX has already been instantiated.
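Because the error only appears once the second backend is instantiated, it can help to check a multi-trainer configuration up front. The following is a minimal sketch that works on the parsed configuration as plain dictionaries. The helper names `device_type_of` and `find_device_types` are hypothetical and are not part of the Model Zoo API. The sketch assumes the layout shown on this page: a top-level `trainer` list whose entries each hold an `init` mapping with either a `device` key or a `backend.backend_type` key.

```python
from typing import Any, Dict, List, Optional


def device_type_of(init: Dict[str, Any]) -> Optional[str]:
    """Return the device type named by an `init` mapping, or None if unset."""
    if "device" in init:
        return init["device"]
    # Fall back to the backend dictionary, if one is given.
    backend = init.get("backend") or {}
    return backend.get("backend_type")


def find_device_types(trainers: List[Dict[str, Any]]) -> List[str]:
    """Return the distinct device types requested, in first-seen order."""
    seen: List[str] = []
    for entry in trainers:
        init = entry.get("trainer", {}).get("init", {})
        device = device_type_of(init)
        if device is not None and device not in seen:
            seen.append(device)
    return seen


# Mirrors the invalid configuration above: one CSX trainer, one CPU trainer.
config = {
    "trainer": [
        {"trainer": {"init": {"device": "CSX"}}},
        {"trainer": {"init": {"device": "CPU"}}},
    ]
}

device_types = find_device_types(config["trainer"])
if len(device_types) > 1:
    print(f"Mixed device types requested: {device_types}")
```

A check like this only reports the problem earlier; the Trainer remains the authority on which configurations are valid.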
Configure the backend#
To configure the backend used by the Trainer, provide
the settings for a cerebras.pytorch.backend instance.
In YAML, the backend key is expected to be a dictionary whose keys are used
to construct a cerebras.pytorch.backend instance.
trainer:
  init:
    backend:
      backend_type: "CSX"
      cluster_config:
        num_csx: 4
        mount_dirs:
        - /path/to/dir1
        - /path/to/dir2
        ...
      ...
    ...
  ...
In Python, construct a cerebras.pytorch.backend instance and pass it to
the backend argument.
import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer
trainer = Trainer(
    backend=cstorch.backend(
        backend_type="CSX",
        cluster_config=cstorch.distributed.ClusterConfig(
            num_csx=4,
            mount_dirs=["/path/to/dir1", "/path/to/dir2"],
            ...,
        ),
        ...,
    ),
    ...,
)
...
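To make the relationship between the YAML dictionary and the Python call concrete, here is a minimal sketch of how dictionary keys can map onto constructor keyword arguments. It deliberately uses stand-in dataclasses (`StandInBackend`, `StandInClusterConfig`) and a hypothetical `build_backend` helper rather than the real cerebras.pytorch classes, so it runs anywhere and makes no claim about how the Trainer performs this conversion internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StandInClusterConfig:
    """Stand-in for a cluster configuration; only two fields are modeled."""
    num_csx: int
    mount_dirs: List[str] = field(default_factory=list)


@dataclass
class StandInBackend:
    """Stand-in for a backend object; not the real cerebras.pytorch.backend."""
    backend_type: str
    cluster_config: Optional[StandInClusterConfig] = None


def build_backend(config: Dict[str, Any]) -> StandInBackend:
    """Turn a backend dictionary into a stand-in backend object."""
    params = dict(config)  # shallow copy so the caller's dict is untouched
    cluster = params.pop("cluster_config", None)
    if cluster is not None:
        # The nested dictionary becomes keyword arguments of the nested object.
        cluster = StandInClusterConfig(**cluster)
    # The remaining top-level keys become keyword arguments of the backend.
    return StandInBackend(cluster_config=cluster, **params)


backend_config = {
    "backend_type": "CSX",
    "cluster_config": {
        "num_csx": 4,
        "mount_dirs": ["/path/to/dir1", "/path/to/dir2"],
    },
}

backend = build_backend(backend_config)
print(backend)
```

The point of the sketch is the shape of the mapping: each YAML key corresponds to an argument of the same name in the Python example above.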
Limitations#
Instantiating multiple backends with different device types is not supported. Doing so raises this error:
RuntimeError: Cannot instantiate multiple backends. A backend with type CSX has already been instantiated.
This means that when you construct one or more Trainer
instances, all of the backends you instantiate must share a single device type.
However, you can change other backend parameters between Trainer
instances. For example:
In YAML, give each Trainer its own backend dictionary with the same
backend_type.
trainer:
- trainer:
    init:
      backend:
        backend_type: "CSX"
        cluster_config:
          num_csx: 4
          mount_dirs:
          - /path/to/dir1
          - /path/to/dir2
          ...
        ...
      ...
    ...
- trainer:
    init:
      backend:
        backend_type: "CSX"
        cluster_config:
          num_csx: 2
          num_workers_per_csx: 1
          mount_dirs:
          - /path/to/dir1
          - /path/to/dir2
          ...
        ...
      ...
    ...
In Python, construct a cerebras.pytorch.backend instance, pass it to
the backend argument, and update its cluster_config between Trainer instances.
import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer
backend = cstorch.backend(
    "CSX",
    cluster_config=cstorch.distributed.ClusterConfig(
        num_csx=4,
        mount_dirs=["/path/to/dir1", "/path/to/dir2"],
        ...,
    ),
)
trainer1 = Trainer(
    backend=backend,
    ...
)
backend.cluster_config.num_csx = 2
backend.cluster_config.num_workers_per_csx = 1
trainer2 = Trainer(
    backend=backend,
    ...
)
Mutual Exclusivity#
The device and backend arguments are mutually exclusive. When
initializing a Trainer, set one of them
but not both. If both are set, you will see an error like this:
ValueError: backend and device are mutually exclusive arguments of Trainer. Please only provide one or the other
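The rule can also be checked on a parsed configuration before a Trainer is constructed. The sketch below is illustrative only: `check_init` is a hypothetical helper, its error message is its own rather than the Trainer's, and it assumes the `init` mapping is available as a plain dictionary with optional `device` and `backend` keys.

```python
from typing import Any, Dict


def check_init(init: Dict[str, Any]) -> None:
    """Raise ValueError if an `init` mapping sets both device and backend."""
    has_device = init.get("device") is not None
    has_backend = init.get("backend") is not None
    if has_device and has_backend:
        raise ValueError(
            "init sets both 'device' and 'backend'; provide only one of them"
        )


# Either key on its own passes the check.
check_init({"device": "CSX"})
check_init({"backend": {"backend_type": "CSX"}})

# Setting both is rejected.
try:
    check_init({"device": "CSX", "backend": {"backend_type": "CSX"}})
except ValueError as err:
    print(f"Rejected: {err}")
```

The helper only flags the case where both keys are present; it takes no position on what happens when neither is set.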
Conclusion#
That covers configuring the device or backend for
the Trainer. You should now understand how to configure the
Trainer with a device or a backend, and which common errors you may run into.
Further Reading#
To learn more about how you can use the Trainer
in some core workflows, you can check out:
To learn more about how you can extend the capabilities of the
Trainer class, you can check out: