Running jobs in parallel with the same model_dir causes issues
Observed error
Running several jobs in parallel with the same model_dir may cause issues.
Explanation
When the same model_dir is used by multiple parallel runs, a race condition can occur (two operations acting on the same object at the same time), because each run attempts to write its file-backed tensors to the same directory.
Workaround
Use a different model_dir for each run, or
Add a unique artifact sub-directory for each run, for example one named with a generated uuid.
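The second option could look roughly like the sketch below. It uses only the Python standard library; the helper name `unique_model_dir` and the `run_` prefix are hypothetical and are not part of any framework API. How the resulting path is passed to a job (for example as its model_dir argument) depends on your launch script.

```python
import uuid
from pathlib import Path


def unique_model_dir(base_dir: str) -> Path:
    """Create and return a fresh sub-directory of base_dir named with a random UUID.

    Hypothetical helper for illustration: each parallel run calls this once
    and writes its artifacts only under the returned path.
    """
    run_dir = Path(base_dir) / f"run_{uuid.uuid4().hex}"
    # exist_ok=False makes a name collision (extremely unlikely) fail loudly
    # instead of silently sharing a directory with another run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

Each run would then call something like `unique_model_dir("model_dir")` at startup and use the returned path in place of the shared directory, so no two runs write to the same location.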