device support #39

Closed · rgommers opened this issue Sep 13, 2020 · 8 comments
Labels: topic: Device Handling

@rgommers (Member) commented Sep 13, 2020

For array creation functions, device support will be needed unless we intend to only support operations on the default device. Otherwise, any function that creates a new array (e.g. one that creates its output array with empty() before filling it with the results of some computation) will place that new array on the default device, and an exception will be raised if an input array is on a non-default device.
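
As an illustration (a sketch only, using a hypothetical array-API namespace xp rather than any real library):

def some_computation(x):
    # Without device support, the output array is created on the default
    # device (typically the CPU), regardless of where x lives:
    out = xp.empty(x.shape, dtype=x.dtype)
    # If x is on a non-default device (e.g. a GPU), filling out from x mixes
    # devices, and the library has to either raise or copy implicitly:
    out[...] = x * 2
    return out

# With a device= keyword, the output can be allocated alongside the input:
#     out = xp.empty(x.shape, dtype=x.dtype, device=x.device)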

We discussed this in the Aug 27th call, and the preference was to do something PyTorch-like, perhaps a simplified version to start with (we may not need the context manager part), as the most robust option. Summary of some points that were made:

  • TensorFlow has an issue where its .shape attribute is also a tensor, and that interacts badly with its context manager approach to specifying devices - because metadata like .shape typically should live on the host, not on an accelerator.
  • PyTorch uses a mix of a default device, a context manager, and device= keywords
  • JAX also has a context manager-like approach; it has a global default that can be set, and then pmaps can be decorated to override that. The difference from other libraries that use a context is that JAX is fairly (too) liberal about implicit device copies.
  • It'd be best for operations where data is not all on the same device to raise an exception. Implicit device transfers make it very hard to get a good performance story.
  • Propagating device assignments through operations is important.
  • Control over where operations get executed is important; trying to be fully implicit doesn't scale to situations with multiple GPUs
  • It may not make sense to add syntax for device support for libraries that only support a single device (i.e., CPU).

Links to the relevant docs for each library:

Next step should be to write up a proposal for something PyTorch-like.

@szha (Member) commented Sep 14, 2020

Control over where operations get executed is important; trying to be fully implicit doesn't scale to situations with multiple GPUs

+1 to this. We will need to define an operation to explicitly move data between devices, along with a canonical way of specifying the target device.
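
For illustration only, given an array x, such an explicit transfer could look like the following sketch (the Device name and .to() method anticipate the proposal further down this thread and are not settled API):

gpu1 = xp.Device('gpu:1')   # canonical, library-independent device spec
y = x.to(gpu1)              # explicit copy of x to that device; nothing moves implicitly
assert y.device == gpu1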

@oleksandr-pavlyk (Contributor)

Goals

It is not uncommon nowadays to get a laptop that has a CPU, an integrated GPU and a discrete GPU. The device selection specification should allow selecting among the supported devices, while ensuring that each computational device can be unambiguously referenced.

Moreover, the specification should allow selecting among several graphics cards, or more generally accelerators, of the same kind (say, among multiple GPUs from the same vendor).

Statements about offloading

It is worth noting that a device really refers to a (hardware, driver) pair: an NVIDIA card can be programmed using either CUDA or OpenCL, an AMD card using either ROCm or OpenCL, an Intel card using either Level Zero or OpenCL, etc. It may not be very important for a Python user to be able to select among available drivers dynamically; it is reasonable that an array implementation selects a driver for the device at the module initialization stage.

To work with a device, the associated runtime keeps a structure, a context, that records the state of the device (c.f. what-is-cuda-context, open-cl-context) as well as the information needed for synchronization between the host and the device. An array library may silently create a context for a device. CUDA has a context created by the CUDA runtime or the user for each CUDA device (c.f. cuCtxCreate), while in OpenCL/SYCL one can in principle create a context for a subset of devices from the same platform. In SYCL, a context also registers the asynchronous exception handler, which an array library implementation may set, e.g. context(const vector_class<device> &deviceList, async_handler asyncHandler = {}).

The runtime facilitates imposing an order on the sequence of tasks to be offloaded for asynchronous execution on a device. CUDA provides a stream, and SYCL provides a queue. A CUDA stream executes a single operation sequence on a CUDA device (c.f. torch_cuda_stream). Programming CUDA requires understanding the computation graph, and multiple concurrent streams may be used to execute segments with no data dependency. A SYCL queue allows the user to specify task dependencies at scheduling time, and the SYCL runtime then executes the tasks honoring those dependencies [see section 4.9 of the 2020 SYCL provisional spec].

Ultimately, the user's selection of a device must allow the array library to locate a queue/stream to submit tasks to, and for optimal performance streams/queues should be reused once created. Reusing a SYCL queue has the additional benefit of keeping track of data dependencies while executing concurrently.

A Python user wishing to offload a computation to a GPU should be aware of streams/queues, but for most users these will be created once per device by the array implementation and reused throughout the session. Power users may want to explore using multiple streams, or defining dependencies between SYCL tasks.

Toward the proposal

An array library targeting devices needs to know the device an array was allocated on. In the case of SYCL and OpenCL it also needs to know the context to which the memory was bound, for example to be able to free the memory.

In the case of OpenCL, one can use the function clGetMemObjInfo to get the OpenCL context that a cl_mem memory object was created with.

However, when using SYCL 2020's USM pointers, the context cannot be easily looked up, so an array library must either store it with each array object instance, or store it in a global structure associated with the device so that all operations involving the device use that context, as is the case with PyTorch.

Thus, to address a device in an array library, a user needs to specify a library-specific layer in which the library stores backend-specific objects (context, queue/stream, etc.) associated with the supported devices (the backend here being the software layer used to work with devices, e.g. OpenCL, CUDA, SYCL), as well as an identifier of the device within that backend.

A library may choose to query its backend for all addressable devices, store them in an array, and refer to devices by their position therein. Such positions should be deterministic across Python sessions.

For an array library that uses the SYCL runtime as its backend, devices can be further differentiated by kind: GPU devices, accelerator devices, CPU devices, custom devices. It would be appropriate for the array library to keep a separate array of addressable devices per kind.

Thus the device specification emerges as a triple ('backend', 'device_kind', relative_id). Since backends vary between array library implementations, an array library must be able to accept a device specification with some, but not all, elements of the tuple omitted. An underspecified tuple is then understood as a device filter (c.f. the Filter Selector SYCL extension proposal), and the array library chooses the most appropriate matching device at its discretion.

Portable code will then only use (device_kind, relative_id).
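
A rough sketch of how such an underspecified tuple could be parsed from a filter-selector-style string (names and matching rules are illustrative only, loosely modeled on SYCL filter strings such as 'opencl:gpu:0'):

from typing import NamedTuple, Optional

class DeviceFilter(NamedTuple):
    backend: Optional[str] = None      # e.g. 'cuda', 'opencl', 'level_zero'
    device_kind: Optional[str] = None  # e.g. 'cpu', 'gpu', 'accelerator'
    relative_id: Optional[int] = None  # position within that (backend, kind) list

def parse_filter(spec: str) -> DeviceFilter:
    # Parse 'backend:device_kind:relative_id' with leading elements omitted,
    # e.g. 'opencl:gpu:0', 'gpu:1', or just 'gpu'.
    parts = spec.split(':')
    relative_id = None
    if parts and parts[-1].isdigit():
        relative_id = int(parts.pop())
    backend = parts[0] if len(parts) == 2 else None
    device_kind = parts[-1] if parts and parts[-1] else None
    return DeviceFilter(backend, device_kind, relative_id)

print(parse_filter('gpu:1'))         # backend=None, device_kind='gpu', relative_id=1
print(parse_filter('opencl:gpu:0'))  # backend='opencl', device_kind='gpu', relative_id=0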

An array object should carry, or be able to figure out, the device it was allocated on (e.g. by implementing array_instance.device). A user should also be able to specify the device a new array is to be allocated on, which calls for a device= keyword for every function that may create a new array, with some sensible default value (infer device= from inputs, or use a default/current device settable via a context manager or an explicit function).

Computations on device arrays must be submitted to a stream/queue. Implementation kernels must be able to access the data, so the queue must use the same runtime context that was used when the array was allocated. Should this not be the case, the implementation should either raise an exception (preferred) or invoke a copy via the host.

The user should be able to specify the stream/queue to which kernels are to be submitted, as is the case with native libraries (c.f. cublasSetStream in cuBLAS, the executor in SYCL-BLAS, the execution model in oneMKL, dnnl::sycl_interop::make_stream in oneDNN).

Ideally this should be seldom needed, so an optional keyword seems heavyweight, but it could be a solution, with defaults driven by the device array inputs and the library's default queue settings, controllable via a context manager and/or an explicit function.

@rgommers (Member, Author)

It may not be very important for a Python user to be able to select among available drivers dynamically; it is reasonable that an array implementation selects a driver for the device at the module initialization stage.

I agree; this is what libraries currently do, so there's no need for something more complex.

Ultimately, the user's selection of a device must allow the array library to locate a queue/stream to submit tasks to, and for optimal performance streams/queues should be reused once created.

Stream management seems like something that is done at the implementation level, and does not show up in a Python-level API.

A Python user wishing to offload a computation to a GPU should be aware of streams/queues

I'm not sure this is right. It's of course very helpful to understand what's happening under the hood; it can allow users to write more performant code. But it's not strictly necessary, and the vast majority of Python end users won't actually understand this while happily writing ML code to run on a GPU, for example.

Note that there typically are ways to control this from Python (e.g., for PyTorch, torch.cuda.synchronize, torch.cuda.stream, etc.)

Thus, to address a device in an array library, a user needs to specify a library-specific layer in which the library stores backend-specific (...) objects

This doesn't follow from what you wrote before, I think. An array library needs to deal with this (e.g., as you wrote, "store it with each array object instance"), but that should be an implementation detail.

A library may choose to query its backend for all addressable devices, store them in an array, and refer to devices by their position therein. Such positions should be deterministic across Python sessions.

Device IDs should already be deterministic, right? E.g. writing 'gpu:0' (or 'cuda:0', depending on which library is used) should give you the GPU with the actual device ID 0, the same one nvidia-smi shows you, and which you can also control with CUDA_VISIBLE_DEVICES.

Portable code will then only use (device_kind, relative_id).

Agreed it needs kind + ID. No need for it to be relative though I'd think.

An array object should carry, or be able to figure out, the device it was allocated on (e.g. by implementing array_instance.device). A user should also be able to specify the device a new array is to be allocated on, which calls for a device= keyword for every function that may create a new array, with some sensible default value

Agree with all of this.

Computations on device arrays must be submitted to a stream/queue. [...]

All true, but out of scope I'd say; it's per-library and may be implementation-specific.

The user should be able to specify the stream/queue to which kernels are to be submitted

Even though existing libraries do offer APIs for stream control, it's not often used, and it's not clear that we can point at anything as being best practice / the right thing to adopt. I'd say we should put this out of scope. Mixing multiple libraries + stream control is immature; I think we also had a discussion around this for the array interchange protocol (which doesn't contain stream info).

@rgommers (Member, Author) commented Nov 30, 2020

Some thoughts on API

Syntax (see the sketch after this list):

  1. A device= keyword for creation functions
  2. That device= keyword should take a string representation as well as an instance of a device object.
  3. That device object itself should take the same string representation in its constructor
  4. That device object should also provide a string attribute, to give a portable representation which is again a valid device-specifying string (I propose device.str). TBD: also provide .kind and .index separately, or do not rely on any other properties for device instances?
  5. The string representation should be 'device_kind:id', with ':id' optional (e.g. it doesn't apply to 'cpu'). All lower-case, with kind strings 'cpu', 'gpu' (note, better than 'cuda', and applies to AMD GPUs too), 'tpu'. No other strings needed at this time (?).
  6. Moving arrays to a different device needs to be explicit, with a .to(device) method
  7. The array object should get a .device property, which returns a device object instance
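
A minimal sketch of how these pieces could fit together, using a hypothetical namespace xp (the Device class name and constructor spelling are assumptions; the numbered comments refer to the points above):

d = xp.Device('gpu:0')                # (3) the constructor takes the string form
x = xp.zeros((2, 3), device=d)        # (1)/(2) creation functions accept device=
y = xp.ones((2, 3), device='gpu:0')   # ... or the equivalent string

assert x.device.str == 'gpu:0'        # (4)/(7) portable string form via .device.str
z = x + y                             # same device in -> same device out
x_cpu = x.to(xp.Device('cpu'))        # (6) device transfers are explicit
# x_cpu + y would raise: operands are on different devices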

Behaviour:

  • Operations involving one or more arrays on the same device should return arrays on that same device
  • Operations involving arrays on different devices should raise an exception
  • The default device will normally be the CPU for libraries that support it, but this is not a requirement. There must be a default device, but it can be anything. Note: the default device can vary based on available devices, e.g. for JAX it's GPU 0 if available, CPU otherwise. This should be documented in the standard as being possible and implementation-defined, with the recommendation to use explicit device passing for portability.
  • device object instances are only meant to be consumed by the library that produced them - the string attribute can be used for portability between libraries.

Out of scope:

  • Setting a default device globally
  • Stream/queue control
  • Distributed allocation
  • Memory pinning

To add or not add a context manager:

A context manager for controlling the default device is present in all libraries except NumPy. Concerns are:

  • (from issue description): TensorFlow has an issue where its .shape attribute is also a tensor, and that interacts badly with its context manager approach to specifying devices - because metadata like .shape typically should live on the host, not on an accelerator.
  • A context manager can be tricky to use at a high level, since it may affect library code below function calls (non-local effects). See, e.g., [FR] torch context with default device & dtype pytorch/pytorch#27878 for a current PyTorch discussion on a good context manager API.

The main upside would probably be that since most libraries have context managers now, it'd be nice for their users to get a context manager in this API standard as well - it makes migrating already-written code easier.

@rgommers (Member, Author)

I looked at SYCL some more, since I was only very vaguely aware of what it is. @oleksandr-pavlyk please correct me if I'm wrong, but as far as I can tell it's orthogonal to anything one would want a Python user or array-consuming library author to know about. They need to know about the actual hardware they can use to execute their code, but (beyond install headaches) won't really care whether it's CUDA/OpenCL/ROCm/SYCL/whatever under the hood.

@oleksandr-pavlyk (Contributor)

With SYCL, one writes a kernel once, compiles it with a SYCL compiler to an IR, and can then submit it to different queues targeting different devices (CPU, GPU, FPGA, etc.).

This example constructs a Python extension, compiled with Intel's DPCPP compiler, to compute column-wise sums of an array.

Running it on the CPU or GPU is a matter of changing the queue the work is submitted to:

with dpctl.device_context('opencl:gpu'):
    print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
    print(sb.columnwise_total(X))

with dpctl.device_context('opencl:cpu'):
    print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
    print(sb.columnwise_total(X))

An array-consuming library author need not be aware of this, I thought, just as they need not be aware of which array implementation is powering the application.

rgommers added a commit to rgommers/array-api that referenced this issue Dec 3, 2020
@agarwal-ashish

Some thoughts and clarifications:

TensorFlow's ndarray.shape returning an ndarray is a behavior that will be rolled back. Tensor.shape's behavior is to return a TensorShape object which can represent incomplete shapes as well, and that will carry over to ndarray as well.

It is not clear why device needs to be part of the array creation APIs. Context managers can allow mutating global state representing the current device, which can then be used by the runtime for these calls. Device-setting code would only execute when the context manager is entered/exited. Passing device per call can be an unnecessary cost. Also, it may force the current device to be known, which can be hard for generic library code and may require querying the existing device from somewhere.

Also, I am not sure we should enforce constraints on where inputs and outputs can be placed for an operation. Such constraints can make it harder to write portable library code where you don't control the inputs and may have to start by copying all inputs to the same device. The TensorFlow runtime is allowed to copy inputs to the correct device if needed. Also, there are policies on hard/soft device placement which allow TensorFlow to override user-specified placement in cases where the placement is infeasible or sub-optimal. One can further imagine dynamic placement scenarios in the case of async execution.

In addition, outputs may sometimes need to reside on a different device than the inputs. Examples often involve operations on metadata (shape, size) that typically resides on the host.

Device placement is generally a "policy" and I think we should leave it as a framework detail instead of having it in the API specification. I am not opposed to reserving a device property in the ndarray API, but I don't think we should put constraints on how the device placement should be done.

@rgommers (Member, Author) commented Dec 3, 2020

TensorFlow's ndarray.shape returning an ndarray is a behavior that will be rolled back. Tensor.shape's behavior is to return a TensorShape object which can represent incomplete shapes as well, and that will carry over to ndarray as well.

That's good to know. In that case I'll remove the note on that; no point in mentioning it if it's being phased out.

It is not clear why device needs to be part of the array creation APIs. Context managers can allow mutating global state representing the current device, which can then be used by the runtime for these calls.

The "mutating global state" points at the exact problem with context managers. Having global state generally makes it harder to write correct code. For the person writing that code it may be fine to keep that all in their head, but it affects any library call that gets invoked. Which is probably still fine in single-device situations (e.g. switch between CPU and one GPU), but beyond that it gets tricky.

The consensus of our conversation in September was that a context manager isn't always enough, and that the PyTorch model was more powerful. That still left open whether we should also add a context manager though.

Passing device per call can be an unnecessary cost.

Re cost - do you mean cost in verbosity? Passing a keyword through shouldn't have a significant performance cost.

Also, it may force the current device to be known, which can be hard for generic library code and may require querying the existing device from somewhere.

I think the typical pattern would be to either use the default, or obtain it from the local context. E.g.

def somefunc(x):
    ....
    # need some new array
    x2 = xp.linspace(0, 1, ...., device=x.device)

And only in more complex situations would the actual device need to be known explicitly.

Also, I am not sure we should enforce constraints on where inputs and outputs can be placed for an operation. Such constraints can make it harder to write portable library code where you don't control the inputs and may have to start by copying all inputs to the same device. The TensorFlow runtime is allowed to copy inputs to the correct device if needed.

That is a good question: should it be enforced or just recommended? Having device transfers be explicit is usually better (implicit transfers can make for hard-to-track-down performance issues), but perhaps not always.

Also, there are policies on hard/soft device placement which allow TensorFlow to override user-specified placement in cases where the placement is infeasible or sub-optimal. One can further imagine dynamic placement scenarios in the case of async execution.

Interesting, I'm not familiar with this hard/soft distinction, will look at the TF docs.

In addition, outputs may sometimes need to reside on a different device than the inputs. Examples often involve operations on metadata (shape, size) that typically resides on the host.

That should not be a problem if shape and size aren't arrays, but either custom objects or tuples/ints?

Device placement is generally a "policy" and I think we should leave it as a framework detail instead of having it in the API specification. I am not opposed to reserving a device property in the ndarray API, but I don't think we should put constraints on how the device placement should be done.

That may be a good idea. Would be great to discuss in more detail later today.
