device support #39
+1 to this. We will need to define an operation to explicitly move data among different devices, along with a canonical way of specifying the target device.
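Such an explicit transfer operation might be sketched as follows. This is purely illustrative: the `Device` and `Array` classes and the `to_device` method are hypothetical names standing in for whatever API the standard ends up defining, not any existing library's API.

```python
class Device:
    """Hypothetical canonical device specifier: a kind plus an index."""
    def __init__(self, kind, index=0):
        self.kind = kind    # e.g. "cpu" or "gpu"
        self.index = index  # disambiguates multiple devices of one kind

    def __repr__(self):
        return f"{self.kind}:{self.index}"

    def __eq__(self, other):
        return (self.kind, self.index) == (other.kind, other.index)


class Array:
    """Hypothetical array that records the device it lives on."""
    def __init__(self, data, device):
        self.data = data
        self.device = device

    def to_device(self, device):
        # Explicit transfer: copying only happens when the user asks for it.
        if device == self.device:
            return self  # no-op when already on the target device
        return Array(list(self.data), device)  # stands in for a real copy


x = Array([1.0, 2.0], Device("cpu"))
y = x.to_device(Device("gpu", 0))
print(y.device)  # -> gpu:0
```

The key design point is that transfers never happen implicitly: an operation on arrays from different devices would raise, and the user resolves it with an explicit `to_device` call.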
**Goals**

It is not uncommon nowadays to get a laptop that has a CPU, an integrated GPU and a discrete GPU. The device selection specification should allow selecting among the subset of supported devices, while ensuring that each computational device can be unambiguously referenced. Moreover, the specification should allow selecting among several graphics cards, or more generally accelerators, of the same kind (say among multiple GPUs from the same vendor).

**Statements about offloading**

It is worth noting that a device really refers to a tuple of (hardware, driver): an NVIDIA card can be programmed using either CUDA or OpenCL, an AMD card using either ROCm or OpenCL, an Intel card using either Level Zero or OpenCL, etc. It may not be very important for a Python user to be able to select among available drivers dynamically; it is reasonable for an array implementation to select a driver for each device at module initialization.

To work with a device, the associated runtime keeps a structure that imposes order on the sequence of tasks to be offloaded for asynchronous execution on the device. CUDA provides streams for this purpose, and SYCL provides queues. Ultimately, the user's selection of a device must allow the array library to locate a queue/stream to submit tasks to, and for optimal performance streams/queues should be reused once created. Reusing a SYCL queue has the additional benefit of keeping track of data dependencies while executing concurrently. A Python user wishing to offload a computation to a GPU should be aware of the stream/queue, but for most users these will be created once per device by the array implementation and reused throughout the session. Power users may want to explore the use of multiple streams, or define dependencies among SYCL tasks.

**Toward the proposal**

An array library targeting devices needs to know the device an array was allocated on. In the case of SYCL and OpenCL it also needs to know the context to which the memory was bound, for example to be able to free the memory. In the case of OpenCL, one can query the context associated with a memory object. However, when using SYCL 2020's USM pointers the context cannot be easily looked up, so the array library must either store it with each array object instance, or store it in a global structure associated with the device, with all operations involving the device using that context, as is the case with PyTorch.

Thus, to address a device in an array library, a user needs to specify a library-specific layer where the library stores backend-specific objects (context, queue/stream, etc.) associated with supported devices (the backend here being the software layer used to work with devices, e.g. OpenCL, CUDA, SYCL), as well as an identifier of a device within the backend. A library may choose to query its backend for all addressable devices, store them in an array, and refer to devices by their position therein. Such positions should be deterministic between different Python sessions.

Considering an array library that uses the SYCL runtime as its backend, devices can be further differentiated by kind: GPU devices, accelerator devices, CPU devices, custom devices. It would be appropriate for the array library to keep a separate array of addressable devices per kind. Thus the device specification emerges to be a triple ('backend', 'device_kind', relative_id). Since backends vary between array library implementations, the array library must be able to provide a sensible default backend; portable code will then only use (device_kind, relative_id).

An array object should carry, or should be able to figure out, the device it was allocated on (e.g. by exposing a device attribute). Computations on device arrays must be submitted to a stream/queue. Implementation kernels must be able to access the data, so the queue must be associated with the same context the data were bound to. The user should be able to specify the stream/queue where kernels are to be submitted, as is the case with native libraries (cf. `cublasSetStream` in cuBLAS). Ideally this should be seldom needed, so an optional keyword seems heavyweight but could be a solution, with defaults driven by device array inputs and the default queue settings of the library, controllable via a context and/or an explicit function.
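The ('backend', 'device_kind', relative_id) addressing scheme described above could be sketched as follows. The backend names, device lists, and function names here are made up for illustration; the point is only that portable code drops the backend component and relies on a library default.

```python
# Hypothetical per-(backend, kind) lists of addressable devices, in a
# deterministic order so relative IDs are stable across sessions.
ADDRESSABLE = {
    ("sycl", "gpu"): ["integrated-gpu", "discrete-gpu"],
    ("sycl", "cpu"): ["host-cpu"],
}

def lookup(backend, kind, relative_id):
    """Resolve the full ('backend', 'device_kind', relative_id) triple."""
    return ADDRESSABLE[(backend, kind)][relative_id]

# Portable code omits the backend and uses the library's default:
DEFAULT_BACKEND = "sycl"

def lookup_portable(kind, relative_id):
    return lookup(DEFAULT_BACKEND, kind, relative_id)

print(lookup_portable("gpu", 1))  # -> discrete-gpu
```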
I agree, this is what libraries currently do so there's no need for something more complex.
Stream management seems like something that is done at the implementation level, and does not show up in a Python level API.
I'm not sure this is right. It's of course very helpful to understand what's happening under the hood; it can allow users to write more performant code. But it's not strictly necessary, and the vast majority of Python end users will actually not understand this while happily writing ML code to run on a GPU, for example. Note that there typically are ways to control this from Python (e.g., for PyTorch, via the stream objects and context manager under `torch.cuda`).
This doesn't follow from what you wrote before I think. An array library needs to deal with this (e.g., as you wrote, "store it with each array object instance"), but that should be an implementation detail.
Device IDs should already be deterministic, right?
Agreed it needs kind + ID. No need for it to be relative though I'd think.
Agree with all of this.
All true but out of scope I'd say, it's per-library and may be implementation-specific.
Even though existing libraries do offer APIs for stream control, they're not often used, and it's not clear that we can point at anything as best practice / the right thing to adopt. I'd say we should put this out of scope. Mixing multiple libraries plus stream control is immature; I think we also had a discussion around the array interchange protocol here (which doesn't contain stream info).
**Some thoughts on API**

Syntax:
Behaviour:
Out of scope:
To add or not add a context manager: a context manager for controlling the default device is present in all libraries except NumPy. Concerns are:
The main upside would probably be that since most libraries have context managers now, it'd be nice for their users to get a context manager in this API standard as well - easier to migrate already-written code.
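For illustration, a minimal default-device context manager might look like the sketch below. All names are hypothetical; it assumes a thread-local default (rather than process-global state) so that concurrent threads don't trample each other's setting, which is one way to soften the global-state concern raised above.

```python
import threading
from contextlib import contextmanager

# Thread-local storage for the current default device.
_state = threading.local()

def get_default_device():
    # Fall back to "cpu" when no context manager is active.
    return getattr(_state, "device", "cpu")

@contextmanager
def default_device(device):
    """Temporarily set the default device for this thread."""
    previous = get_default_device()
    _state.device = device
    try:
        yield
    finally:
        # Restore the previous default even if the body raises.
        _state.device = previous


with default_device("gpu:0"):
    print(get_default_device())  # -> gpu:0
print(get_default_device())      # -> cpu
```

Note this only addresses thread safety, not the broader objection that any ambient default makes library calls behave differently depending on caller context.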
I looked at SYCL some more, since I was only very vaguely aware of what it is. @oleksandr-pavlyk please correct me if I'm wrong, but as far as I can tell it's orthogonal to anything a Python user or array-consuming library author would need to know about. They need to know about the actual hardware they can use to execute their code, but (beyond install headaches) won't really care whether it's CUDA/OpenCL/ROCm/SYCL/whatever under the hood.
With SYCL, one writes a kernel once, compiles it with a SYCL compiler to an IR, and can then submit it to different queues targeting different devices (CPU, GPU, FPGA, etc.). This example constructs a Python extension, compiled with Intel's DPC++ compiler, to compute column-wise sums of an array. Running it on CPU or GPU is a matter of changing the queue the work is submitted to:

```python
with dpctl.device_context('opencl:gpu'):
    print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
    print(sb.columnwise_total(X))

with dpctl.device_context('opencl:cpu'):
    print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
    print(sb.columnwise_total(X))
```

An array-consuming library author need not be aware of this, I thought, just as they need not be aware of which array implementation is powering the application.
Some thoughts and clarifications:

TensorFlow's ndarray.shape returning an ndarray is a behavior that will be rolled back. Tensor.shape's behavior is to return a TensorShape object, which can represent incomplete shapes as well, and that will carry over to ndarray.

It is not clear why device needs to be part of the array creation APIs. Context managers can mutate global state representing the current device, which the runtime can then use for these calls. Device-setting code would only execute when the context manager is entered or exited. Passing a device per call can be an unnecessary cost; it may also force the current device to be known, which can be hard for generic library code and may require querying the existing device from somewhere.

Also, I am not sure we should enforce constraints on where inputs and outputs can be placed for an operation. Such constraints can make it harder to write portable library code where you don't control the inputs and may have to start by copying all inputs to the same device. The TensorFlow runtime is allowed to copy inputs to the correct device if needed. There are also policies on hard/soft device placement which allow TensorFlow to override user-specified placement in cases where the placement is infeasible or sub-optimal. One can further imagine dynamic placement scenarios with async execution. In addition, outputs may sometimes need to reside on a different device than the inputs; examples often involve operations on metadata (shape, size) that typically resides on the host.

Device placement is generally a "policy" and I think we should leave it as a framework detail instead of putting it in the API specification. I am not opposed to reserving a device property in the ndarray API, but I don't think we should put constraints on how device placement should be done.
That's good to know. In that case I'll remove the note on that, no point in mentioning it if it's being phased out.
The "mutating global state" points at the exact problem with context managers. Having global state generally makes it harder to write correct code. For the person writing that code it may be fine to keep that all in their head, but it affects any library call that gets invoked. Which is probably still fine in single-device situations (e.g. switch between CPU and one GPU), but beyond that it gets tricky. The consensus of our conversation in September was that a context manager isn't always enough, and that the PyTorch model was more powerful. That still left open whether we should also add a context manager though.
Re cost - do you mean cost in verbosity? Passing through a keyword shouldn't have significant performance cost.
I think the typical pattern would be to either use the default, or obtain it from the local context. E.g.
And only in more complex situations would the actual device need to be known explicitly.
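A sketch of that "obtain it from the local context" pattern: generic library code allocates its output on the device of its inputs instead of naming a device explicitly. The mock array and namespace classes below are stand-ins for a real array library; the `device=` keyword on the creation function is the API under discussion, not an existing function.

```python
class MockArray:
    """Stand-in for an array that knows its shape and device."""
    def __init__(self, shape, device):
        self.shape = shape
        self.device = device


class MockNamespace:
    """Stand-in for an array namespace with device-aware creation."""
    @staticmethod
    def empty(shape, device=None):
        return MockArray(shape, device)


def scale(xp, x):
    # Generic library code: the output lands on the input's device,
    # so no explicit device ever needs to be named here.
    out = xp.empty(x.shape, device=x.device)
    return out


x = MockArray((3,), "gpu:0")
y = scale(MockNamespace, x)
print(y.device)  # -> gpu:0
```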
That is a good question, should it be enforced or just recommended? Having device transfers be explicit is usually better (implicit transfers can make for hard-to-track-down performance issues), but perhaps not always.
Interesting, I'm not familiar with this hard/soft distinction, will look at the TF docs.
That should not be a problem if shape and size aren't arrays, but either custom objects or tuples/ints?
That may be a good idea. Would be great to discuss in more detail later today.
For array creation functions, device support will be needed, unless we intend to only support operations on the default device. Otherwise, what will happen in any function that creates a new array (e.g. creating the output array with `empty()` before filling it with the results of some computation) is that the new array will be on the default device, and an exception will be raised if an input array is on a non-default device.

We discussed this in the Aug 27th call, and the preference was to do something PyTorch-like, perhaps a simplified version to start with (we may not need the context manager part), as the most robust option. Summary of some points that were made:
- TensorFlow's `.shape` attribute is also a tensor, and that interacts badly with its context manager approach to specifying devices, because metadata like `.shape` typically should live on the host, not on an accelerator.
- a PyTorch-style `device=` keyword
- JAX `pmap`s can be decorated to override the default placement. The difference with other libraries that use a context is that JAX is fairly (too) liberal about implicit device copies.

Links to the relevant docs for each library:
Next step should be to write up a proposal for something PyTorch-like.