cuda.txt
Good morning, can you hear me? We are not going to have the classic setup that we had in the classroom, but I hope it will work. So, good afternoon and good evening everyone. This is the first lecture of the Passion in Action course GPU101. Let me start by presenting myself: I am a third-year PhD student in Information Technology here at Politecnico, and since my background is not actually in computer science, my research mainly focuses on high performance computing techniques applied to the genomics domain.

First, some logistics about the course. To attend this Passion in Action it is recommended that you have at least some knowledge of basic programming; C and C++ work just fine. This is the basic requirement: we are going to write C code, so being able to write it is expected. It is not necessary to have a background in computer architecture or parallel programming. Of course it will be useful if you have it, but it is recommended, not mandatory. I came from a different background myself and still got through and am able to write CUDA code, so it is not mandatory to have attended courses like AXO or ACA.

About the evaluation: it is a sort of evaluation, meaning that if you want to have this course validated you need to submit the project. Remember that the credits you obtain by attending the course are not valid for graduation, so beware of that: this is an extracurricular course, and nothing you do here qualifies for graduation. By the end of the course I would like you to submit a project. The goal is: you have a problem, I give you the C code baseline, and the idea is to port it to GPU and optimize the code to make it run faster using GPUs. I also have a tentative schedule for the course: the idea is to have all the projects submitted by the end of January, so that you have almost two months to do the project, and the presentations will be after the exam session, so that you have time to take all your exams; we will meet again in February. More information will be given later, and by this I mean I will assign you the project later in the course, after we have done at least the majority of the lectures.

I will try to stream all the lectures; it depends on the internet that we have here, sometimes it works, sometimes not. I will also try to record all the lectures and make them available in the folder that I will share with you via email. I will upload them as soon as possible, so give me some time to gather the material. This is my email, you already have it. You can contact me for anything course-related by using the GPU101 tag at the beginning of the email subject; this helps me understand that the email comes from someone attending the PIA, and I will try to answer as soon as possible. This reminds me that I was not recording, so let me start.

Since this is a GPU course, you will of course need access to a GPU. There are two main routes, depending on whether you already have one in your PC. If you have it, you will have all the instructions to download and install CUDA, the toolkit together with the compiler. On Windows you can install WSL2 and then install CUDA with the compiler on top of that.
Otherwise, on Linux, you can directly download CUDA and use it as it is. If you don't have access to a GPU, you can use Google Colaboratory. In the repository you will find all the instructions to enable the compiler through the Jupyter notebook that you have in Google Colab. What you get is essentially a terminal inside a cell of the notebook, so you can upload the code to Drive, compile it, and then execute it on the Colab GPUs. All the instructions are in that repository.

Last piece of info: I updated the calendar of the course with respect to what was announced on the website. All the lectures will be on Thursday, no longer on Monday, always in this room. Questions so far? Maybe something that I missed, or questions about the logistics of the course?

So, first of all, why is it necessary to offer a course on CUDA programming, on GPU programming? We are in the so-called big data wave: all the applications and all the knowledge that we have come from data. OpenAI and all the big players in the AI field draw their true power from data and from the fact that we are able to process it and extract information. There are a lot of applications that benefit from fast data processing. For example, in financial analysis one application is high-frequency trading: we have a lot of data that needs to be processed very quickly in order to adjust stock prices and so on. Scientific simulations also involve very large amounts of data. Think about the simulations that were run when scientists were trying to identify a COVID vaccine: all this data related to molecules and atoms had to be combined in order to find the proper interactions between molecules that allow you to create a vaccine. This also means that all this processing must be done in a very short amount of time; it's critical. Imagine we didn't have the computing power necessary to explore all the possible interactions between the molecules, then we wouldn't have a vaccine. These are just some examples where data plays a huge role and computing power is fundamental to extract knowledge.

The problem is that we are reaching the so-called end of Moore's law. Moore's law is the observation that the number of transistors in a processor was doubling roughly every two years without an increase in cost. So we were able to add computing power to processors without increasing the cost, and increasing the transistor count translated into increasing the computing power. The problem is that we are not able to do this anymore: we are shrinking transistors so much that we are hitting physical limits. We cannot fit more transistors, and so the computing power of our systems is limited. What can we do? There are two ways to overcome this performance problem. The first one is the use of new technologies, for example quantum computing. It's very promising, a new technology that can give you more performance out of your system.
But the problem is that it's not really feasible right now. So the other way to overcome the performance issues that we have with classical CPUs is the creation of so-called domain-specialized systems. The idea is this: we have an application, we profile it, and we identify the portion of the program that takes, say, 90% of the execution time. We identify the bottleneck, the small portion of our application that accounts for most of the actual execution time. Then we can take that code and create a specialized accelerator that is integrated into our system, and this allows us to overcome the performance problem. For example, say we have an application that runs for 100 seconds, with a bottleneck that accounts for 70 seconds, the majority of the execution time. We create a domain-specialized system that implements the 70-second task and reduces it to, say, 10 seconds. What we get is a heterogeneous system: our application keeps running on the CPU, the part that takes 30 seconds, while the 10-second part runs on an external device, and the entire application now takes 30 plus 10 seconds in total. This is the idea behind domain-specialized systems.

Which type of device can we use to implement these systems? We can use heterogeneous architectures in general, for example FPGAs, but we can also use GPUs: they are more common and have several benefits. Which are the benefits of GPU-based processing? First, GPUs provide massive parallelism. There are thousands of cores, thousands of minimal compute units, that all carry out the same task in parallel, cutting the overall execution time by a certain factor. They perform many calculations simultaneously, and this is done by exploiting parallelism. There are several paradigms for parallelism; the most common one for GPUs is single instruction multiple data: a single instruction is executed by all the cores simultaneously on different pieces of data. For example, in a matrix multiplication we take the elements of the two matrices we are multiplying, perform the operation and store the result; this is broken down into smaller instructions and mapped onto the cores depending on the data we are processing.

Because of the massive parallelism that we have in GPUs, we are able to obtain high throughput. We can measure throughput as the number of results produced in a given amount of time, and thanks to the parallelism we can reach a very high throughput. High throughput also means reduced execution time: we can achieve so-called real-time processing by breaking the computation down into smaller pieces and parallelizing it. The key is the parallelization of the task; we can reduce the computation time from, hypothetically, hours to minutes.
Modern systems are also not equipped with a single GPU like our normal PCs; they are often multi-GPU systems, where each CPU is connected to several devices. They all work in parallel, so we benefit from an added level of parallelism, which increases the compute power of the entire system. Consider also that modern supercomputers are not made of a single node, they have thousands of them, so we have tens of thousands of GPUs in a modern supercomputer. Lastly, another crucial point is the CUDA ecosystem and its support. CUDA is more than a framework, as we will see later in the lecture: it also provides a programming model and support that is very helpful for us programmers who want to exploit GPUs, giving us a way to interact at a very low level with all the GPU components.

In order to do so, we need to understand how a GPU is built. This is a scheme of the A100 GPU by NVIDIA. It's not the latest, but it's a nice example of what we can find on a GPU. Questions so far? Am I going too fast or too slow? If anything is unclear, just raise your hand and interrupt me.

Question: what's the difference between having two GPUs and one GPU with double the number of cores, with respect to single instruction multiple data? Single instruction multiple data works within the single GPU. As for the difference between two smaller GPUs and one big GPU: it depends on the workload. Some parallelism patterns benefit from fewer resources spread across different devices, whereas for other workloads it might be better to have one single giant chip on which you have full parallelization. For example, if a workload leaves a big GPU underutilized, it could be better to switch to two devices and parallelize the computation across the two smaller ones, because they can reach better performance. It depends on the case, and a lot of times on the application.

Okay, so, the device structure. As I was saying, this is the high-level overview of the A100. The first component is the classical RAM. Remember that the GPU is separate from the CPU: they are two distinct devices, and we need something that allows them to communicate, and this component is the RAM. In the most recent GPUs it is HBM, a type of memory designed for high-performance, memory-intensive tasks. As the name suggests, it provides high bandwidth, so you can transfer big chunks of data through the memory channels, and it is the main component that allows you to communicate with the CPU. If your data is on the CPU, the first thing you do before any kind of computation on the GPU is to copy the data into the GPU's HBM, the RAM of the device. At that point the GPU cores are able to access the data in the HBM. In the same way, to go back to the CPU, we store the result in the HBM and then the CPU is able to access it. Remember that the memory spaces are very distinct; this is going to be key later when we write the programs. The HBM is very abundant in terms of capacity.
For example, there are GPUs with 40 GB of HBM, and I've seen GPUs with 128 GB, so it truly depends on the device. It is the largest memory resource that we have on the device, but it is also the slowest: when I access data stored in HBM, the time required is much higher than using the GPU caches. In this sense, the first type of cache is the L2 cache. It is a critical part of a GPU and it stores intermediate results when we access the HBM. So there is a cache hierarchy, which you may or may not have seen in computer architecture courses, and the L2 is an intermediate level between the HBM, the off-chip memory, and the local caches.

Local caches belong to streaming multiprocessors. Streaming multiprocessors are the fundamental computational units of the GPU and they are the components containing the GPU cores: cores and local caches are all contained in the so-called streaming multiprocessors. Streaming multiprocessors also play a huge role in scheduling the computation: we cannot have a piece of computation split across two streaming multiprocessors. This is mostly transparent to the user, but it is important to know that when a piece of computation of our program is carried out, it physically resides in a single streaming multiprocessor.

What do we have inside a streaming multiprocessor? We have the CUDA cores, part of the thousands of cores that parallelize the computation. They are also able to perform floating-point operations; just think about neural networks, which do not work on integer values, we need floating-point values. Still speaking of neural networks, we also have tensor cores, a type of specialized unit residing in the streaming multiprocessors. Tensor cores are fundamental components used to perform matrix multiplications. For example, when we perform convolutions between the layers of a neural network, all these matrix-multiply operations are broken down into smaller pieces and each tensor core carries out a specific portion of the entire computation. There are also other specialized units for shading and ray tracing, and even more in the most recent GPUs; you can find them and use them depending on your application.

This was the computational part of the streaming multiprocessor. Regarding memory, instead, we have the L1 cache, which is used together with the L2 when writing and storing results. For example, when I have an intermediate result of my computation, the GPU writes it into the L1 before committing it to the L2 and then to the HBM; maybe it can be reused, and this way I have the value cached with a shorter access time compared to going back to the HBM. Then we have registers. They are the fastest form of memory in the streaming multiprocessor, they are fundamental for storing intermediate results, even more than the L1 cache, and they are local to threads. We will see what threads are later.

How can we program GPUs? It's nice to have a lot of cores and a lot of specialized units, but if we cannot program them, if we cannot use them, they are pointless. So we have CUDA; this course is of course focused on NVIDIA GPUs.
We will see, I think in the last lecture, other frameworks that allow you to program GPUs and that are not vendor-specific, but in this course we focus on CUDA. CUDA was introduced in 2006 with the NVIDIA Tesla architecture, so it's been around for a while. It exposes a C++-like language that allows you to write CUDA programs that run on the GPU, and the GPU is going to be our accelerator. The nice thing about CUDA is that it gives you fine-grained management of the GPU resources: ideally you can program every single component of the GPU, depending on what CUDA exposes to the programmer. More generally, it is a platform designed jointly at the software and hardware level to use GPUs for general-purpose computing. We are not focusing on graphics applications; you can do them, gaming makes a very big use of GPUs of course, GPUs were born from that, they are graphics processing units. But we can do more than graphics, and more than machine learning, and we can program the devices at a finer level.

CUDA allows you to program the GPU with minimal extensions to enable heterogeneous programming and attain an efficient, scalable execution. You write your C code, port it to CUDA, and the syntax used by a CUDA program is very similar to C. CUDA also offers a driver that allows device management: there are APIs exposed to the programmer to manage the device. For example, I want to know if the device is used by someone, there is an API for that; I want to know how much memory, how much HBM my device has, because maybe it's the first time I use it, and there is an API dedicated to that. CUDA also exposes GPU parallelism for general-purpose computing via a number of multiprocessors endowed with cores and a memory hierarchy. This means that CUDA offers a programming model, an architectural model and a memory model that allow you to program the device and extract the most parallelism out of your application. Parallelism is the key to a high-performance application. The architectural model is based on many processing cores grouped in multiprocessors that share a SIMT control unit; we will see what SIMT is, it's close to SIMD but not the same. The programming model is based on massive data parallelism and fine-grained parallelism: you can divide the task, but you can also control the resources used to carry it out. And it is scalable: the same code can execute on a different number of cores. We also have a very specific memory model that exposes caches to the programmer, so we can use the caches and tune how we use them.

Lastly, there are two main concepts in CUDA used for versioning. First, the compute capability. The compute capability is a number that defines, depending on the generation of your device, the features your device has in terms of computing. The major number identifies the architecture, say Volta, Turing, Hopper or Blackwell, the most recent. Then we have the minor number, which is the revision within the generation. For example, for the A100 the compute capability is 8.0, while a smaller GPU like the A5000 has compute capability 8.6: the generation of the two GPUs is the same, but they differ. The 0 in the A100 says it is the data center part; the 6 tells me it is a revision, probably smaller than the data center version. Then we have the CUDA version, which is simply the revision of CUDA itself. This is done for compatibility and retro-compatibility, and each version of CUDA introduces new features, so if your GPU supports the latest version of CUDA you can probably do something fancier than with the previous version. Mostly for retro-compatibility, CUDA also includes software features that are independent of the hardware generation: there are APIs that can be used with devices of older compute capabilities even now that we are at 9 or 10. These building blocks remain across all the revisions of CUDA. Any questions so far? Before moving on, a small sketch of how you can query these properties from code is shown below.
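Here is the sketch I mentioned: a minimal example, assuming the CUDA runtime API and a single visible NVIDIA GPU (device 0), that queries the properties we just discussed. The exact fields printed are my choice, not something from the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);            // how many CUDA devices are visible
        printf("CUDA devices found: %d\n", count);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);     // query device 0
        printf("Name: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Global memory (HBM): %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        return 0;
    }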
So, we said that we have an architectural model, a programming model and a memory model. Let's start with the programming model. The programming model is what we use to expose parallelism in the application and implement it on the GPU; its goal is to abstract the complexity of the GPU hardware, and this allows all programmers to exploit GPU computing power. As I said before, CUDA operates in a heterogeneous computing environment, because we have two processors that run programs together. On one side we have the host application, the one running on the CPU, which is paired with the device, the GPU (in CUDA it is called the device), which computes some kind of program when invoked by the host. So the CPU calls the GPU, not the other way around; the correct term is that we offload the computation from the CPU to the GPU, and the GPU depends on the CPU. The parallel task delegated by the host, the function that the device executes, is called a kernel. The kernel is the function that is executed by the GPU; in CUDA syntax it is a function, and it is called a kernel. Everything good so far?

To program a GPU efficiently we have a programming model, and the core of the programming model is the thread. The GPU thread is the smallest unit of parallel execution in CUDA. Each thread executes the kernel code independently from the others, with its own data and its own registers. Registers are the smallest memory unit that we have on streaming multiprocessors, and each thread has a set of dedicated registers that are accessed only by that thread. If I'm a thread and I have 10 registers, I can access those 10 registers and all the other threads cannot access them; let's say they are private (sometimes you can exchange the data, but my registers are mine). We said that the key is to exploit parallelism, and threads run in parallel. How are the threads organized? Threads are organized in blocks. Blocks are groups of threads.
Threads in a block execute concurrently on a streaming multiprocessor. What I was saying before, that pieces of computation cannot be split across streaming multiprocessors, means that each block, once assigned to a streaming multiprocessor, is executed on that streaming multiprocessor until the end of its computation. Blocks can communicate with each other using, for example, data in HBM. Blocks can be mono-dimensional, bi-dimensional or three-dimensional. This means that if we have, for example, matrix computations, we can map the threads onto the cells of the matrix. Suppose we have a 3-by-4 matrix and we want to add 5 to each element. What we can do is map a thread to each cell: each thread adds 5 to the element it has been assigned, and the computation is performed. If the operation takes, say, 5 clock cycles, the whole thing takes about 5 clock cycles, because all the threads work in parallel; done sequentially, it would take 5 clock cycles times the number of elements.

We actually have a more fine-grained grouping of threads within blocks, because the GPU, when spawning threads, does not spawn single threads independent of each other: they are grouped in warps. A warp is a group of 32 threads, and these 32 threads execute the same instruction at the same time, in lockstep. When a block contains 32 threads, that block consists of a single warp, all the threads in the warp can communicate by exchanging data, and we can synchronize the computation within the warp. The operation we are doing is carried out by 32 threads simultaneously, and this is the foundation of the SIMT programming paradigm: a single instruction computed by multiple threads, where multiple means 32. The number 32 is specific to NVIDIA GPUs; AMD GPUs, for example, use 64 threads, and it's not called a warp but, I think, a wavefront, but the idea is the same. Each thread operates on its own data element.

What about a scenario where we have, for example, 34 threads per block? What does the GPU do? Under the hood it spawns one warp plus another warp, but only the first two threads of the second warp are active. We won't see the others, we cannot program them, but the GPU schedules two warps, because of the SIMT execution model: each instruction is executed by 32 threads in lockstep. This also means that if we have two warps, we can interleave the execution of their instructions: the first warp does instruction one, then the second warp does the same instruction, depending on how the instructions are issued by the scheduler. Going back to the 3-by-4 matrix from before, a small sketch of such a kernel is shown below.
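As promised, a minimal sketch of the "add 5 to every cell" kernel, with one thread per cell. The kernel name, the launch configuration and the guard are illustrative assumptions, not code from the slides.

    // One thread per matrix cell; each thread adds 5 to its own element.
    __global__ void add5(int *m, int rows, int cols) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < rows && col < cols)          // guard threads that fall outside the matrix
            m[row * cols + col] += 5;
    }

    // Possible launch for a 3-by-4 matrix: a single 4x3 block covers all 12 cells.
    //   dim3 threadsPerBlock(4, 3);
    //   dim3 blocksPerGrid(1, 1);
    //   add5<<<blocksPerGrid, threadsPerBlock>>>(d_m, 3, 4);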
Question: is it possible for different warps within a block, within a thread block, to execute different instructions, or are the instructions the same at the block level? Threads within the same warp are doing the same thing. Different warps in the same block, I don't think they can do completely different stuff; the idea of warps is the physical level of scheduling. I don't think you can do that, but I need to check. Can we directly interact with warps? You can: there are the so-called warp intrinsics, instructions that allow you to manage the single warp. It depends on what you mean by doing different things. For example, you can execute an instruction only with a single warp while the other warp does nothing; that depends on the task you are doing. And you can manage the single warp: there are instructions that allow you, for example, to reduce data. Say we want to do a parallel reduction across the data that we have in a warp, there is an instruction for that. You can also synchronize, so put a barrier at the warp level: say I want to do a reduction in one warp while the other warp does nothing, I need a synchronization point at a certain point, and you can do that at the warp level, yes.

Okay, one more thing about warps. Question: if I have 34 threads, so only two threads in the second warp, are the other threads of that warp activated? They are idle, which means they do nothing. When you query the block size through the built-in variables, you get 34, because you scheduled 34 threads per block; the others are idle, kind of deactivated. They are there but they perform no computation, they just wait for the warp to complete. Question: if I ask for 34 threads, can I, at runtime, while the GPU is running, grow those two extra threads into another full 32, without launching another kernel? No, you cannot do that. So it's fixed? Yes: the topology of blocks and threads is defined at the kernel launch, where you specify the number of blocks and the number of threads. That can be computed at runtime on the CPU, while the application is running, but once launched the configuration is fixed.

So what's the point of asking for 34? Wouldn't it be better to ask for 64? It depends. Sometimes you need those two extra threads because of the parallelism pattern of your application. For example, say we have a 34-by-34 matrix and we want to add 5 to all the elements: you need those two extra threads, otherwise you won't be able to do the computation correctly, or you have to restructure it. Another thing you can do is profile the application and see whether it is more convenient to have two warps, so 34 threads, or 32 threads that repeat the computation one more time. We will see that; it depends on the application and on the parallelism that you want to expose. In the 34-by-34 case, if you want to theoretically achieve the lowest execution time, you may need those two extra threads. Sometimes it's not the best choice, because a lot of them sit idle; sometimes you need them because of the parallel pattern that you have. So it depends on the case, and a lot depends on the application. Speaking of warp intrinsics, a small sketch of a warp-level reduction is shown below.
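Here is the warp-intrinsic sketch: a hypothetical kernel where each warp sums the 32 values held by its threads using __shfl_down_sync, so the threads exchange data directly without going through memory. Names and layout are mine, just to show the idea.

    // Each warp reduces its 32 input values; lane 0 writes one partial sum per warp.
    __global__ void warpSum(const int *in, int *out) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int val = in[gid];

        // Tree reduction inside the warp: at each step a thread adds the value
        // held by the thread 'offset' lanes above it.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);

        if ((threadIdx.x & 31) == 0)           // lane 0 of each warp holds the warp's sum
            out[gid / 32] = val;
    }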
The other thing, still on warps, and this is the last thing I'm going to say about them: since all the threads in a warp execute the same instruction, what about branches? They are not ideal. You can have branches, but they require more time compared to running the same instruction on all threads, because the two branches are serialized. The if branch is executed by a certain number of threads, and then the else branch is executed by the other threads; the if branch and the else branch are not overlapped in the execution. Sometimes the compiler can do some magic and interleave the instructions, but they are never completely overlapped, because the threads of a warp cannot execute different instructions at the same time. Question: how does instruction fetch happen here? Since all the threads within a warp execute the same instruction, I imagine a conditional branch has to have a global outcome, at least at the level of the warp or block? I'm not sure about the question, maybe we can talk about it later, but most of this is optimized by the compiler itself, including the interleaving of instructions. What you can do at the program level is to find a way for all the threads in the warp to compute the same instruction, so that they fetch the same instruction and perform it on different data. For example, when we have some kind of divergence we can use masks on threads: we nullify intermediate results by multiplying them by zero while others are multiplied by one, so that we can combine the results while all the threads follow the same code structure. This is just an example, the first thing that came to my mind; a small sketch of a divergent kernel and its masked version follows at the end of this part.

Then, the blocks, in the same way as the threads, are organized in a more complex structure, the grid. Again, the grid can be mono-dimensional, bi-dimensional or three-dimensional. We can have a grid of 2-by-3 blocks, and, as with the threads, all the blocks are ordered, so we can identify each block by its coordinates in the grid; the same goes for the threads. Blocks in the grid are independent of each other, and there is no synchronization or communication between them. Actually, with the latest versions of CUDA you can do some fancy stuff and synchronize a lot of things, but we are not going to see that here. The grid in this example is bi-dimensional, and the grid is the core part of the kernel. In a standard execution, we have our host application running, some kind of preprocessing of the data, the data transfer to the device, then we launch the kernel, which spawns the blocks, and each block spawns its own threads that perform the computation; when the computation is complete, control returns to the host, and the host does some kind of post-processing depending on what we are interested in computing.
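This is the divergence sketch mentioned above: a hypothetical kernel where the two sides of the if are serialized within a warp, and an equivalent version using the masking idea so that all 32 threads execute the same instructions. Both kernels are illustrative, not taken from the slides.

    // Divergent version: inside a warp, the if side and the else side run one
    // after the other, with part of the threads masked off each time.
    __global__ void divergent(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (data[i] % 2 == 0) data[i] = data[i] * 2;
            else                  data[i] = data[i] + 1;
        }
    }

    // Masked version: every thread computes the same instructions and a 0/1 mask
    // selects which result is kept, as in the multiply-by-zero idea above.
    __global__ void masked(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int even = (data[i] % 2 == 0);     // 1 for the "if" case, 0 for the "else" case
            data[i] = even * (data[i] * 2) + (1 - even) * (data[i] + 1);
        }
    }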
The last important thing here: remember that host and device are asynchronous with respect to each other. Once I launch the kernel, if I do not wait for the GPU to complete the computation, the host keeps going and does other things, and we cannot yet get the result back from the GPU. As I said before, the global scheduler of the GPU dispatches the blocks onto the streaming multiprocessors. A block is never split across different SMs: a single block lies on a single SM. More blocks can lie on the same streaming multiprocessor, if the resources allow it, but a single block cannot be split. Once it is assigned to a streaming multiprocessor, it stays on that streaming multiprocessor until the end of its own computation, its own lifetime. Once it is assigned, we call that block active, and multiple blocks may be assigned to a single streaming multiprocessor.

The picture shows an example of a possible scenario. Say our CUDA program has 8 blocks, and suppose the GPU has two streaming multiprocessors and each block requires a full streaming multiprocessor, a one-to-one correspondence for ease of explanation. Block 0 goes to streaming multiprocessor 0 and block 1 to streaming multiprocessor 1, they are computed until the end, and then the next two blocks are dispatched. So the GPU is actually computing two blocks at a time. If we have more streaming multiprocessors, still assuming one block per streaming multiprocessor, we can compute more blocks in parallel: the greater the number of streaming multiprocessors, the greater the number of blocks processed in parallel.

Another important thing is that blocks are executed in any order, independently from each other. Blocks are independent; as I said before, there is no synchronization across blocks. So the piece of computation carried out by block 0 is not guaranteed to be the first one executed: instead of 0, 1, 2, 3, 4, 5, 6, 7 we might have block 7 and block 2 first. They are scheduled independently and, let's say, out of order. Blocks are scheduled only if there is enough space for them. Say each block requires 10 registers, and this is the limiting factor, but our streaming multiprocessor features 25. The GPU will schedule only two blocks per streaming multiprocessor, and that's it: the five extra registers can maybe be used by the blocks to store some data, but there is not enough space for another block, so that block is not scheduled. Examples of limiting resources: the maximum number of threads (more threads require more resources and may not allow you to schedule multiple blocks in parallel on the same streaming multiprocessor), the maximum number of blocks, and the number of registers, as I was saying.
For example, 24 registers allow you to schedule 2 blocks with 10 registers each. Shared memory is another limiting factor. Shared memory is the scratchpad memory that we have on the GPU, we will see it later; it is one of the memories exposed to the programmer through the CUDA memory model, there are APIs and instructions that allow you to program this type of cache, and the amount we reserve for each block limits the number of blocks that fit on the streaming multiprocessor. Scheduling is handled for us: the compiler determines the kernel's resource requirements, and a block is assigned to a streaming multiprocessor only if there are enough resources. Of course, the resources are reserved for the block for its entire execution: if a block requires 10 registers, those 10 registers are reserved for that specific block and not shared with other blocks. (A small sketch of how to query how many blocks of a kernel fit on one streaming multiprocessor is shown at the end of this part.)

As I was saying before, the streaming multiprocessor divides the block into warps, the groups of 32 threads. This is going to be a very repetitive point, but it is how GPUs are designed: warps execute instructions following the SIMT approach, and instructions coming from different warps are interleaved. As I was saying before about branches, the if branch and the else branch are not executed together, but we may have some interleaving of their instructions, and the compiler is able to optimize this kind of pattern. For example, if a warp is waiting for data, because maybe it is not in the cache and it is stored in HBM, we have to wait for the data to come back to the registers. What the GPU does is say: I'm not going to waste clock cycles waiting for this, I'll just schedule another instruction; maybe the other warp is waiting to do an addition, so while we wait for the data of the first warp we do the addition of the second one, hypothetically. If the block size is not a multiple of 32, inactive threads are added to the last warp; this was the 34-thread example, two warps of 32 with only two active threads in the second one.

So, instructions are scheduled per warp, and we can classify warps in terms of whether they can be executed. A selected warp is the warp currently being executed: we are executing instructions that are performed by its 32 threads. A stalled warp is a warp that is not ready to be executed, for example because it is waiting for data; I mention waiting for data because it is one of the operations with the longest waiting time, we will see that later. While a warp waits for its data, or for an operation to complete, we can overlap the execution with other warps, chosen among the eligible warps. The warp scheduler grabs instructions coming from different warps in order to keep the GPU busy while we wait, for example, for data to come back from HBM or wherever it is.
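Here is the occupancy sketch mentioned above: a small example, assuming the vector-addition kernel we will write shortly and a block size of 256, that asks the runtime how many such blocks can be resident on one streaming multiprocessor given the kernel's register and shared memory usage.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vsumKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        int blocksPerSM = 0;
        // How many 256-thread blocks of vsumKernel fit on one SM at the same time
        // (the last argument is the dynamic shared memory per block, here 0).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vsumKernel, 256, 0);
        printf("Resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }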
So, some final considerations on the execution model. At the application level, all the threads are concurrent, but in practice that is not what actually happens on the GPU. At the architecture level, blocks may be dispatched to different streaming multiprocessors at different instants; as I was saying, block 7 might be executed before block 0. The same is valid for warps: it is not guaranteed that the first warp executes before the second. The advantage is a simpler and more performant architecture, because we are continuously computing data, continuously executing instructions, and this keeps the GPU busy and attains a high throughput.

So the goal is to keep the GPU busy, and to do that we need to expose parallelism. There are two main levels of parallelism that we can exploit on GPUs. The first is task-level parallelism: we have different tasks that are independent from each other, so we can compute them in parallel. Each task might execute different instructions or work on different pieces of data, and they may or may not be related to each other. For example, in genomics we may want to perform an alignment operation, finding the similarity between pairs of strings. Different pairs are independent from each other, so we can process the pairs in parallel. The other level is data-level parallelism, common in SIMD, and in SIMT too, when we perform the same operation on different pieces of data. The classic example is vector addition: we want to sum two vectors, A and B, into a result vector C, and the addition is performed element by element. Each per-element addition is independent from the others, so we can do all the additions in parallel and reduce the entire computation time. Everything good so far?

Now that we have a basic knowledge of the programming model, let's move to CUDA programs. A CUDA program, at a high-level view, resembles the way the programming model is presented: a host application plus a kernel that is executed on the GPU with a certain degree of parallelism exploited through blocks and threads. So in the application code we have a sequential part computed on the CPU, the host application, and then the kernel, which exploits a certain level of parallelism in order to be computed efficiently on the GPU. We can run serial code on the GPU, yes, but it would be a waste: if we take the same piece of code and execute it on the CPU and on the GPU, the CPU might even be faster, because we are not taking advantage of the resources and the parallelism of the GPU, and that is a limitation. So how is the program executed? We have a specific flow involving the CPU and the GPU. First we have some kind of preprocessing, as before, and we have our kernel.
Before launching our kernel, we need to transfer the data from the CPU to the GPU, because, as I said, we have separate memory spaces. They are physically distinct as well: we do not have the CPU and the GPU on the same die, apart from some exceptions. So we need to move the data from the CPU to the GPU, into the HBM, which is our point of connection, for example through the PCIe bus. We move the data into the RAM of the GPU, and we do this by calling a dedicated API. It is very nice of CUDA to provide an API that takes care of the data transfers for us: we only need to provide the source memory location, the destination memory location, the direction of the transfer and the size of the data we are transferring. Once we have transferred the data, we can launch our kernel using a particular syntax to define the blocks and the threads. Our kernel does not have a classic C-like invocation: between the function name and the parentheses we have this extra syntax that we have to respect, indicating first the number of blocks and then the number of threads. So we specify at kernel launch the number of threads and the size of the grid. Question: is this the number of threads per block? Yes, this is the number of threads per block, and the other value is the total number of blocks. As I said, both the blocks and the grid can be multidimensional, and there is a data type dedicated to that; we will see it in a minute.

Then we have the computation, and we need to synchronize, because we have to make sure that the GPU has completed the computation; only then are we sure that all the results are in device memory and we can read them. Once we synchronize, we can transfer the results back; otherwise we are not sure to have the results, because GPU and CPU are asynchronous with respect to each other. Remember also, when measuring the execution time of the kernel, to take the time after the synchronization; otherwise it will look like way less than a millisecond, because that is only the time needed to launch the kernel. The processing is asynchronous, so you need to make sure the GPU has finished before taking the time and saying the GPU took, say, 10 seconds to complete.

This is the general execution flow. Now let's see how we can perform a vector addition, which is the classical first example in parallel programming. This is the vector addition in C-like syntax: we have our vsum function, which is a for loop doing the element-wise sum; in the main function we have our data and the function invocation. That's all that is on this slide, so we are on the same page. The idea, as I said before, is to exploit the fact that we are doing element-wise operations: a[0] plus b[0] gives me c[0], and the operation on index 0 is independent of the operation on index 1; a[1] plus b[1] gives me c[1], which is independent of c[0] and of c[2]. A sketch of this sequential baseline is shown below.
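A rough sketch of that sequential baseline (the function and variable names are my guesses, the slide may use slightly different ones):

    // Sequential element-wise sum of two vectors: c[i] = a[i] + b[i].
    void vsum(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];                // every iteration is independent of the others
    }

    int main() {
        const int n = 1024;
        float a[1024], b[1024], c[1024];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }   // dummy input data
        vsum(a, b, c, n);
        return 0;
    }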
So why don't we do that in parallel? We need to identify the parallel pattern. We see that the element-wise sum is independent across all the elements of the array: this is our parallel pattern. How can we exploit it? Any idea? How can we parallelize this on the GPU? Student: map every vector element to a different thread. Exactly, this is the simplest approach: we have independent operations that can be carried out by independent threads. The block size then depends on the resources that we have and on the size of the array, but let's say we map each single sum operation to one thread. This is the key aspect.

So how does this program change when we move to CUDA? Firstly, we need to include a header. We have our kernel function instead of the vsum function we had before. We have the data as before, and the data acquisition as before, so we read it from a file or let the user enter it manually, whatever. Then, what do we need to do? As I said before: allocate the device memory, transfer the data, launch the kernel and synchronize, move the data back, and free the memory. At the end we visualize the data, the same as before. The kernel function is our device code, and the rest is our host code: the main function is always executed on the CPU, and then we have our device function. A CUDA program is a C/C++ program with CUDA extensions.

We have some restrictions on the device code. The device code only accesses GPU memory: we cannot access, from the GPU, data that resides in the CPU RAM, since we have two distinct memory spaces. The kernel has a void return type. This is because we have two separate devices with two different memory spaces: the HBM is our only channel of communication with the host, so we have to put the result in the HBM and then the CPU reads the data from there; moreover, the kernel launch is asynchronous. We cannot have a variable number of parameters, we cannot have static variables, and we cannot have function pointers. The void return type restriction applies only to the kernels that we invoke from the host application: if we want to call another function on the GPU, from the GPU, that one can, for example, return an integer, because it is the GPU calling a function on the GPU. Question: but that function is not a kernel, right? It is part of the kernel, but it is not another kernel; it depends, it might be the case, actually.

So, let's see how the kernel is written. It has a void return type, the kernel's name, the pointers to memory as parameters, and then some extra syntax; let's see what it means. Remember that we are following single instruction multiple data, or multiple threads: SIMD, SIMT, SPMD, call it the way you prefer, here they all mean roughly the same thing. We have a kernel function that works on a single data element: the kernel function contains the code executed by one thread, and what we are going to do is parallelize the for loop that we had before. The for loop is replaced by a grid of threads, each one working on a single data element; this is what we said before, right? So, how do we map a thread to its single data element?
We need to use the indices of the threads. We said, for example, that we could map the sum of the index-0 elements of the arrays to thread 0; this is the logical way to do it, right? So we identify the thread by using the thread index: .x means the x coordinate of the thread, because blocks can be multidimensional. We have identified the thread, but we also need to identify the position that it has to sum. Going back to the example, let's say that we have blocks with two threads each. The first block, for the sake of simplicity, computes c[0] and c[1]. Then what about c[2]? We have block 0, and then block 1, which computes c[2] and c[3]. We need to work out that thread 0 in block 1 has to compute element 2, but the value 2 is not immediately available to us from either the block index or the thread index. What we can do is take advantage of the fact that we know, because we decided it, the number of threads each block has. So we can say: block 0 starts from 0; block 1 is shifted from the beginning by 2; block 2 is shifted by 4. The shift is the ID of the block multiplied by the block size. Make sense? This is exactly what we are doing here. Take thread 0 in block 1, which has to compute index 2 as before: the thread index is 0, the block index is 1, and blockDim is 2, so we get 1 times 2 plus 0, which is 2. So thread 0 in block 1 computes index 2. We have to tune the indices of our data structures in order to map them to the threads, or the other way around. Once we have identified our index, we only have to do the sum: as we said before, each thread sums only the elements corresponding to its assigned index, so a[2] plus b[2] gives us c[2].

Question: if I understand correctly, i essentially gets computed and saved to, realistically, a register of the single thread, while c, a and b would be in the scratchpad or in memory? I was going to say exactly that. This is a local variable, private to the thread, so it is probably going to be stored in a register or something like that; if I'm the thread, I am the only one who knows the value of this variable, because I'm the one using it. a and b are the inputs, and since we have different memory spaces, we are sure that a, b and c are pointers to device memory: this is the entry point of the kernel, and all the pointers that we have at the entry point of the kernel are pointers to device memory, to HBM. We need to gather the data that was sent by the CPU, and we have to make sure that we send the results back to the CPU, so we need a space dedicated to the results in the HBM and we write the result there. So we map the thread to its single data element, we read, we sum and we write the result back into the HBM. Putting it all together, the kernel looks roughly like the sketch below.
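Putting the index computation together, the kernel looks roughly like this (a sketch; the guard on n is an assumption for when the grid has a few extra threads):

    // Each thread computes one element: i lives in a register, a, b and c point to HBM.
    __global__ void vsumKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
        if (i < n)                                       // extra threads of the last block do nothing
            c[i] = a[i] + b[i];
    }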
blockIdx is the identifier of the block in the grid, blockDim is the size of the block, threadIdx is the ID of the thread within the block, and then we have gridDim, which is the size of the grid. This last one is an additional variable that I listed, but you can use it: maybe you want to do the same kind of pattern identification as we did for the vector and you need the dimensionality of the grid, or maybe you want to check whether you are at the border of the grid; sometimes it may be useful. blockIdx, blockDim, threadIdx and gridDim are built-in variables, and they are data structures with three fields, x, y and z, because we can have multidimensional blocks and grids. As we were saying, i is a variable local to the thread, and a, b and c are pointers to device memory, because they are the inputs and the result. Everything else may live on the GPU, and we are not sure it is going to be in the HBM; it might be in the scratchpad depending on how we write the code. Remember, host and device have separate memory spaces, which is why we need pointers to device memory. Local variables are private, while global memory is accessible by all the threads: all the threads have this pointer available, and in fact we can do this kind of access.

There is one additional thing, the qualifier that always goes before the return type. There are three qualifiers: __device__, __global__ and __host__. Why did I say before that we were sure this was the entry point of the kernel? Because of the __global__ qualifier: it identifies a function that is executed on the device but callable only from the host. All the kernels that we call from the host have the __global__ qualifier. Then we may have another function that is called from within the global kernel: for example, if we wanted a helper function doing the sum, we can have it. If we are sure that this function is going to be called only by the GPU, we use the __device__ qualifier, so in this case we would have __device__ int sum(int a, int b) { return a + b; }. In the same way, we can have functions that are called exclusively by the host, and for those there is the __host__ qualifier. These two are the same thing, but executed on different devices. Question: are they mutually exclusive, or can I have both? So, if I want the same function running on both the CPU and the GPU, what do I write? Well, the __host__ qualifier is not mandatory: a function without any qualifier is a host function, and the GPU cannot call it. If you have a function, for example the sum function, that can be called by both the host and the device, then you can put both qualifiers, __host__ and __device__, one next to the other, keeping the same syntax. The compiler actually does two passes, one to compile the host application and one to compile the device code, so if it finds both qualifiers it compiles the function twice. You can use them in combination; I don't think you can put the __global__ qualifier together with the other two. It is not physically the same function in the binary, it gets compiled twice. A small sketch of the three qualifiers is shown below.
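A small sketch of the three qualifiers (hypothetical helper functions, just to show the syntax and where each one can be called from):

    // Compiled twice by nvcc: callable from host code and from device code.
    __host__ __device__ float sum(float a, float b) { return a + b; }

    // Callable only from device code, for example from inside a kernel.
    __device__ float square(float x) { return x * x; }

    // The kernel itself: runs on the device, launched from the host, void return type.
    __global__ void vsumKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = sum(a[i], b[i]);
    }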
So first we need to allocate the memory. We know that they have different memory spaces; the device memory is separate from the host. So we need dedicated pointers that are used only to refer to the HBM. For example, in this case we have the d_ pointers that refer to device memory only. We can use them only in the cudaMallocs, that are mallocs on the device, and in the kernel code. These are the only two places where we can use the d_ pointers related to device memory. Otherwise the compiler will give you an error, so you know you are doing something wrong because the compiler is going to tell you what you are doing wrong. Of course, we need the pointer and, as in classical mallocs, the amount of memory in bytes that we want to allocate. Then we have to transfer the data, and this is going to be done explicitly. It's not like a C function call where you just pass the variables: since host and device are not in the same memory space, we need to move the data. To move the data we have the destination memory, the source memory, the size of the data that we want to transfer, and the direction. Okay? So, for example, this is going to be host to device: this means that we are transferring data toward the GPU. We can have, for example, device to host when we want to retrieve the results. And we can also have device to device, because, if we are interested, we can exchange data across GPUs. Okay? As I was saying before, we have this particular built-in structure that allows us to define a three-dimensional variable that specifies the number of blocks per grid. So this is going to be one-dimensional, because this is one. And the threads per block, for example, here is 256, again one-dimensional. This means that we are dividing the workload, that is n, the size of the array, in blocks of 256 elements. So the size was, yes, 1024, but we are going to divide it in groups of 256, and each block is going to process those 256 elements. Okay, then we have the kernel call. We have this syntax here with the triple less-than, greater-than to launch the kernel using the configuration that we have here. As I said before, the function parameters are the pointers to device memory. Of course I can also send data by copy: for example, if I want to transfer an integer to the GPU, I can do that without using any pointer; by copy I can move the data. But I cannot do anything by pointer or reference: for arrays and matrices I need space allocated on the GPU. All good? Yeah. Okay. Another way of writing the same thing: if we have 1D kernel launches, we can specify only the first dimension without using the dim3 data structure. Okay, once we have the kernel launched, we need to synchronize first and then we can do the memcpy. In this case, we can do the memcpy without synchronizing because the memcpy is blocking. This means that once we call it, all the instructions to the GPU go into a queue, a queue of instructions that the GPU is going to execute. The memcopies are blocking for the host. This means that the host will wait for the completion of the instructions: the memcpy is going to happen only when the GPU completes, because the transfer is from the device. It's in the queue for the GPU, but the GPU is still computing, so the GPU cannot perform this instruction yet.
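Putting the host-side steps together, a minimal sketch assuming n = 1024 floats, host arrays h_a, h_b and h_c already allocated and filled, and the vectorAdd kernel sketched earlier (the variable names are placeholders of mine):

    float *d_a, *d_b, *d_c;                        // d_ pointers: device memory only
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_a, bytes);                       // allocations on the GPU
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // explicit transfers toward the GPU
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 blocksPerGrid((n + 255) / 256);           // enough blocks to cover all n elements
    dim3 threadsPerBlock(256);                     // 1D blocks of 256 threads

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // blocking for the host: it also waits for the queued kernel to finish before copying back
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);   // release device memory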
But this means also that the host is waiting. So there is a series of APIs that are blocking, others are not blocking. The CUDA memcopies are blocking. So the CPU, once it gets here, is not going to move forward; it's going to wait for the completion of the operation. Again, data transfers are explicit and we have all the possible directions. And well, yes, you also have the copy host to host, I forgot that. But the most important are device to host and host to device, of course. Lastly, you have APIs for cleaning memory on the GPU. Maybe you want to reuse it and you need different allocations: you can clean the memory, for example. Sometimes you can use the device reset, but it's unlikely. This is the entire code. With respect to a classical C program, we have a very big portion of code that we have to add, but thanks to the CUDA API we can do it quite easily. So we have the cudaMalloc, the cudaMemcpy, the kernel call, the cudaMemcpy and the free. Of course, we always have to allocate, copy the data, then we can call the kernel, copy back the data and free the memory. This is the classical structure of a kernel. Then here you can do all the fancy stuff that you want, but the classic structure of the host application is this one. You always have a portion for the allocation, a portion for the memcpy, the kernel launch, the copy-back, and then the free. How do we compile CUDA code? Just as we compile C code, the only thing is that we have to make sure to use nvcc, that is the CUDA compiler. And the extension of our file is not going to be .c or .cpp, we have .cu for CUDA. This is the only different thing. You can put all the compiler flags that you want, some are recommended, others not, but you find them in the manual. To execute: dot slash, the name of the application and all the arguments that your application requires, as we do with C code. Last two or three things, I don't recall exactly how many. Each API call, so every function that has cuda at the beginning of the name, so cudaMemcpy, cudaFree, they all return a flag that is used for error handling. This is the classic macro that is used to catch the error: you can wrap this macro around the API call. In this way, if the operation throws an error, and there is a variety of errors that an API can throw, for example invalid pointer, because maybe you are using, I don't know, a host pointer and you cannot do that, or maybe there is insufficient memory on the device, that might be a scenario, this macro is able to catch the error, and then it's going to print the error and exit the program. Nice to have, and a lot of times it's helpful to have the macros that allow you to catch exceptions or anything that is going wrong in your program. In the same way, there is a macro that allows you to catch errors at kernel launch. So, for example, you can catch segmentation faults. It might happen, maybe you run into that error, and this macro is able to catch it. Or, for example, when you schedule too many threads within a block: it might happen, there is a limit that is 1024. So if you schedule 2048, this macro is going to help you catch the error, instead of the GPU just saying I'm not doing anything, I'm not going to compute that because it's way beyond my limits.
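A sketch of the kind of error-checking macro and of the compilation step described above; the macro name CHECK and the file name vector_add.cu are placeholders of mine, and the exact macro on the slides may differ:

    // needs <stdio.h> and <stdlib.h>; checks the flag returned by every CUDA runtime call
    #define CHECK(call)                                                     \
    {                                                                       \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s:%d: %s\n",                       \
                    __FILE__, __LINE__, cudaGetErrorString(err));           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    }

    // wrap the API calls so a failure is printed and the program exits
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));

    // catch launch errors (e.g. more than 1024 threads per block) right after the kernel call
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
    CHECK(cudaGetLastError());

Compilation and execution would then look like:

    nvcc vector_add.cu -o vector_add
    ./vector_add 1024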
This allows you to understand what is going wrong. Lastly, I don't know how many things are left, okay, sorry. So there are device APIs that allow you to query device information. We have a C structure that allows you to gather all the information for a given device. So you query the device and then you get all the properties. And this is to get the ID of the device, because you might have a multi-GPU system. So you can get the identifier and then get the properties of that device: for example the shared memory per block, the maximum number of registers, the memory per single multiprocessor; there are a lot of fields in this structure. And you can use them also in your application, for example if you want to allocate all the possible memory that you have on the GPU, you can do that, and you get this information from this API. So for example you get the major and the minor revision numbers, the name of the device, the total amount of memory, also the maximum number of threads, the number of registers per block, so a lot of stuff; you can see the specific fields in the CUDA library. You can do that also by command line. There is this nice tool that is nvidia-smi. For example, this returns data about the GPU with identifier 0. It's also nice to see: for example, it gives you the CUDA version. I don't think it gives you the compute capability, but you can see which program is running on the GPU. So if you want to use GPU 1 and it is busy computing something else, you can switch to the other. And yes, that's it. These are the credits: I mainly took the slides from these two manuals, some are from the course held by Professor Miele, and some of the material comes from the NVIDIA GPU teaching kit. Thanks for your attention, and if you have any questions, I have a few minutes left for answering.
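Going back to the device-query APIs mentioned above, a minimal sketch using the cudaDeviceProp structure; the fields printed are just a sample of what is available:

    int devCount = 0;
    cudaGetDeviceCount(&devCount);                 // how many GPUs are in the system

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);             // properties of device 0

    printf("Device 0: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Total global memory: %zu bytes\n", prop.totalGlobalMem);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);

From the command line, the equivalent query for GPU 0 would be something like: nvidia-smi -i 0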