/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
/*******************************************************************************
Description:

    This example uses the load/compute/store coding style, which is generally
    the most efficient for implementing kernels using HLS. The load and store
    functions are responsible for moving data in and out of the kernel as
    efficiently as possible. The core functionality is decomposed across one
    or more compute functions. Whenever possible, the compute function should
    pass data through HLS streams and should contain a single set of nested loops.

    HLS stream objects are used to pass data between producer and consumer
    functions. Stream read and write operations are blocking, which allows
    consumers and producers to synchronize with each other automatically.

    The dataflow pragma instructs the compiler to enable task-level pipelining.
    This is required for the load/compute/store functions to execute in a
    parallel and pipelined manner.

    The kernel operates on vectors of NUM_WORDS integers modeled using the
    hls::vector data type. This data type provides intuitive support for
    parallelism and is a good fit for the vector-add computation. The vector
    length is set to NUM_WORDS since NUM_WORDS integers amount to a total of
    64 bytes, which is the maximum size of a kernel port. It is good practice
    to match the compute bandwidth to the I/O bandwidth. Here the kernel loads,
    computes, and stores NUM_WORDS integer values per clock cycle.

    Note: in the code below, each stream element is a single unsigned int, and
    the load and store stages are named read_input and write_result.

    The kernel is implemented as shown below:

                                       _____________
                                      |             |<----- Input Vector 1 from Global Memory
                                      |  load_input |       __
                                      |_____________|----->|  |
                                       _____________       |  | in1_stream
Input Vector 2 from Global Memory --->|             |      |__|
                               __     |  load_input |        |
                              |  |<---|_____________|        |
                   in2_stream |  |     _____________         |
                              |__|--->|             |<--------
                                      | compute_add |      __
                                      |_____________|---->|  |
                                       ______________     |  | out_stream
                                      |              |<---|__|
                                      | store_result |
                                      |______________|-----> Output result to Global Memory
*******************************************************************************/
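
/*
The 64-byte port width mentioned in the description determines NUM_WORDS once
the element type is chosen. A minimal sketch of that arithmetic, assuming
32-bit elements; the names kPortWidthBytes, words_per_port, and kNumWords are
illustrative and do not appear in this file:

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: the widest kernel port is 64 bytes, as stated in the description.
constexpr std::size_t kPortWidthBytes = 64;

// Number of elements of type T that fit in one port-wide transfer.
template <typename T>
constexpr std::size_t words_per_port() {
    return kPortWidthBytes / sizeof(T);
}

// Hypothetical NUM_WORDS for 32-bit integers: 64 bytes / 4 bytes per element.
constexpr std::size_t kNumWords = words_per_port<std::uint32_t>();
static_assert(kNumWords == 16, "64 bytes hold sixteen 32-bit integers");
```

With 16-bit elements the same port would carry 32 values per transfer, so the
vector length would need to change with the element type to keep compute and
I/O bandwidth matched.
*/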
#include <stdint.h>
#include <hls_stream.h>

#define DATA_SIZE 4096

// Trip count for the LOOP_TRIPCOUNT pragmas below; it affects latency
// reports only, not the generated hardware.
const int c_size = DATA_SIZE;

// Load stage: reads `size` values from global memory and pushes them into a stream.
static void read_input(unsigned int* in, hls::stream<unsigned int>& inStream, int size) {
// The HLS tool auto-pipelines this loop
mem_rd:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        inStream << in[i];
    }
}

// Compute stage: pops one value from each input stream and pushes their sum.
static void compute_add(hls::stream<unsigned int>& inStream1,
                        hls::stream<unsigned int>& inStream2,
                        hls::stream<unsigned int>& outStream,
                        int size) {
// The HLS tool auto-pipelines this loop
execute:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        outStream << (inStream1.read() + inStream2.read());
    }
}

// Store stage: pops `size` results from the stream and writes them to global memory.
static void write_result(unsigned int* out, hls::stream<unsigned int>& outStream, int size) {
// The HLS tool auto-pipelines this loop
mem_wr:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out[i] = outStream.read();
    }
}
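
/*
The three stages above can be exercised on an ordinary host compiler without
the HLS headers. A minimal sketch, assuming a hypothetical SimpleStream class
as a stand-in for hls::stream (an unbounded std::queue, so the stages run one
after another instead of concurrently); the names ending in _model are
illustrative and are not part of this kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for hls::stream<T>: an unbounded FIFO.
template <typename T>
class SimpleStream {
public:
    void write(const T& v) { fifo_.push(v); }
    T read() {
        // A blocking read would wait for data; in this sequential model
        // an empty FIFO means the producer stage was not run first.
        assert(!fifo_.empty());
        T v = fifo_.front();
        fifo_.pop();
        return v;
    }
    bool empty() const { return fifo_.empty(); }

private:
    std::queue<T> fifo_;
};

// Load stage: memory to stream.
void read_input_model(const unsigned int* in, SimpleStream<unsigned int>& s, int size) {
    for (int i = 0; i < size; i++) s.write(in[i]);
}

// Compute stage: one element from each input stream per iteration.
void compute_add_model(SimpleStream<unsigned int>& a, SimpleStream<unsigned int>& b,
                       SimpleStream<unsigned int>& out, int size) {
    for (int i = 0; i < size; i++) {
        unsigned int x = a.read();
        unsigned int y = b.read();
        out.write(x + y);
    }
}

// Store stage: stream to memory.
void write_result_model(unsigned int* out, SimpleStream<unsigned int>& s, int size) {
    for (int i = 0; i < size; i++) out[i] = s.read();
}

// Runs the stages in producer-before-consumer order.
// Assumption: in1 and in2 have the same length.
std::vector<unsigned int> vadd_model(const std::vector<unsigned int>& in1,
                                     const std::vector<unsigned int>& in2) {
    const int size = static_cast<int>(in1.size());
    SimpleStream<unsigned int> s1, s2, so;
    std::vector<unsigned int> out(in1.size());
    read_input_model(in1.data(), s1, size);
    read_input_model(in2.data(), s2, size);
    compute_add_model(s1, s2, so, size);
    write_result_model(out.data(), so, size);
    return out;
}
```

Because the stand-in FIFO is unbounded, this model only checks functional
behavior. It says nothing about stream depth, stalls, or the overlap between
stages that the dataflow pragma provides in hardware.
*/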

extern "C" {

/*
    Vector Addition Kernel Implementation using dataflow

    Arguments:
        in1  (input)  --> Input Vector 1
        in2  (input)  --> Input Vector 2
        out  (output) --> Output Vector
        size (input)  --> Number of integers in each vector
*/
void vadd(unsigned int* in1, unsigned int* in2, unsigned int* out, int size) {
    static hls::stream<unsigned int> inStream1("input_stream_1");
    static hls::stream<unsigned int> inStream2("input_stream_2");
    static hls::stream<unsigned int> outStream("output_stream");

#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0

#pragma HLS dataflow
    // The dataflow pragma instructs the compiler to run the following four
    // function calls as concurrent, pipelined tasks connected by the streams.
    read_input(in1, inStream1, size);
    read_input(in2, inStream2, size);
    compute_add(inStream1, inStream2, outStream, size);
    write_result(out, outStream, size);
}
}
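
/*
A host program would typically compare the kernel output against a software
reference. A minimal sketch of such a check; the function name
matches_reference is hypothetical, and how the buffers are transferred to and
from the device is outside the scope of this sketch:

```cpp
#include <cstddef>
#include <vector>

// Returns true when hw_out[i] == in1[i] + in2[i] for every element.
// Unsigned addition wraps on overflow, which matches the kernel's
// unsigned int arithmetic.
bool matches_reference(const std::vector<unsigned int>& in1,
                       const std::vector<unsigned int>& in2,
                       const std::vector<unsigned int>& hw_out) {
    // Assumption: a length mismatch between buffers is treated as a failure.
    if (in1.size() != in2.size() || in1.size() != hw_out.size()) return false;
    for (std::size_t i = 0; i < in1.size(); i++) {
        if (hw_out[i] != in1[i] + in2[i]) return false;
    }
    return true;
}
```

Returning a single pass/fail flag keeps the sketch short; a fuller testbench
might also report the first mismatching index.
*/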