WIP: Fix cross dependent path #5

Draft: wants to merge 31 commits into master

Conversation

@andreaskuster (Collaborator) commented Sep 11, 2021

  • check ComputeGraph: setup_internal_buffers
  • check KernelChainGraph: compute_kernel_latency
  • check KernelChainGraph: compute_delay_buffer
  • apply persistent fix & remove manual fix for horidiff

@andreaskuster self-assigned this Sep 11, 2021
@andreaskuster (Collaborator, Author) commented Sep 11, 2021

@definelicht

Are the op_latency values in stencilflow/compute_graph.config accurate, or do we have to make them dynamic, i.e. a function of the device type or IP block we invoke?

Furthermore, we model the pipelines as processing 1 element per cycle (one push/pop per cycle). Is this accurate in all cases?

@andreaskuster requested review from definelicht and removed the request September 11, 2021 19:56
@definelicht (Contributor) commented:

In practice they are dynamic, but I think our current approach of just estimating them conservatively is "good enough". The true buffer space consumption comes from having to buffer large delays, which will always be orders of magnitude bigger than a conservative estimate of the operation latency. We could even just estimate any arithmetic operation on floating point numbers to be 16 cycles and be done with it :-)
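
For illustration, a minimal sketch of this conservative estimation, assuming a simple per-operation latency table; the operation names, values, and the estimate_kernel_latency helper are hypothetical, and the real estimates live in stencilflow/compute_graph.config:

```python
# Hypothetical sketch of conservative per-operation latency estimation.
# Operation names, values, and the helper are illustrative assumptions;
# the real estimates live in stencilflow/compute_graph.config.

CONSERVATIVE_OP_LATENCY = {"add": 16, "sub": 16, "mult": 16}
DEFAULT_FP_LATENCY = 16  # flat guess for "any arithmetic operation on floating point numbers"

def estimate_kernel_latency(ops):
    """Sum conservative latencies along a kernel's chain of operations."""
    return sum(CONSERVATIVE_OP_LATENCY.get(op, DEFAULT_FP_LATENCY) for op in ops)

print(estimate_kernel_latency(["mult", "mult", "add"]))  # 48 cycles
```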

@definelicht (Contributor) left a comment:

What did you want me to review here? I could only find some debugging code

@andreaskuster (Collaborator, Author) commented:

> In practice they are dynamic, but I think our current approach of just estimating them conservatively is "good enough". The true buffer space consumption comes from having to buffer large delays, which will always be orders of magnitude bigger than a conservative estimate of the operation latency. We could even just estimate any arithmetic operation on floating point numbers to be 16 cycles and be done with it :-)

Well, that might be an issue, since we have several paths: setting the arithmetic operation latency too high might create interlocks, because the delay buffer of the other path would then be over-estimated (i.e. its size set too high).

andreaskuster and others added 2 commits September 13, 2021 20:34
Co-authored-by: definelicht <definelicht@inf.ethz.ch>
Co-authored-by: definelicht <definelicht@inf.ethz.ch>
@andreaskuster (Collaborator, Author) commented:

> What did you want me to review here? I could only find some debugging code

Sorry about that: it is still WIP. I pushed the review button by mistake and removed the request immediately, but I guess the email notification had already been sent out.

@definelicht (Contributor) commented Sep 14, 2021

> Well, that might be an issue, since we have several paths: setting the arithmetic operation latency too high might create interlocks, because the delay buffer of the other path would then be over-estimated (i.e. its size set too high).

I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.

Adjust delay buffer computation.
@andreaskuster (Collaborator, Author) commented:

@definelicht FYI: the dace submodule link on master is invalid, i.e. it returns a 404.

@definelicht (Contributor) commented:

> @definelicht FYI: the dace submodule link on master is invalid, i.e. it returns a 404.

Fixed

@andreaskuster (Collaborator, Author) commented:

Furthermore, we have mpi4py as a Python dependency, but we do not check whether the MPI development headers are installed (for me, mpi.h is missing on fpga1).

@definelicht (Contributor) commented:

> Furthermore, we have mpi4py as a Python dependency, but we do not check whether the MPI development headers are installed (for me, mpi.h is missing on fpga1).

This is okay for me, but perhaps we can make it optional so it only fails if trying to run something distributed?

@andreaskuster (Collaborator, Author) commented:

> Furthermore, we have mpi4py as a Python dependency, but we do not check whether the MPI development headers are installed (for me, mpi.h is missing on fpga1).

> This is okay for me, but perhaps we can make it optional so it only fails if trying to run something distributed?

Yep, that makes sense
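
For illustration, a minimal sketch of the deferred-import idea, assuming a hypothetical run_distributed entry point; this is not the project's actual code:

```python
# Hypothetical sketch: require mpi4py (and hence mpi.h / an MPI installation)
# only when a distributed run is actually requested.

def run_distributed(program):
    try:
        from mpi4py import MPI  # deferred import: fails here, not at module load time
    except ImportError as err:
        raise RuntimeError(
            "Distributed execution requires mpi4py and the MPI development "
            "headers (mpi.h); install them or run in single-node mode."
        ) from err
    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()} running {program}")
    # ... the actual distributed execution would go here ...
```

With this pattern, importing the package and running single-node emulation never touch MPI at all; only a distributed invocation can fail with the missing-header error.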

@andreaskuster (Collaborator, Author) commented:

> Well, that might be an issue, since we have several paths: setting the arithmetic operation latency too high might create interlocks, because the delay buffer of the other path would then be over-estimated (i.e. its size set too high).

> I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.

To my understanding, the example below might stall in both cases, i.e. if the delay buffer is too small, but also if it is too big, right?

[Attached image: signal-2021-09-14-145557_001]

@definelicht (Contributor) commented:

> To my understanding, the example below might stall in both cases, i.e. if the delay buffer is too small, but also if it is too big, right?

The size of the delay buffer should have no influence on when the data arrives from c to k3. These are just FIFOs, so they can be read early. k3 will start consuming as soon as it has data available on both inputs, regardless of the size of the delay buffer. How did you calculate the 2048-cycle arrival at k3?
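
For illustration, a toy Python model (not StencilFlow code) of this point: an oversized FIFO hands its first element to the consumer as soon as it is produced, while a fixed-depth shift register delays every element by exactly its depth.

```python
# Toy illustration: FIFO capacity bounds occupancy, not latency.
from collections import deque

def fifo_first_arrival(capacity):
    """Cycle at which the consumer first pops from a FIFO of the given capacity."""
    fifo = deque(maxlen=capacity)
    for cycle in range(10_000):
        fifo.append(cycle)   # producer pushes one element per cycle
        if fifo:             # consumer pops as soon as anything is available
            fifo.popleft()
            return cycle

def shift_register_first_arrival(depth):
    """Cycle at which the first valid element exits a fixed-depth shift register."""
    reg = deque([None] * depth, maxlen=depth)
    for cycle in range(10_000):
        out = reg[0]
        reg.append(cycle)    # every element spends exactly `depth` cycles inside
        if out is not None:
            return cycle

print(fifo_first_arrival(16), fifo_first_arrival(2048))                      # 0 0
print(shift_register_first_arrival(16), shift_register_first_arrival(2048))  # 16 2048
```

In other words, over-provisioning a FIFO costs memory but never adds cycles.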

@andreaskuster (Collaborator, Author) commented:

> To my understanding, the example below might stall in both cases, i.e. if the delay buffer is too small, but also if it is too big, right?

> The size of the delay buffer should have no influence on when the data arrives from c to k3. These are just FIFOs, so they can be read early. k3 will start consuming as soon as it has data available on both inputs, regardless of the size of the delay buffer. How did you calculate the 2048-cycle arrival at k3?

OK, yes, you are right, but we are basically wasting memory.

@definelicht (Contributor) commented:

> OK, yes, you are right, but we are basically wasting memory.

Yep, this is the "safe" solution. Keep in mind that the buffer sizes from this will be tiny compared to buffering a slice of the 2D domain, so I'm not concerned about this being a significant factor :-)
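
For a rough sense of scale, a back-of-envelope comparison with hypothetical numbers (domain size, latency estimate, and operation count are assumptions, not measurements):

```python
# Hypothetical back-of-envelope: delay buffers that absorb conservative
# operation latencies vs. the row buffers needed to cover a stencil's extent.

nx, ny = 4096, 4096      # assumed 2D domain
op_latency = 16          # conservative per-operation latency estimate
ops_on_path = 8          # assumed number of operations along a path

delay_buffer_elems = ops_on_path * op_latency  # 128 elements
row_buffer_elems = nx                          # one row of the domain: 4096 elements

print(delay_buffer_elems, row_buffer_elems)    # 128 4096
```

Even with these modest assumptions, buffering a slice of the domain dominates the latency-compensation buffers.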

@andreaskuster (Collaborator, Author) commented:

```
Compiling reference SDFG...
Loading input arrays...
Initializing output arrays...
Executing DaCe program...
Finished running program.
Executing reference DaCe program...
Finished running program.
Results saved to results/horidiff
Reference results saved to results/horidiff/reference
Comparing to reference SDFG...
Results verified.
bin/run_program.py test/stencils/horidiff.json emulation -compare-to-referenc  47.84s user 11.63s system 214% cpu 27.689 total
```

It seems to be fixed; please reproduce this finding before we merge :)

@andreaskuster (Collaborator, Author) commented:

> [emulation output quoted above]
>
> It seems to be fixed; please reproduce this finding before we merge :)

This is for a minimum channel depth of 1024.
