
Update Iterator::eq implementation for better performance #44729


Closed
bluss opened this issue Sep 20, 2017 · 12 comments
Labels
- C-bug: Category: This is a bug.
- E-easy: Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
- E-mentor: Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.
- I-slow: Issue: Problems and improvements with respect to performance of generated code.
- T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@bluss
Member

bluss commented Sep 20, 2017

The itertools equal function expresses the same functionality as Iterator::eq in a way that compiles better / performs better. Import their implementation into libstd, also see if other iterator comparison functions can be improved.

The problem could be that the libstd code is using match on a tuple of options.

@bluss added the C-bug, E-easy, E-mentor, I-slow, and T-libs-api labels Sep 20, 2017
@bluss
Member Author

bluss commented Sep 20, 2017

The itertools code is not the end of the road, of course; there could be further improvements on top of it.

@mister-walter

I'd like to take a shot at this - I'll go check out Gitter when I have time later today (I'm currently on JST)

@lezgomatt
Contributor

hey @mister-walter, are you still interested in working on this? If not, I'd like to give this issue a try 😃

@bluss
Member Author

bluss commented Oct 1, 2017

Feel free to find me on irc or here

@mister-walter

mister-walter commented Oct 1, 2017 via email

@lezgomatt
Contributor

lezgomatt commented Oct 1, 2017

Awesome, thanks!

I tried a few approaches out, and it seems that matching on the tuple does indeed cause slower code to be generated (based on some quick benchmarking).

The current implementation does the following:

```rust
loop {
    match (self.next(), other.next()) {
        (None, None) => return true,
        (None, _) | (_, None) => return false,
        (Some(x), Some(y)) => if x != y { return false },
    }
}
```

Modifying the code to separate the matching resulted in some nice gains:

```rust
loop {
    match self.next() {
        None => return other.next().is_none(),
        Some(x) => match other.next() {
            None => return false,
            Some(y) => if x != y { return false },
        },
    };
}
```

Alternatively (to reduce nesting):

```rust
loop {
    let x = match self.next() {
        None => return other.next().is_none(),
        Some(val) => val,
    };

    let y = match other.next() {
        None => return false,
        Some(val) => val,
    };

    if x != y {
        return false;
    }
}
```

I haven't taken a good look at the MIR generated though. I also tested it out on the other comparison functions and noticed similar gains. I'll try to take a deeper look at the code generated soon.
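For anyone who wants to reproduce this locally, a rough standalone micro-benchmark along these lines shows the shape of the comparison (a hypothetical sketch, not the benchmark actually used; `eq_tuple` mirrors the tuple-match code above and `eq_nested` the rewritten version; compile with `--release` for meaningful numbers):

```rust
use std::time::Instant;

// Tuple-match version, mirroring the current libstd shape.
fn eq_tuple<I, J>(mut a: I, mut b: J) -> bool
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: PartialEq,
{
    loop {
        match (a.next(), b.next()) {
            (None, None) => return true,
            (None, _) | (_, None) => return false,
            (Some(x), Some(y)) => if x != y { return false },
        }
    }
}

// Nested-match version from the comment above.
fn eq_nested<I, J>(mut a: I, mut b: J) -> bool
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: PartialEq,
{
    loop {
        match a.next() {
            None => return b.next().is_none(),
            Some(x) => match b.next() {
                None => return false,
                Some(y) => if x != y { return false },
            },
        }
    }
}

fn main() {
    let v: Vec<u64> = (0..1_000_000).collect();
    let w = v.clone();

    let t = Instant::now();
    let r1 = eq_tuple(v.iter().copied(), w.iter().copied());
    let d1 = t.elapsed();

    let t = Instant::now();
    let r2 = eq_nested(v.iter().copied(), w.iter().copied());
    let d2 = t.elapsed();

    // Both must agree that the equal vectors compare equal.
    assert!(r1 && r2);
    println!("tuple: {:?}, nested: {:?}", d1, d2);
}
```

Timings from a harness this small are noisy; treat it as a starting point rather than a proper benchmark.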

@bluss
Member Author

bluss commented Oct 1, 2017

Oh nice, the new versions are side-effect compatible as well, right? `self.next()` and `other.next()` are called exactly as many times as before.
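One way to sanity-check the call counts is a small counting wrapper around an iterator (a hypothetical sketch; `Counted`, `eq_tuple`, and `eq_nested` are illustrative names, with `eq_tuple` mirroring the tuple-match code quoted earlier):

```rust
use std::cell::Cell;

// Wrapper that counts how many times `next` is called, so the two
// implementations can be compared for side-effect compatibility.
struct Counted<'a, I> {
    inner: I,
    calls: &'a Cell<usize>,
}

impl<'a, I: Iterator> Iterator for Counted<'a, I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        self.calls.set(self.calls.get() + 1);
        self.inner.next()
    }
}

// Tuple-match version (the current libstd shape).
fn eq_tuple<I, J>(mut a: I, mut b: J) -> bool
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: PartialEq,
{
    loop {
        match (a.next(), b.next()) {
            (None, None) => return true,
            (None, _) | (_, None) => return false,
            (Some(x), Some(y)) => if x != y { return false },
        }
    }
}

// Nested-match version (the proposed rewrite).
fn eq_nested<I, J>(mut a: I, mut b: J) -> bool
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: PartialEq,
{
    loop {
        match a.next() {
            None => return b.next().is_none(),
            Some(x) => match b.next() {
                None => return false,
                Some(y) => if x != y { return false },
            },
        }
    }
}

fn main() {
    let data = [1u32, 2, 3];
    let other = [1u32, 2, 4];

    let (ca, cb) = (Cell::new(0), Cell::new(0));
    eq_tuple(
        Counted { inner: data.iter().copied(), calls: &ca },
        Counted { inner: other.iter().copied(), calls: &cb },
    );
    let tuple_counts = (ca.get(), cb.get());

    let (ca, cb) = (Cell::new(0), Cell::new(0));
    eq_nested(
        Counted { inner: data.iter().copied(), calls: &ca },
        Counted { inner: other.iter().copied(), calls: &cb },
    );
    let nested_counts = (ca.get(), cb.get());

    // Both versions should call `next` the same number of times.
    assert_eq!(tuple_counts, nested_counts);
    println!("calls (self, other): {:?}", tuple_counts);
}
```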

@lezgomatt
Contributor

Yes, that should be correct. They should be equivalent.

I tried comparing the generated assembly (with the function monomorphized for iterators over an integer vector and array, inlining disabled, and optimized for release) and noticed that the code generated for the non-tuple matches[1] was, as expected, identical. On the other hand, the tuple match generated something a bit more convoluted. I didn't bother checking the MIR further, since I don't think any optimizations happen there right now that would affect it.

Do you have any suggestions for further investigating this issue? Or any tests you'd like me to try? If not, I'll submit a PR to manually optimize away the tuple matching in the implementations of the following functions: `cmp`, `partial_cmp`, `eq`, `ne`, `lt`, `le`, `gt`, `ge`. Also, which implementation do you think is preferable: the shorter but more nested version, or the longer but less nested one?

[1] This includes the latter two code blocks from my earlier comment, as well as the while-let and if-let versions originally posted.
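Applied to `partial_cmp`, the same de-tupled pattern might look roughly like this (a free-function sketch of the approach, not the actual PR code; `partial_cmp_iters` is an illustrative name):

```rust
use std::cmp::Ordering;

// Sketch: the "avoid matching on a tuple" rewrite applied to a
// `partial_cmp`-style lexicographic comparison of two iterators.
fn partial_cmp_iters<I, J>(mut a: I, mut b: J) -> Option<Ordering>
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
    I::Item: PartialOrd,
{
    loop {
        let x = match a.next() {
            // Left side ran out: shorter is Less unless both ended.
            None => {
                return if b.next().is_none() {
                    Some(Ordering::Equal)
                } else {
                    Some(Ordering::Less)
                }
            }
            Some(val) => val,
        };

        let y = match b.next() {
            // Right side ran out first: left is Greater.
            None => return Some(Ordering::Greater),
            Some(val) => val,
        };

        match x.partial_cmp(&y) {
            Some(Ordering::Equal) => (),
            non_eq => return non_eq,
        }
    }
}

fn main() {
    assert_eq!(
        partial_cmp_iters([1, 2].iter(), [1, 2, 3].iter()),
        Some(Ordering::Less)
    );
    assert_eq!(
        partial_cmp_iters([1, 3].iter(), [1, 2, 3].iter()),
        Some(Ordering::Greater)
    );
    println!("ok");
}
```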

@mdevlamynck
Contributor

I was following this a bit; should an issue be opened to investigate why there is such a difference in performance between matching on a tuple and using nested matches? In my modest (and maybe not that representative) benchmarks, matching on tuples was at least 2x slower.

@lezgomatt
Contributor

Yeah, I would've expected them to optimize into the same thing. I'm not sure whether it's an issue with rustc or with LLVM, though. Looking at the assembly didn't really give me any clues about what's preventing the compiler from optimizing the code further, either (to be honest, I'm not very experienced with assembly).

@andjo403
Contributor

andjo403 commented Oct 3, 2017

When looking at the MIR from https://godbolt.org/g/W8tiyS, it feels like rustc could generate MIR that is easier to optimize. Maybe it is possible to remove the packing/unpacking of data into the tuple; that looks like the biggest difference in the MIR.

@lezgomatt
Contributor

lezgomatt commented Oct 7, 2017

According to the MIR RFC, the packing seems to be intentional: https://github.com/nox/rust-rfcs/blob/master/text/1211-mir.md#aggregates-and-further-lowering. Interestingly enough, even in a simpler case the compiler produces less efficient code when matching on tuples: https://godbolt.org/g/jvcLuQ.

Looking at the generated MIR, aside from the tuple packing, we might be able to optimize branches that may lead to a branch marked unreachable so that they jump directly to the reachable branch instead. I doubt that will actually improve things, but I think it's still worth a shot.
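For illustration, the "simpler case" is the kind of function pair below: the same logic written once as a tuple match and once as nested matches (a hypothetical reconstruction with illustrative names, not the exact godbolt snippet):

```rust
// Tuple-match form: packs both Options into a tuple before matching.
fn opt_eq_tuple(a: Option<u32>, b: Option<u32>) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}

// Nested-match form: matches each Option in turn, no tuple involved.
fn opt_eq_nested(a: Option<u32>, b: Option<u32>) -> bool {
    match a {
        None => b.is_none(),
        Some(x) => match b {
            None => false,
            Some(y) => x == y,
        },
    }
}

fn main() {
    // The two forms agree on every combination of inputs.
    for &a in [None, Some(1u32), Some(2)].iter() {
        for &b in [None, Some(1u32), Some(2)].iter() {
            assert_eq!(opt_eq_tuple(a, b), opt_eq_nested(a, b));
        }
    }
    println!("ok");
}
```

Behaviorally identical, so any codegen difference between the two comes purely from how the match is lowered.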

bors added a commit that referenced this issue Oct 12, 2017
Optimize comparison functions of Iterator

Replaced matching on tuples, which led to less performant code generation. Testing on microbenchmarks consistently showed a ~1.35x performance improvement on my machine.

Fixes #44729.
Centril added a commit to Centril/rust that referenced this issue Apr 2, 2019

Remove duplicated code from Iterator::{ne, lt, le, gt, ge}

This PR delegates `Iterator::ne` to `Iterator::eq` and `Iterator::{lt, le, gt, ge}` to `Iterator::partial_cmp`.

Oddly enough, this change actually simplifies the generated assembly [in some cases](https://rust.godbolt.org/z/riBtNe), although I don't understand assembly well enough to tell whether the longer assembly was doing something clever.

I also added two extremely simple benchmarks:
```
// before
test iter::bench_lt               ... bench:      98,404 ns/iter (+/- 21,008)
test iter::bench_partial_cmp      ... bench:      62,437 ns/iter (+/- 5,009)

// after
test iter::bench_lt               ... bench:      61,757 ns/iter (+/- 8,770)
test iter::bench_partial_cmp      ... bench:      62,151 ns/iter (+/- 13,753)
```

I have no idea why the current `lt`/`le`/`gt`/`ge` implementations don't seem to compile optimally, but simply having them call `partial_cmp` seems to be an improvement.

See rust-lang#44729 for a previous discussion.
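At the level of free functions, the delegation described in this PR amounts to something like the following sketch (illustrative names, not the actual trait method bodies):

```rust
use std::cmp::Ordering;

// Minimal `eq` over two iterators (nested-match shape from the issue).
fn eq_iters<I: Iterator<Item = i32>, J: Iterator<Item = i32>>(mut a: I, mut b: J) -> bool {
    loop {
        match a.next() {
            None => return b.next().is_none(),
            Some(x) => match b.next() {
                None => return false,
                Some(y) => if x != y { return false },
            },
        }
    }
}

// Minimal lexicographic `partial_cmp` over two iterators.
fn partial_cmp_iters<I: Iterator<Item = i32>, J: Iterator<Item = i32>>(
    mut a: I,
    mut b: J,
) -> Option<Ordering> {
    loop {
        let x = match a.next() {
            None => {
                return Some(if b.next().is_none() { Ordering::Equal } else { Ordering::Less })
            }
            Some(v) => v,
        };
        let y = match b.next() {
            None => return Some(Ordering::Greater),
            Some(v) => v,
        };
        match x.partial_cmp(&y) {
            Some(Ordering::Equal) => {}
            other => return other,
        }
    }
}

// The delegations: no duplicated loop logic.
fn ne_iters<I: Iterator<Item = i32>, J: Iterator<Item = i32>>(a: I, b: J) -> bool {
    !eq_iters(a, b)
}

fn lt_iters<I: Iterator<Item = i32>, J: Iterator<Item = i32>>(a: I, b: J) -> bool {
    partial_cmp_iters(a, b) == Some(Ordering::Less)
}

fn main() {
    assert!(ne_iters([1, 2].iter().copied(), [1, 3].iter().copied()));
    assert!(lt_iters([1, 2].iter().copied(), [1, 2, 3].iter().copied()));
    println!("ok");
}
```

The point of the PR is exactly this shape: one comparison loop each for `eq` and `partial_cmp`, with the other methods defined in terms of them.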