Introduce ARM Neon and SSE2 SIMD. #743
Conversation
The gain seems to be 7% on real-world benchmarks:
Also note that I did one more refactoring to make the introduction of SIMD easier, so you still have a conflict.
ext/json/ext/generator/simd.h
Outdated
uint8x16x4_t load_uint8x16_4(const unsigned char *table, int offset) {
    uint8x16x4_t tab;
    for (int i = 0; i < 4; i++) {
        tab.val[i] = vld1q_u8(table + offset + (i * 16));
    }
    return tab;
}
Isn't that just vld4q_u8?
https://developer.arm.com/architectures/instruction-sets/intrinsics/vld4q_u8
Unfortunately it's not. vld4q_u8 interleaves the data among the 4 vector registers.
% cat load-test.c
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

void print_vec(char *msg, uint8x16_t vec) {
    printf("%s\n[ ", msg);
    uint8_t store[16] = {0};
    vst1q_u8(store, vec);
    for (int i = 0; i < 16; i++) {
        printf("%3d ", store[i]);
    }
    printf("]\n");
}

uint8x16x4_t load_table(uint8_t *table, int offset) {
    uint8x16x4_t tab;
    for (int i = 0; i < 4; i++) {
        tab.val[i] = vld1q_u8(table + offset + (i * 16));
    }
    return tab;
}

int main(void) {
    uint8_t table[256];
    for (int i = 0; i < 256; i++) {
        table[i] = i;
    }

    uint8x16x4_t tab1 = load_table(table, 0);
    print_vec("tab1.val[0]", tab1.val[0]);
    print_vec("tab1.val[1]", tab1.val[1]);
    print_vec("tab1.val[2]", tab1.val[2]);
    print_vec("tab1.val[3]", tab1.val[3]);

    printf("\n");

    uint8x16x4_t tab1_2 = vld4q_u8(table);
    print_vec("tab1_2.val[0]", tab1_2.val[0]);
    print_vec("tab1_2.val[1]", tab1_2.val[1]);
    print_vec("tab1_2.val[2]", tab1_2.val[2]);
    print_vec("tab1_2.val[3]", tab1_2.val[3]);

    return 0;
}
% ./load-test
tab1.val[0]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]
tab1.val[1]
[ 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ]
tab1.val[2]
[ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ]
tab1.val[3]
[ 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 ]
tab1_2.val[0]
[ 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 ]
tab1_2.val[1]
[ 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 ]
tab1_2.val[2]
[ 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 ]
tab1_2.val[3]
[ 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 ]
Wow, that's so weird.
Well, maybe that loop should be unrolled then. I suspect the compiler does it anyway, but we might as well be explicit.
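For illustration, a minimal sketch of what the explicit unrolling could look like (same semantics as the loop above, not necessarily what was committed):

// Hypothetical unrolled variant of load_uint8x16_4: four contiguous
// 16-byte loads, with no loop left for the compiler to (maybe) unroll.
uint8x16x4_t load_uint8x16_4_unrolled(const unsigned char *table, int offset) {
    uint8x16x4_t tab;
    tab.val[0] = vld1q_u8(table + offset);
    tab.val[1] = vld1q_u8(table + offset + 16);
    tab.val[2] = vld1q_u8(table + offset + 32);
    tab.val[3] = vld1q_u8(table + offset + 48);
    return tab;
}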
Can you just include the implementation for the regular escaping? I'm not sure the script-safe version is quite worth it.
…tion. Also store the potential matches directly rather than looking up values in the escape table.
ext/json/ext/generator/generator.c
Outdated
if ((ch_len = search_escape_basic_neon_advance_lut(search)) != 0) {
    return ch_len;
}

// if ((ch_len = search_escape_basic_neon_advance_rules(search)) != 0) {
//     return ch_len;
// }
Seems like it's a toss-up which one is best. It might be an artifact of my M1 MacBook Air being passively cooled; it gets warm after I run it over and over.
Comparison between the two:

Running it a second time:
…e only need 128 bytes for the lookup table as the top 128 bytes are all zeros.
Not sure why, but it's way more modest on my machine (Air M3):
Apologies for going dark for a while. I've been trying to make incremental improvements on a different branch (found here). My hope was that using a move mask would be faster than […]. Feel free to try it out though.
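For context: NEON has no direct counterpart to SSE2's _mm_movemask_epi8, so a "move mask" has to be emulated. A common sketch of the general technique (my assumption, not necessarily what that branch does) narrows each 16-bit lane with vshrn_n_u16, yielding a 64-bit scalar mask with 4 bits per input byte:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: turn a 0x00/0xFF per-byte comparison result into a scalar mask.
   vshrn_n_u16 keeps bits 4..11 of each 16-bit lane, so every matching byte
   contributes a 0xF nibble; the result is zero iff no byte matched. */
static inline uint64_t neon_match_mask(uint8x16_t matches) {
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(matches), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
}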
That's no worries at all. I want to release a […] first; after that I think I can start merging some SIMD stuff. I'd like to go with the smallest possible useful SIMD acceleration to ensure it doesn't cause issues for people. If it works well, we can then go further. So yeah, no rush.
@byroot if you have a few minutes, would you be able to check out this branch and benchmark it against master? You'll have to tweak your compare script a bit to compile this branch with […]. This branch uses the bit-twiddling, platform-agnostic SIMD-style code if the SIMD code is disabled via a […]. The results on my M1:
With that compilation flag and compared to […]:
From a co-worker with an M4 Pro:
From another co-worker with an M1 Pro:
I just pushed a PR #769 to this repo which also employs SIMD to speed up string escapes. I am really sorry that we both worked in that area at the same time; after I started my work I didn't check back with this repo for a while (and I should have done that). I believe the main difference between my PR and yours is that mine supports x86 as well, by using a cross-platform shim […]. I want to suggest we collaborate on getting SIMD support in, one way or another. 👋
Hi @radiospiel, I'll take a look at #769. I originally started working on #730, which supports Neon, SSE 4.2 and AVX2 with runtime detection. That PR got a bit big, so I decided to close it and implement each instruction set individually. Additionally, @byroot refactored the code quite a bit, which made the SIMD implementation much easier. There are two implementations in this PR: one uses a lookup table and the other is rule-based. Both seem to have similar performance on my machine, where I see an 11%-48% improvement depending on the benchmark. A few of my co-workers saw various speedups depending on their machines. I should probably mark this PR as "Ready for Review". However, I'm happy to collaborate either on this or your PR. Edit: oh yeah, there is an old-school bit-twiddling SIMD approach in pure C: #738
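For readers unfamiliar with the bit-twiddling approach: the usual trick is SWAR (SIMD within a register), testing 8 bytes per iteration in a plain uint64_t. A minimal sketch of the idea, assuming the escape conditions are bytes below 0x20, '"', and '\\' (an illustration of the technique, not the code from #738):

#include <stdint.h>
#include <string.h>

/* SWAR sketch: returns a word whose high bit is set in every byte that
   needs escaping (byte < 0x20, '"' == 0x22, or '\\' == 0x5C). */
static inline uint64_t json_escape_mask(const unsigned char *p) {
    uint64_t w;
    memcpy(&w, p, 8);                        /* safe unaligned 8-byte load */
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t high = 0x8080808080808080ULL;
    uint64_t is_ctrl   = (w - 0x20 * ones) & ~w & high;  /* byte < 0x20  */
    uint64_t xq        = w ^ ('"'  * ones);
    uint64_t xb        = w ^ ('\\' * ones);
    uint64_t is_quote  = (xq - ones) & ~xq & high;       /* byte == '"'  */
    uint64_t is_bslash = (xb - ones) & ~xb & high;       /* byte == '\\' */
    return is_ctrl | is_quote | is_bslash;
}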
Thank you, @samyron. I became painfully aware of the work you did when I tried to merge master into my branch, because the interfaces of the escape functions had changed; my implementation relies on an "escape me a […]" style interface. The main difference between your approach and mine is that you switch out the search functionality depending on the availability of SIMD, while I switch out the SIMD primitives instead. This allows me to have working implementations for x86, ARM, and bit-twiddling; but only a handful of primitives are available because NEON and AVX are different, so your approach should allow for optimal per-hardware-type implementations. I have a busy week ahead of me, but I will definitely take a look at the end of the week. I will also benchmark on Graviton instances; most ARM server workloads are probably not on an Apple Silicon CPU after all :) Happy to benchmark this PR as well. Can you share a benchmark script that produces the most useful output for you? I would be especially interested in understanding how you get the "before" and "after" entries in the benchmark output :) Speaking of benchmarks:
This is magnitudes more than the numbers posted here. I have seen a 48% improvement posted above (on the […]).
Apologies, yes, that was a typo. I'll fix it in the comment above.
@samyron I reran benchmarks (link). Both our PRs show a substantial improvement over the baseline; the only significant difference is on short strings.
strings.short is a test on a 13-byte string. I believe such short strings are relevant, because JSON object keys are probably quite often shorter than 16 bytes; my PR applies SIMD for strings of 8 bytes and more (link). (The value of 8 seemed beneficial and looked nice, but I should probably retest this with smaller values.) Maybe you could support that as well?
@byroot we have two competing implementations of the same approach. While mine is probably more beneficial in the short term (because it also supports x86), I believe that @samyron's approach has more future potential, because it allows handcrafted SIMD implementations that are fundamentally different between NEON and SSE2. (And it certainly can be extended to also support shorter strings, see comment above.) Also, transplanting an x86 implementation from my PR into @samyron's shouldn't be too hard to achieve. I see the following alternatives:
What do you all think about that? ☝️
#ifdef ENABLE_SIMD

#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
Force-pushed from 42744f6 to 51635ad.
Force-pushed from 51635ad to c999baf.
Alright, I think it looks good to me. I've pushed a small simplification for NEON which I'd like your opinion on. If you think it's OK then I need to do the same change for SSE2; otherwise we revert back to checking the popcount. Other than that, I fixed a few typos and added a CI job that disables SIMD. Once we're settled on that popcount thing, I'll clean up the git history and merge.
I'm good either way. I'm not sure we should really focus on optimizing for synthetic worst-case benchmarks; I'm just trying to avoid any case where the SIMD code performs worse than the scalar implementation.
ext/json/ext/generator/generator.c
Outdated
uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote);
uint8x16_t needs_escape = vorrq_u8(too_low, vorrq_u8(has_backslash, has_dblquote));

vandq_u8(needs_escape, vdupq_n_u8(0x1));
Without the popcount-based code, this line isn't necessary.
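To make the trade-off concrete, here is a sketch of the two reduction strategies under discussion (my reconstruction from the diff context, not the PR's exact code):

#include <arm_neon.h>
#include <stdint.h>

/* Simple check: non-zero iff any lane of needs_escape matched. */
static inline int any_needs_escape(uint8x16_t needs_escape) {
    return vmaxvq_u8(needs_escape) != 0;
}

/* Popcount-style check: mask the 0xFF lanes down to 0x01 (which is what
   the vandq_u8(needs_escape, vdupq_n_u8(0x1)) line above is for), so a
   horizontal add yields the number of bytes needing escape. */
static inline unsigned count_needs_escape(uint8x16_t needs_escape) {
    return vaddvq_u8(vandq_u8(needs_escape, vdupq_n_u8(0x1)));
}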
I pushed a commit to simplify updating […].
Additionally, if we decide to not use the […].
That makes sense. I also don't think we should optimize for the worst-case scenario at the expense of code simplicity. I can't really imagine it's common to need this much escaping in a 16-character sequence. As long as the macro-benchmarks don't regress, I'm happy.
Force-pushed from d41c593 to 142dce7.
Alright, I have removed the popcount code. One thing I wonder now is: do we really need that […]?
Never mind, I understand now. We need to have that info when we call back into […].
Force-pushed from 142dce7 to e50b5df.
I refactored […]. But this can wait. I'm satisfied with the current PR; let me know if you don't have anything else to add either.
What the hell? Why does […]?
It's failing on master too now. I suspect GitHub updated the compiler or something -_-.
I have nothing more to add at this time. Anything additional in the future can be a follow-up.
Thank you both for all the work on this. Now I'll try to update json in ruby/ruby.
(ruby/json#743)

See the pull request for the long development history: ruby/json#743

```
== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision ruby/json@d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.913k i/100ms
Calculating -------------------------------------
               after     29.377k (± 2.0%) i/s   (34.04 μs/i) -    148.563k in   5.059169s

Comparison:
              before:    23314.1 i/s
               after:    29377.3 i/s - 1.26x faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision ruby/json@d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   152.000 i/100ms
Calculating -------------------------------------
               after      1.569k (± 0.8%) i/s  (637.49 μs/i) -      7.904k in   5.039001s

Comparison:
              before:     1485.6 i/s
               after:     1568.7 i/s - 1.06x faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision ruby/json@d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   309.000 i/100ms
Calculating -------------------------------------
               after      3.115k (± 3.1%) i/s  (321.01 μs/i) -     15.759k in   5.063776s

Comparison:
              before:     2508.3 i/s
               after:     3115.2 i/s - 1.24x faster
```

ruby/json@49003523da
Awesome. Thank you for seeing this through; I know it took a while, with some very messy PRs. I'm happy to jump back in and fix issues and/or take this further with different implementations in the future. There is an additional AVX2 implementation in the original PR that I can re-implement within the new searching code. Additionally, there is an SSE4.2 instruction that may also be useful: pcmpestri.
Sure.
I don't know how many implementations it's really worth having. I think it makes sense to have SSE2 as the baseline, since the overwhelming majority of x86-64 CPUs will have it, and then probably one other that hits a sweet spot between efficiency and availability. E.g. I'm not sure it's worth doing AVX-512 given it's not even in some newly released CPUs. So we can probably include an SSE4.2 implementation or an AVX2 implementation, but both wouldn't be worth it, I think.
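As a sketch of why SSE2 works as a baseline here: the same needs-escape test maps onto plain SSE2 intrinsics, with the unsigned "byte < 0x20" comparison expressed via _mm_min_epu8 (again, an illustration of the technique, not necessarily the PR's exact code):

#include <emmintrin.h>  /* SSE2 only */

/* Returns a 16-bit mask with one bit per byte that needs escaping. */
static inline int sse2_needs_escape_mask(__m128i chunk) {
    const __m128i quote  = _mm_set1_epi8('"');
    const __m128i bslash = _mm_set1_epi8('\\');
    const __m128i limit  = _mm_set1_epi8(0x1F);
    /* min(chunk, 0x1F) == chunk holds exactly when chunk <= 0x1F unsigned. */
    __m128i too_low = _mm_cmpeq_epi8(_mm_min_epu8(chunk, limit), chunk);
    __m128i escape  = _mm_or_si128(too_low,
                      _mm_or_si128(_mm_cmpeq_epi8(chunk, quote),
                                   _mm_cmpeq_epi8(chunk, bslash)));
    return _mm_movemask_epi8(escape);
}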
OK, so it broke at least the WASM CI and the i686 one: ruby/ruby#13194. I'll see what I can do about it.
Version 2 of the introduction of ARM Neon SIMD.
There are currently two implementations: one based on a lookup table and one based on rules.
Benchmarks (Lookup table)
Benchmarks (Rules based)
I am still working on this but I wanted to share progress.
Edit: Looks like I missed one commit so I'll have to resolve some merge conflicts.
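For the curious, here is roughly how a lookup-table variant can classify 16 bytes at once on NEON. This is a sketch under my own assumptions (a 128-byte table where table[c] != 0 means byte value c must be escaped), not necessarily the exact code in this PR:

#include <arm_neon.h>

/* vqtbl4q_u8 returns 0 for out-of-range indices (>= 64), so bytes >= 0x80
   naturally classify as "no escape"; this is why only the first 128 table
   bytes are needed (the top 128 are all zeros). */
static inline uint8x16_t classify_chunk(uint8x16_t chunk,
                                        uint8x16x4_t tab_lo,  /* table[0..63]   */
                                        uint8x16x4_t tab_hi)  /* table[64..127] */
{
    uint8x16_t lo = vqtbl4q_u8(tab_lo, chunk);
    /* For bytes < 64 the subtraction wraps past 191, so hi contributes 0. */
    uint8x16_t hi = vqtbl4q_u8(tab_hi, vsubq_u8(chunk, vdupq_n_u8(64)));
    return vorrq_u8(lo, hi);
}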