[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align #119732
Conversation
@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)

Changes

Instead, allow any alignment >= the element size (in bytes). This is all that is needed for vector loads even if unaligned accesses are disabled: the Arm pseudocode checks alignment against the size of the element accessed, not the overall access size (see the reference and pseudocode quoted in the commit message below).
Full diff: https://github.com/llvm/llvm-project/pull/119732.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index e61dedb2477560..ee882dbae06537 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -2569,6 +2569,16 @@ MVT AArch64TargetLowering::getScalarShiftAmountTy(const DataLayout &DL,
bool AArch64TargetLowering::allowsMisalignedMemoryAccesses(
EVT VT, unsigned AddrSpace, Align Alignment, MachineMemOperand::Flags Flags,
unsigned *Fast) const {
+
+ // Allow SVE loads/stores where the alignment >= the size of the element type,
+ // even with +strict-align. The SVE loads/stores do not require memory to be
+ // aligned more than the element type even without unaligned accesses.
+ // Without this, already aligned loads/stores are forced to have 16-byte alignment,
+ // which is unnecessary and fails to build as TLI.expandUnalignedLoad() and
+ // TLI.expandUnalignedStore() don't yet support scalable vectors.
+ if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
+ return true;
+
if (Subtarget->requiresStrictAlign())
return false;
diff --git a/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
new file mode 100644
index 00000000000000..d9c7ef5ddc88c0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s
+
+define void @nxv16i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
+; CHECK-NEXT: st1b { z0.b }, p0, [x1]
+; CHECK-NEXT: ret
+ %l3 = load <vscale x 16 x i8>, ptr %ldptr, align 1
+ store <vscale x 16 x i8> %l3, ptr %stptr, align 1
+ ret void
+}
+
+define void @nxv8i16(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT: st1h { z0.h }, p0, [x1]
+; CHECK-NEXT: ret
+ %l3 = load <vscale x 8 x i16>, ptr %ldptr, align 2
+ store <vscale x 8 x i16> %l3, ptr %stptr, align 2
+ ret void
+}
+
+define void @nxv4i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT: st1w { z0.s }, p0, [x1]
+; CHECK-NEXT: ret
+ %l3 = load <vscale x 4 x i32>, ptr %ldptr, align 4
+ store <vscale x 4 x i32> %l3, ptr %stptr, align 4
+ ret void
+}
+
+define void @nxv2i64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT: st1d { z0.d }, p0, [x1]
+; CHECK-NEXT: ret
+ %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 8
+ store <vscale x 2 x i64> %l3, ptr %stptr, align 8
+ ret void
+}
+
+; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
+; define void @unaligned_nxv2i64(ptr %ldptr, ptr %stptr) {
+; ; CHECK-LABEL: nxv2i64:
+; ; CHECK: // %bb.0:
+; ; CHECK-NEXT: ptrue p0.d
+; ; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
+; ; CHECK-NEXT: st1d { z0.d }, p0, [x1]
+; ; CHECK-NEXT: ret
+; %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 4
+; store <vscale x 2 x i64> %l3, ptr %stptr, align 4
+; ret void
+; }
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from 163fe1d to 3efd02d:
[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align

Instead, allow any alignment >= the element size (in bytes). This is all that is needed for vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically:

```
// Check alignment on size of element accessed, not overall access size.
constant integer alignment = if accdesc.ispair then size DIV 2 else size;
```

The `size` passed to `Mem` by SVE load/store instructions is the element size.
Force-pushed from 3efd02d to 4b7fce2.
// even with +strict-align. The SVE loads/stores do not require memory to be
// aligned more than the element type even without unaligned accesses.
// Without this, already aligned loads and stores are forced to have 16-byte
// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
nit on the comment; this is true for SVE's ld1/st1 instructions, but not for SVE str/ldr, as those require the address to be 16-byte aligned (for data vectors, and 2-byte aligned for predicate vectors). So there is an assumption here that a store of `<vscale x 4 x i32>` ends up using `st1`, which is true in practice if the store comes from the IR.
Rewritten the comment. Is it safe to assume `str z*` is only used for spills/fills (and possibly via an intrinsic, which I don't think uses these alignment checks)?
For scalable data vectors, ldr/str is only ever used for spills and fills. For predicate regs, there is only str/ldr, and so their addresses must be at least 2-byte aligned.
We should avoid doing anything to prevent an expanded use of ldr/str though. I know I've toyed with it in the past as a way to remove predicate data dependencies.

Not sure if this is relevant, but they could also come in via inline asm.
I don't think this prevents an expanded use of ldr/str for operations that are known to be 16-byte aligned. For operations with < 16-byte alignment, any expansion in expandUnalignedLoad/Store would negate any benefit of using ldr/str over predicated loads/stores anyway.

I don't think inline asm is relevant here; I think these hooks are mainly used for things like `StoreSDNode`/`LoadSDNode`.
Great news. Thanks for the update.
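As background for the exchange above about where these hooks are consulted, here is a simplified sketch (assumed shape, not verbatim LLVM source; the function name is made up for illustration) of how DAG legalization uses `allowsMisalignedMemoryAccesses()` for a `LoadSDNode`, and why a rejected scalable-vector access currently ends up in the unimplemented `expandUnalignedLoad()` path:

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"

using namespace llvm;

// Sketch of the legalizer's decision for a (possibly misaligned) load.
// If the target hook accepts the access, the load is kept as-is; otherwise
// the legalizer falls back to TLI.expandUnalignedLoad(), which does not yet
// handle scalable vectors -- the failure this patch avoids for
// element-aligned SVE loads under +strict-align.
static SDValue legalizeLoadAlignmentSketch(const TargetLowering &TLI,
                                           SelectionDAG &DAG, LoadSDNode *LD) {
  unsigned Fast = 0;
  if (TLI.allowsMisalignedMemoryAccesses(
          LD->getMemoryVT(), LD->getAddressSpace(), LD->getAlign(),
          LD->getMemOperand()->getFlags(), &Fast))
    return SDValue(LD, 0); // Access allowed: no expansion needed.

  // Fallback: expand the unaligned load (unsupported for scalable vectors).
  auto [Value, NewChain] = TLI.expandUnalignedLoad(LD, DAG);
  (void)NewChain;
  return Value;
}
```

Stores follow the same pattern via `expandUnalignedStore()`, which is why element-aligned SVE accesses need to be accepted by the hook rather than sent down this expansion path.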
// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
`Align(VT.getScalarSizeInBits() / 8)` will fail an assert when VT < MVT::i8 (like a predicate MVT::i1), so this would fail even when the +strict-align feature is not set. Could you add a test for this case?

Not caused by your patch, but I am surprised to see LLVM actually generate regular loads/stores when the alignment is smaller than the element size, e.g. `load <4 x i32>, ptr %ptr, align 1` when the `+strict-align` flag is not set at all. Is this a bug?
I don't think it's a bug. Looking at `AArch64.UnalignedAccessFaults()` (used as part of the FP loads): https://developer.arm.com/documentation/ddi0602/2023-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#AArch64.UnalignedAccessFaults.3 it looks like unaligned accesses are supported (depending on the configuration).

Also the LangRef states:

> The optional constant align argument specifies the alignment of the operation (that is, the alignment of the memory address). It is the responsibility of the code emitter to ensure that the alignment information is correct.
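As an aside on the `Align(VT.getScalarSizeInBits() / 8)` assert concern raised above, here is a minimal sketch of one way the check could skip sub-byte element types such as predicate vectors (the helper name is hypothetical, and this is not necessarily the change that landed):

```cpp
#include "llvm/CodeGen/ValueTypes.h"
#include "llvm/Support/Alignment.h"

using namespace llvm;

// Returns true for a scalable-vector access whose alignment is at least the
// element size in bytes. Predicate vectors (i1 elements) are skipped, since
// VT.getScalarSizeInBits() / 8 would be 0 and constructing Align(0) hits an
// assertion.
static bool isElementAlignedScalableAccess(EVT VT, Align Alignment) {
  if (!VT.isScalableVector())
    return false;
  uint64_t ElementSizeBits = VT.getScalarSizeInBits();
  if (ElementSizeBits < 8) // e.g. <vscale x 16 x i1> predicate vectors.
    return false;
  return Alignment >= Align(ElementSizeBits / 8);
}
```

Inside `allowsMisalignedMemoryAccesses()`, a guard like this would keep the early return for element-aligned SVE data vectors while leaving predicate vectors to the existing logic.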
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s

; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
; XFAIL: *
Rather than XFAILing this test, could you add a `not --crash` for the second RUN line, and a regular FileChecked output for the first RUN line?
Instead, allow any alignment >= the element size (in bytes). This is all that is needed for (predicated) vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically: the `size` passed to `Mem` by SVE load/store instructions is the element size (the pseudocode for this check is quoted in the commit message above).