
[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align #119732

Merged
MacDue merged 3 commits into llvm:main from fix_sve_loads_strict_align on Dec 16, 2024

Conversation

MacDue (Member) commented Dec 12, 2024

Instead, allow any alignment >= the element size (in bytes). This is all that is needed for (predicated) vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically:

// Check alignment on size of element accessed, not overall access size.
constant integer alignment = if accdesc.ispair then size DIV 2 else size;

The size passed to Mem by SVE load/store instructions is the element size.
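
For illustration, the quoted pseudocode rule can be restated as a tiny standalone sketch (plain C++, not part of the patch; the helper name is made up for this example):

```cpp
#include <cstdint>

// Mirrors the pseudocode above: the alignment requirement is derived from the
// size of the element accessed (halved for pair accesses), not from the
// overall access size.
constexpr uint64_t requiredAlignment(uint64_t accessSizeBytes, bool isPair) {
  return isPair ? accessSizeBytes / 2 : accessSizeBytes;
}

// For SVE ld1w/st1w the size passed to Mem is the 4-byte element size, so a
// <vscale x 4 x i32> load/store only needs 4-byte alignment, even when
// unaligned accesses are disabled.
static_assert(requiredAlignment(4, /*isPair=*/false) == 4,
              "element-size alignment check");
```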

llvmbot (Member) commented Dec 12, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)


Full diff: https://github.com/llvm/llvm-project/pull/119732.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+10)
  • (added) llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll (+64)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index e61dedb2477560..ee882dbae06537 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -2569,6 +2569,16 @@ MVT AArch64TargetLowering::getScalarShiftAmountTy(const DataLayout &DL,
 bool AArch64TargetLowering::allowsMisalignedMemoryAccesses(
     EVT VT, unsigned AddrSpace, Align Alignment, MachineMemOperand::Flags Flags,
     unsigned *Fast) const {
+
+  // Allow SVE loads/stores where the alignment >= the size of the element type,
+  // even with +strict-align. The SVE loads/stores do not require memory to be
+  // aligned more than the element type even without unaligned accesses.
+  // Without this, already aligned loads/stores are forced to have 16-byte alignment,
+  // which is unnecessary and fails to build as TLI.expandUnalignedLoad() and
+  // TLI.expandUnalignedStore() don't yet support scalable vectors.
+  if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
+    return true;
+
   if (Subtarget->requiresStrictAlign())
     return false;
 
diff --git a/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
new file mode 100644
index 00000000000000..d9c7ef5ddc88c0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s
+
+define void @nxv16i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.b
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0]
+; CHECK-NEXT:    st1b { z0.b }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 16 x i8>, ptr %ldptr, align 1
+  store <vscale x 16 x i8> %l3, ptr %stptr, align 1
+  ret void
+}
+
+define void @nxv8i16(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 8 x i16>, ptr %ldptr, align 2
+  store <vscale x 8 x i16> %l3, ptr %stptr, align 2
+  ret void
+}
+
+define void @nxv4i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT:    st1w { z0.s }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 4 x i32>, ptr %ldptr, align 4
+  store <vscale x 4 x i32> %l3, ptr %stptr, align 4
+  ret void
+}
+
+define void @nxv2i64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT:    st1d { z0.d }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 8
+  store <vscale x 2 x i64> %l3, ptr %stptr, align 8
+  ret void
+}
+
+; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
+; define void @unaligned_nxv2i64(ptr %ldptr, ptr %stptr) {
+; ; CHECK-LABEL: unaligned_nxv2i64:
+; ; CHECK:       // %bb.0:
+; ; CHECK-NEXT:    ptrue p0.d
+; ; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; ; CHECK-NEXT:    st1d { z0.d }, p0, [x1]
+; ; CHECK-NEXT:    ret
+;   %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 4
+;   store <vscale x 2 x i64> %l3, ptr %stptr, align 4
+;   ret void
+; }


github-actions bot commented Dec 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

MacDue force-pushed the fix_sve_loads_strict_align branch 2 times, most recently from 163fe1d to 3efd02d on December 12, 2024 at 20:12
[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align

Instead, allow any alignment >= the element size (in bytes). This is all
that is needed for vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically:
```
// Check alignment on size of element accessed, not overall access size.
constant integer alignment = if accdesc.ispair then size DIV 2 else size;
```

The `size` passed to `Mem` by SVE load/store instructions is the
element size.
MacDue force-pushed the fix_sve_loads_strict_align branch from 3efd02d to 4b7fce2 on December 12, 2024 at 20:16
Comment on lines 2574 to 2579
// even with +strict-align. The SVE loads/stores do not require memory to be
// aligned more than the element type even without unaligned accesses.
// Without this, already aligned loads and stores are forced to have 16-byte
// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
Collaborator

Nit on the comment: this is true for SVE's ld1/st1 instructions, but not for SVE str/ldr, as those require the address to be 16-byte aligned for data vectors (and 2-byte aligned for predicate vectors). So there is an assumption here that a store of <vscale x 4 x i32> ends up using st1, which is true in practice if the store comes from the IR.

Member Author

I've rewritten the comment. Is it safe to assume str z* is only used for spills/fills (and possibly via an intrinsic, which I don't think uses these alignment checks)?

Collaborator

For scalable data vectors, ldr/str is only ever used for spills and fills. For predicate registers, there is only str/ldr, so their addresses must be at least 2-byte aligned.

paulwalker-arm (Collaborator) commented Dec 13, 2024

We should avoid doing anything that prevents an expanded use of ldr/str, though. I know I've toyed with it in the past as a way to remove predicate data dependencies.

Not sure if this is relevant, but they could also come in via inline asm.

Member Author

I don't think this prevents an expanded use of ldr/str for operations that are known to be 16-byte aligned. For operations with < 16-byte alignment, any expansion in expandUnalignedLoad/Store would negate any benefit of using ldr/str over predicated loads/stores anyway.

I don't think inline asm is relevant here; these hooks are mainly used for things like StoreSDNode/LoadSDNode.

Collaborator

Great news. Thanks for the update.
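
To summarize the alignment requirements discussed in this thread, here is a rough sketch (plain C++, purely illustrative; the enum and the ldr/str figures reflect the reviewers' statements above rather than the architecture pseudocode):

```cpp
#include <cstdint>

// Predicated ld1/st1: alignment is checked against the element size.
// Unpredicated ldr/str of a data (Z) register: described above as needing
// 16-byte alignment, and in practice only generated for spills/fills.
// Predicate (P) register ldr/str: needs at least 2-byte alignment.
enum class SveAccessKind { Ld1St1, LdrStrData, LdrStrPredicate };

uint64_t requiredAlignBytes(SveAccessKind kind, uint64_t elementSizeBytes) {
  switch (kind) {
  case SveAccessKind::Ld1St1:
    return elementSizeBytes;
  case SveAccessKind::LdrStrData:
    return 16;
  case SveAccessKind::LdrStrPredicate:
    return 2;
  }
  return elementSizeBytes; // unreachable; keeps compilers happy
}
```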

// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
Collaborator

Align(VT.getScalarSizeInBits() / 8) will fail an assert when the element type is smaller than MVT::i8 (like a predicate's MVT::i1), so this would fail even when the +strict-align feature is not set. Could you add a test for this case?

Not caused by your patch, but I am surprised to see LLVM actually generate regular loads/stores when the alignment is smaller than the element size, e.g. load <4 x i32>, ptr %ptr, align 1 when the +strict-align flag is not set at all. Is this a bug?

Member Author

I don't think it's a bug. Looking at AArch64.UnalignedAccessFaults() (used as part of the FP loads): https://developer.arm.com/documentation/ddi0602/2023-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#AArch64.UnalignedAccessFaults.3 it looks like unaligned accesses are supported (depending on the configuration).

Also the langref states:

The optional constant align argument specifies the alignment of the operation (that is, the alignment of the memory address). It is the responsibility of the code emitter to ensure that the alignment information is correct.
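
As a reference for the assert concern raised above, a minimal standalone sketch (plain C++, not the committed LLVM code; the helper name and parameters are hypothetical) of how the element-size check can avoid producing a zero alignment for predicate vectors, whose i1 elements round down to zero bytes:

```cpp
#include <algorithm>
#include <cstdint>

// Clamp the element size to at least one byte so that i1 (predicate) elements
// cannot yield a zero-byte alignment requirement, which is what trips the
// Align(0) assert in the original patch.
bool sveElementAligned(uint64_t elementSizeBits, uint64_t alignBytes) {
  uint64_t elementBytes = std::max<uint64_t>(elementSizeBits / 8, 1);
  return alignBytes >= elementBytes;
}
```

An equivalent approach would be to simply skip the early return for i1 element types and fall through to the existing strict-align handling; either way the guard keeps the check from constructing an invalid alignment.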

; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s

; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
; XFAIL: *
Collaborator

Rather than XFAILing this test, could you add a not --crash for the second RUN line, and a regular FileChecked output for the first RUN line?

MacDue merged commit 3b17d04 into llvm:main on Dec 16, 2024
8 checks passed
MacDue deleted the fix_sve_loads_strict_align branch on December 16, 2024 at 10:23