
[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align #119732

Merged
MacDue merged 3 commits into llvm:main from fix_sve_loads_strict_align on Dec 16, 2024

Conversation

MacDue (Member) commented Dec 12, 2024

Instead, allow any alignment >= the element size (in bytes). This is all that is needed for (predicated) vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically:

// Check alignment on size of element accessed, not overall access size.
constant integer alignment = if accdesc.ispair then size DIV 2 else size;

The size passed to Mem by SVE load/store instructions is the element size.
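
For illustration, the quoted pseudocode rule can be restated as a tiny standalone sketch (plain C++, not part of the patch; the helper name is made up for this example):

```cpp
#include <cstdint>

// Mirrors the pseudocode above: the alignment requirement is derived from the
// size of the element accessed (halved for pair accesses), not from the
// overall access size.
constexpr uint64_t requiredAlignment(uint64_t accessSizeBytes, bool isPair) {
  return isPair ? accessSizeBytes / 2 : accessSizeBytes;
}

// For SVE ld1w/st1w the size passed to Mem is the 4-byte element size, so a
// <vscale x 4 x i32> load/store only needs 4-byte alignment, even when
// unaligned accesses are disabled.
static_assert(requiredAlignment(4, /*isPair=*/false) == 4,
              "element-size alignment check");
```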

llvmbot (Member) commented Dec 12, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Benjamin Maxwell (MacDue)


Full diff: https://github.com/llvm/llvm-project/pull/119732.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+10)
  • (added) llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll (+64)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index e61dedb2477560..ee882dbae06537 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -2569,6 +2569,16 @@ MVT AArch64TargetLowering::getScalarShiftAmountTy(const DataLayout &DL,
 bool AArch64TargetLowering::allowsMisalignedMemoryAccesses(
     EVT VT, unsigned AddrSpace, Align Alignment, MachineMemOperand::Flags Flags,
     unsigned *Fast) const {
+
+  // Allow SVE loads/stores where the alignment >= the size of the element type,
+  // even with +strict-align. The SVE loads/stores do not require memory to be
+  // aligned more than the element type even without unaligned accesses.
+  // Without this, already aligned loads/stores are forced to have 16-byte alignment,
+  // which is unnecessary and fails to build as TLI.expandUnalignedLoad() and
+  // TLI.expandUnalignedStore() don't yet support scalable vectors.
+  if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
+    return true;
+
   if (Subtarget->requiresStrictAlign())
     return false;
 
diff --git a/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
new file mode 100644
index 00000000000000..d9c7ef5ddc88c0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-load-store-strict-align.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s
+
+define void @nxv16i8(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv16i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.b
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0]
+; CHECK-NEXT:    st1b { z0.b }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 16 x i8>, ptr %ldptr, align 1
+  store <vscale x 16 x i8> %l3, ptr %stptr, align 1
+  ret void
+}
+
+define void @nxv8i16(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv8i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 8 x i16>, ptr %ldptr, align 2
+  store <vscale x 8 x i16> %l3, ptr %stptr, align 2
+  ret void
+}
+
+define void @nxv4i32(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv4i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT:    st1w { z0.s }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 4 x i32>, ptr %ldptr, align 4
+  store <vscale x 4 x i32> %l3, ptr %stptr, align 4
+  ret void
+}
+
+define void @nxv2i64(ptr %ldptr, ptr %stptr) {
+; CHECK-LABEL: nxv2i64:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.d
+; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT:    st1d { z0.d }, p0, [x1]
+; CHECK-NEXT:    ret
+  %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 8
+  store <vscale x 2 x i64> %l3, ptr %stptr, align 8
+  ret void
+}
+
+; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
+; define void @unaligned_nxv2i64(ptr %ldptr, ptr %stptr) {
+; ; CHECK-LABEL: unaligned_nxv2i64:
+; ; CHECK:       // %bb.0:
+; ; CHECK-NEXT:    ptrue p0.d
+; ; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
+; ; CHECK-NEXT:    st1d { z0.d }, p0, [x1]
+; ; CHECK-NEXT:    ret
+;   %l3 = load <vscale x 2 x i64>, ptr %ldptr, align 4
+;   store <vscale x 2 x i64> %l3, ptr %stptr, align 4
+;   ret void
+; }


github-actions bot commented Dec 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

MacDue force-pushed the fix_sve_loads_strict_align branch 2 times, most recently from 163fe1d to 3efd02d on December 12, 2024 at 20:12
[AArch64][SVE] Don't require 16-byte aligned SVE loads/stores with +strict-align

Instead, allow any alignment >= the element size (in bytes). This is all
that is needed for vector loads even if unaligned accesses are disabled.

See: https://developer.arm.com/documentation/ddi0602/2024-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#impl-aarch64.Mem.read.3

Specifically:
```
// Check alignment on size of element accessed, not overall access size.
constant integer alignment = if accdesc.ispair then size DIV 2 else size;
```

The `size` passed to `Mem` by SVE load/store instructions is the
element size.
MacDue force-pushed the fix_sve_loads_strict_align branch from 3efd02d to 4b7fce2 on December 12, 2024 at 20:16
Comment on lines 2574 to 2579
// even with +strict-align. The SVE loads/stores do not require memory to be
// aligned more than the element type even without unaligned accesses.
// Without this, already aligned loads and stores are forced to have 16-byte
// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
Collaborator

Nit on the comment: this is true for SVE's ld1/st1 instructions, but not for SVE str/ldr, as those require the address to be 16-byte aligned for data vectors (and 2-byte aligned for predicate vectors). So there is an assumption here that a store of <vscale x 4 x i32> ends up using st1, which is true in practice if the store comes from the IR.

Member Author

I've rewritten the comment. Is it safe to assume str z* is only used for spills/fills (and possibly via an intrinsic, which I don't think uses these alignment checks)?

Collaborator

For scalable data vectors, ldr/str is only ever used for spills and fills. For predicate registers, there is only str/ldr, so their addresses must be at least 2-byte aligned.

paulwalker-arm (Collaborator) commented Dec 13, 2024

We should avoid doing anything that prevents an expanded use of ldr/str, though. I know I've toyed with it in the past as a way to remove predicate data dependencies.

Not sure if this is relevant, but they could also come in via inline asm.

Member Author

I don't think this prevents an expanded use of ldr/str for operations that are known to be 16-byte aligned. For operations with < 16-byte alignment, any expansion in expandUnalignedLoad/Store would negate any benefit of using ldr/str over predicated loads/stores anyway.

I don't think inline asm is relevant here; these hooks are mainly used for things like StoreSDNode/LoadSDNode.

Collaborator

Great news. Thanks for the update.
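
To summarize the alignment requirements discussed in this thread, here is a rough sketch (plain C++, purely illustrative; the enum and the ldr/str figures reflect the reviewers' statements above rather than the architecture pseudocode):

```cpp
#include <cstdint>

// Predicated ld1/st1: alignment is checked against the element size.
// Unpredicated ldr/str of a data (Z) register: described above as needing
// 16-byte alignment, and in practice only generated for spills/fills.
// Predicate (P) register ldr/str: needs at least 2-byte alignment.
enum class SveAccessKind { Ld1St1, LdrStrData, LdrStrPredicate };

uint64_t requiredAlignBytes(SveAccessKind kind, uint64_t elementSizeBytes) {
  switch (kind) {
  case SveAccessKind::Ld1St1:
    return elementSizeBytes;
  case SveAccessKind::LdrStrData:
    return 16;
  case SveAccessKind::LdrStrPredicate:
    return 2;
  }
  return elementSizeBytes; // unreachable; keeps compilers happy
}
```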

// alignment, which is unnecessary and fails to build as
// TLI.expandUnalignedLoad() and TLI.expandUnalignedStore() don't yet support
// scalable vectors.
if (VT.isScalableVector() && Alignment >= Align(VT.getScalarSizeInBits() / 8))
Collaborator

Align(VT.getScalarSizeInBits() / 8) will fail an assert when the element type is smaller than MVT::i8 (like a predicate's MVT::i1), so this would fail even when the +strict-align feature is not set. Could you add a test for this case?

Not caused by your patch, but I am surprised to see LLVM actually generate regular loads/stores when the alignment is smaller than the element size, e.g. load <4 x i32>, ptr %ptr, align 1 when the +strict-align flag is not set at all. Is this a bug?

Member Author

I don't think it's a bug. Looking at AArch64.UnalignedAccessFaults() (used as part of the FP loads): https://developer.arm.com/documentation/ddi0602/2023-09/Shared-Pseudocode/aarch64-functions-memory?lang=en#AArch64.UnalignedAccessFaults.3 it looks like unaligned accesses are supported (depending on the configuration).

Also the langref states:

The optional constant align argument specifies the alignment of the operation (that is, the alignment of the memory address). It is the responsibility of the code emitter to ensure that the alignment information is correct.
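
As a reference for the assert concern raised above, a minimal standalone sketch (plain C++, not the committed LLVM code; the helper name and parameters are hypothetical) of how the element-size check can avoid producing a zero alignment for predicate vectors, whose i1 elements round down to zero bytes:

```cpp
#include <algorithm>
#include <cstdint>

// Clamp the element size to at least one byte so that i1 (predicate) elements
// cannot yield a zero-byte alignment requirement, which is what trips the
// Align(0) assert in the original patch.
bool sveElementAligned(uint64_t elementSizeBits, uint64_t alignBytes) {
  uint64_t elementBytes = std::max<uint64_t>(elementSizeBits / 8, 1);
  return alignBytes >= elementBytes;
}
```

An equivalent approach would be to simply skip the early return for i1 element types and fall through to the existing strict-align handling; either way the guard keeps the check from constructing an invalid alignment.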

; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve,+strict-align < %s | FileCheck %s

; FIXME: Support TLI.expandUnalignedLoad()/TLI.expandUnalignedStore() for SVE.
; XFAIL: *
Collaborator

Rather than XFAILing this test, could you add a not --crash for the second RUN line, and a regular FileChecked output for the first RUN line?

MacDue merged commit 3b17d04 into llvm:main on Dec 16, 2024
8 checks passed
MacDue deleted the fix_sve_loads_strict_align branch on December 16, 2024 at 10:23