
feat(ml): rocm #16613

Open: wants to merge 24 commits into base: main
Conversation

mertalev
Contributor

@mertalev mertalev commented Mar 5, 2025

Description

This PR introduces support for AMD GPUs through ROCm. It's a rebased version of #11063 with updated dependencies.

It also once again removes algo caching, as the concurrency issue with caching seems to be more subtle than originally thought. While disabling caching is wasteful (it essentially runs a benchmark every time instead of only once), it's still better than the current alternative of either lowering concurrency to 1 or not having ROCm support.

Zelnes and others added 7 commits March 5, 2025 09:37
use 3.12

use 1.19.2
guard algo benchmark results

mark mutex as mutable

re-add /bin/sh (?)

use 3.10

use 6.1.2
1.19.2

fix variable name

fix variable reference

aaaaaaaaaaaaaaaaaaaa
@mertalev mertalev requested a review from bo0tzz as a code owner March 5, 2025 14:46
@github-actions github-actions bot added documentation Improvements or additions to documentation 🧠machine-learning labels Mar 5, 2025
@mertalev mertalev added changelog:feature and removed documentation Improvements or additions to documentation labels Mar 5, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 5, 2025
Contributor

github-actions bot commented Mar 5, 2025

📖 Documentation deployed to pr-16613.preview.immich.app

steps:
- name: Login to GitHub Container Registry
@NicholasFlamy (Member) Mar 5, 2025

There are some changes in indentation, as well as changes from double quotes to single quotes. Was this intended? I know it's from the first commit of the original PR, but I don't think it was ever addressed.

mertalev (Contributor, Author)

VS Code did this when I saved. I'm not sure why it's different.

@NicholasFlamy (Member) Mar 5, 2025

Is there a PR check that runs prettier on the workflow files? I would think the inconsistency exists because there likely isn't.


WORKDIR /code

RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
@NicholasFlamy (Member) Mar 7, 2025

Suggested change
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx-dev

Only migraphx-dev is needed, as the other two are dependencies of it.

Edit: don't change it now, though, because it's already building.

@@ -80,11 +111,14 @@ COPY --from=builder-armnn \
/opt/ann/build.sh \
/opt/armnn/

FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS prod-rocm
Member

I know there were already comments on this, but I think copying the deps manually may result in a smaller, yet still working image. It might be worth re-investigating.

@@ -15,6 +15,34 @@ RUN mkdir /opt/armnn && \
cd /opt/ann && \
sh build.sh

# Warning: 25GiB+ disk space required to pull this image
# TODO: find a way to reduce the image size
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS builder-rocm

This comment was marked as resolved.

Member

Nope. Not it.

@przemekbialek

They're inconsistent, and they define "supported" as "our team will help you on GitHub with certain stuff"; anything not on the list may still work (e.g. Vega GPUs work fine), but they won't help you with it.

Yeah, but the official ROCm build will not work with gfx1103 at all: applications built against it (e.g. prebuilt PyTorch) will not work with gfx1103, and building against it for gfx1103 will not work either. I'm not sure what the exact steps are to get gfx1103 into ROCm, but I do know it requires a custom build/version of ROCm. And while, as you said, AMD's stance is "it may work but we won't help you out", that does not mean it will work without this custom ROCm build.

Edit: So my question would be, how does one check what's supported by the build they are running?

I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:

$ ls /usr/lib64/rocm/
gfx10  gfx11  gfx1100  gfx1103  gfx8  gfx9  gfx90a  gfx942
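One rough way to answer "what's supported by the build I'm running" is to list the Tensile code-object files a rocBLAS install ships, one per gfx target. A hedged sketch: the default path below is an assumption based on the usual ROCm container layout (on Fedora the files sit under the per-arch folders shown above), so set `TENSILE_DIR` to match your install:

```shell
# Sketch: print the gfx targets a rocBLAS build ships, by scanning its
# TensileLibrary_lazy_gfx*.dat files. The path is an assumption (typical
# Ubuntu/ROCm container layout) and may differ on your distribution.
TENSILE_DIR="${TENSILE_DIR:-/opt/rocm/lib/rocblas/library}"
for f in "$TENSILE_DIR"/TensileLibrary_lazy_gfx*.dat; do
  [ -e "$f" ] || continue               # glob matched nothing
  arch="${f##*/TensileLibrary_lazy_}"   # strip directory and prefix
  echo "${arch%.dat}"                   # strip suffix -> e.g. gfx1102
done
```

Note this only tells you what rocBLAS itself was built for; as discussed below, onnxruntime has its own, shorter arch list.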

Fedora's rocBLAS patch for gfx1103 support looks like a copy of gfx1102 (navi33); only the names and ISA versions differ. I diffed a few of the files, and I think these are the only differences:

-- phoenix
-- gfx1103
-- [Device 1586]
+- navi33
+- gfx1102
+- [Device 73f0]
 - AllowNoFreeDims: false
   AssignedDerivedParameters: true
   Batched: true
@@ -112,7 +112,7 @@
     GroupLoadStore: false
     GuaranteeNoPartialA: false
     GuaranteeNoPartialB: false
-    ISA: [11, 0, 3]
+    ISA: [11, 0, 2]

I'm interested in additional GPU support because I have a mini PC with a Ryzen 8845HS (Radeon 780M) for testing, and a second one with a Ryzen 5825U.
I tried running the ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm image with HSA_OVERRIDE_GFX_VERSION=11.0.0, but this setup crashes my card under heavy load (only the default Immich models work, and only when I run one type of job in a single thread). I read that for the 780M the best choice is gfx1102, but when I set HSA_OVERRIDE_GFX_VERSION=11.0.2 I get errors. I think that's because onnxruntime isn't compiled with support for this arch. I'm now trying to build machine-learning with ROCm onnxruntime support, with a small patch that I think enables gfx900 and gfx1102 support in onnxruntime; if and when the build completes, I will try it.

diff --git a/cmake/CMakeLists.txt b/cmake/CMakeLists.txt
index d90a2a355..bb1a7de12 100644
--- a/cmake/CMakeLists.txt
+++ b/cmake/CMakeLists.txt
@@ -295,7 +295,7 @@ if (onnxruntime_USE_ROCM)
   endif()

   if (NOT CMAKE_HIP_ARCHITECTURES)
-    set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942;gfx1200;gfx1201")
+    set(CMAKE_HIP_ARCHITECTURES "gfx900;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102;gfx940;gfx941;gfx942;gfx1200;gfx1201")
   endif()

   file(GLOB rocm_cmake_components ${onnxruntime_ROCM_HOME}/lib/cmake/*)

@SharkWipf

but this setup crashes my card under heavy load

My 780m locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich that does not bode well...

@NicholasFlamy
Member

HSA_OVERRIDE_GFX_VERSION=11.0.2

This is not a valid version from what I've observed. So far, there are only 3 valid options:

HSA_OVERRIDE_GFX_VERSION=11.0.0
HSA_OVERRIDE_GFX_VERSION=10.3.0
HSA_OVERRIDE_GFX_VERSION=9.0.0

@przemekbialek

but this setup crashes my card under heavy load

My 780m locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich that does not bode well...

Unfortunately, adding support for gfx1102 doesn't solve the crashing problems on the Radeon 780M, but I'm happy because I succeeded in getting it to work on the Ryzen 5825U's GPU.

@NicholasFlamy
Member

Radeon 780M

They also specifically say certain iGPUs crash. I would bet that they're just bleeding edge.

Ryzen 5825U GPU

That model or similar is known to work.

@przemekbialek

przemekbialek commented Mar 9, 2025

HSA_OVERRIDE_GFX_VERSION=11.0.2

This is not a valid version from what I've observed. So far, there are only 3 valid options:

HSA_OVERRIDE_GFX_VERSION=11.0.0
HSA_OVERRIDE_GFX_VERSION=10.3.0
HSA_OVERRIDE_GFX_VERSION=9.0.0

The ROCm in the image created by this PR is compiled for the arches listed below, so 11.0.2 is a valid option, because it means gfx1102. Here is a directory listing from the image:

-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1010.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1012.dat
-rw-r--r-- 1 root root     23026 Dec 11 10:06 TensileLibrary_lazy_gfx1030.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1100.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1101.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1102.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1151.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1200.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1201.dat
-rw-r--r-- 1 root root     26537 Dec 11 10:06 TensileLibrary_lazy_gfx900.dat
-rw-r--r-- 1 root root     31798 Dec 11 10:06 TensileLibrary_lazy_gfx906.dat
-rw-r--r-- 1 root root     34732 Dec 11 10:06 TensileLibrary_lazy_gfx908.dat
-rw-r--r-- 1 root root     62265 Dec 11 10:06 TensileLibrary_lazy_gfx90a.dat
-rw-r--r-- 1 root root     58949 Dec 11 10:06 TensileLibrary_lazy_gfx942.dat

Without a patch to onnxruntime, HSA_OVERRIDE_GFX_VERSION=9.0.0 isn't a valid option in immich-machine-learning, because that arch isn't compiled by default.
By default, onnxruntime builds for these arches:

set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942;gfx1200;gfx1201")
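As an aside, the override string appears to be just the gfx target name split into `<major>.<minor>.<step>`, with the trailing step character read as a hex digit; that's why 11.0.2 corresponds to gfx1102. A sketch of this observed convention (`gfx_to_override` is a made-up helper name, and the mapping is inferred from the examples in this thread, not taken from AMD documentation):

```shell
# Sketch of the apparent naming convention: gfx<major><minor><step>, where the
# trailing step character is a hex digit (so gfx90a would read as 9.0.10).
# Observed behavior only; gfx_to_override is a hypothetical helper name.
gfx_to_override() {
  rest="${1#gfx}"                                # e.g. 1102, 900, 90a
  step=$(printf '%d' "0x${rest#"${rest%?}"}")    # last char, hex -> decimal
  rest="${rest%?}"
  minor="${rest#"${rest%?}"}"                    # next-to-last char
  major="${rest%?}"                              # leading digit(s)
  echo "${major}.${minor}.${step}"
}

gfx_to_override gfx1102   # 11.0.2
gfx_to_override gfx900    # 9.0.0
gfx_to_override gfx90a    # 9.0.10
```

Whether a given override actually works still depends on that arch being present in both rocBLAS and the onnxruntime build, as discussed above.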

@przemekbialek

I created my image from commit 1c550fa with the following changes:

diff --git a/machine-learning/Dockerfile b/machine-learning/Dockerfile
index f3b7c60c4..253430cac 100644
--- a/machine-learning/Dockerfile
+++ b/machine-learning/Dockerfile
@@ -1,4 +1,4 @@
-ARG DEVICE=cpu
+ARG DEVICE=rocm

 FROM python:3.11-bookworm@sha256:68a8863d0625f42d47e0684f33ca02f19d6094ef859a8af237aaf645195ed477 AS builder-cpu

@@ -36,10 +36,12 @@ WORKDIR /code/onnxruntime
 # TODO: find a way to fix this without disabling algo caching
 COPY ./patches/0001-disable-rocm-conv-algo-caching.patch /tmp/
 RUN git apply /tmp/0001-disable-rocm-conv-algo-caching.patch
+COPY ./patches/onnxruntime_add_gfx900_gfx1102.patch /tmp/
+RUN git apply /tmp/onnxruntime_add_gfx900_gfx1102.patch

 RUN /bin/sh ./dockerfiles/scripts/install_common_deps.sh
 # Note: the `parallel` setting uses a substantial amount of RAM
-RUN ./build.sh --allow_running_as_root --config Release --build_wheel --update --build --parallel 17 --cmake_extra_defines\
+RUN ./build.sh --allow_running_as_root --config Release --build_wheel --update --build --parallel 16 --cmake_extra_defines\
     ONNXRUNTIME_VERSION=1.20.1 --use_rocm --rocm_home=/opt/rocm
 RUN mv /code/onnxruntime/build/Linux/Release/dist/*.whl /opt/

diff --git a/machine-learning/patches/onnxruntime_add_gfx900_gfx1102.patch b/machine-learning/patches/onnxruntime_add_gfx900_gfx1102.patch
new file mode 100644
index 000000000..81bbdb3d6
--- /dev/null
+++ b/machine-learning/patches/onnxruntime_add_gfx900_gfx1102.patch
@@ -0,0 +1,13 @@
+diff --git a/cmake/CMakeLists.txt b/cmake/CMakeLists.txt
+index d90a2a355..bb1a7de12 100644
+--- a/cmake/CMakeLists.txt
++++ b/cmake/CMakeLists.txt
+@@ -295,7 +295,7 @@ if (onnxruntime_USE_ROCM)
+   endif()
+
+   if (NOT CMAKE_HIP_ARCHITECTURES)
+-    set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942;gfx1200;gfx1201")
++    set(CMAKE_HIP_ARCHITECTURES "gfx900;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102;gfx940;gfx941;gfx942;gfx1200;gfx1201")
+   endif()
+
+   file(GLOB rocm_cmake_components ${onnxruntime_ROCM_HOME}/lib/cmake/*)

Probably a better way is to set CMAKE_HIP_ARCHITECTURES in the Dockerfile, but I didn't know how to do that, so I brute-forced it ;)

@NicholasFlamy
Member

NicholasFlamy commented Mar 9, 2025

Alright, so I learned something. Some gfx versions such as gfx1031 (RX 6700 XT, my GPU) are bundled with other versions such as gfx1030, while others are not. I had thought there were only 3 bundles, but now I know there are more.

This PR should have these versions as of ghcr.io/immich-app/immich-machine-learning:commit-ded3bbb033615385a2339e27171b6ab682a31df9-rocm:

-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1010.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1012.dat
-rw-r--r-- 1 root root     23026 Dec 11 10:06 TensileLibrary_lazy_gfx1030.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1100.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1101.dat
-rw-r--r-- 1 root root     24186 Dec 11 10:06 TensileLibrary_lazy_gfx1102.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1151.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1200.dat
-rw-r--r-- 1 root root     17653 Dec 11 10:06 TensileLibrary_lazy_gfx1201.dat
-rw-r--r-- 1 root root     26537 Dec 11 10:06 TensileLibrary_lazy_gfx900.dat
-rw-r--r-- 1 root root     31798 Dec 11 10:06 TensileLibrary_lazy_gfx906.dat
-rw-r--r-- 1 root root     34732 Dec 11 10:06 TensileLibrary_lazy_gfx908.dat
-rw-r--r-- 1 root root     62265 Dec 11 10:06 TensileLibrary_lazy_gfx90a.dat
-rw-r--r-- 1 root root     58949 Dec 11 10:06 TensileLibrary_lazy_gfx942.dat
Edit: I also experimented on my system with a Python script, setting different values:

HSA_OVERRIDE_GFX_VERSION=10.3.1:

rocBLAS error: Cannot read /home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1031
 List of available TensileLibrary Files : 
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/home/nicholas/development/image-rotation-AI-classification/resnet-ixion/.venv/lib/python3.10/site-packages/torch/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"

HSA_OVERRIDE_GFX_VERSION=11.0.2, or anything else on the list, resulted in my system locking up, because AMD.

@przemekbialek

przemekbialek commented Mar 9, 2025

Alright, so I learned something. Some gfx versions such as gfx1031 (RX 6700 XT, my GPU) are bundled with other versions such as gfx1030, while others are not. I had thought there were only 3 bundles, but now I know there are more.

Some arches are built into ROCm and then work without HSA_OVERRIDE_GFX_VERSION. If your arch isn't supported, you may try an arch override. Sometimes it works.

This PR should have these versions as of ghcr.io/immich-app/immich-machine-learning:commit-ded3bbb033615385a2339e27171b6ab682a31df9-rocm

Yes. Support in ROCm doesn't mean that there is support in other software. I added only two for myself, because those are the ones I can test on my hardware. In onnxruntime, the default arch list is shorter than in ROCm. Adding two arches to my image adds about 0.1 GB to the size of the original one:

immich-machine-learning-rocm-agfx            latest                                                 9bd3da10dee5   2 hours ago    31.9GB
ghcr.io/immich-app/immich-machine-learning   commit-ded3bbb033615385a2339e27171b6ab682a31df9-rocm   9f0b5a801e8a   3 days ago     31.8GB

@NicholasFlamy
Member

Probably a better way is to set CMAKE_HIP_ARCHITECTURES in the Dockerfile, but I didn't know how to do that, so I brute-forced it ;)

I might try that. I could pass that into the Dockerfile.

@przemekbialek

but this setup crashes my card under heavy load

My 780m locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich that does not bode well...

Finally, I found a workaround for the 780M. When I set the HSA_USE_SVM=0 environment variable, the crashes are gone. I can use HSA_OVERRIDE_GFX_VERSION=11.0.2 or HSA_OVERRIDE_GFX_VERSION=11.0.0, and everything works in immich-machine-learning as expected. I can set concurrency for Smart Search and Face Detection to 5 and set a bigger multilingual model for Smart Search, and it all works, but I don't use a graphical environment on this machine.
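To make the workaround concrete, here is a hypothetical compose fragment; the service name and device passthrough follow the stock Immich ROCm setup, and the image tag is the PR build discussed above. Adjust the override value for your GPU:

```yaml
services:
  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm
    devices:
      - /dev/kfd   # ROCm compute interface
      - /dev/dri   # GPU render nodes
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.2  # or 11.0.0; both reported working on the 780M
      - HSA_USE_SVM=0                    # workaround for iGPU crashes under heavy load
```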

@NicholasFlamy
Member

NicholasFlamy commented Mar 10, 2025

When I set HSA_USE_SVM=0 environment variable crashes are gone.

I might have to try that on my RX 6700 XT lol.

Edit: It still has a few buggy things when running this PR.

@przemekbialek

I tested this PR once again. I used the latest commit with the newer ROCm version, but rolled back the MIGraphX changes, because they crash for me. I also added the gfx900 and gfx1102 arches to onnxruntime, this time the proper way. Below are the changes I made:

diff --git a/machine-learning/Dockerfile b/machine-learning/Dockerfile
index 216b15fca..51f60ab8c 100644
--- a/machine-learning/Dockerfile
+++ b/machine-learning/Dockerfile
@@ -1,4 +1,4 @@
-ARG DEVICE=cpu
+ARG DEVICE=rocm

 FROM python:3.11-bookworm@sha256:68a8863d0625f42d47e0684f33ca02f19d6094ef859a8af237aaf645195ed477 AS builder-cpu

@@ -40,7 +40,7 @@ RUN git apply /tmp/0001-disable-rocm-conv-algo-caching.patch
 RUN /bin/sh ./dockerfiles/scripts/install_common_deps.sh
 # Note: the `parallel` setting uses a substantial amount of RAM
 RUN ./build.sh --allow_running_as_root --config Release --build_wheel --update --build --parallel 17 --cmake_extra_defines\
-    ONNXRUNTIME_VERSION=1.20.1 --skip_tests --use_migraphx --migraphx_home=/opt/rocm
+    ONNXRUNTIME_VERSION=1.20.1 CMAKE_HIP_ARCHITECTURES="gfx900;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102;gfx940;gfx941;gfx942;gfx1200;gfx1201" --skip_tests --use_rocm --rocm_home=/opt/rocm
 RUN mv /code/onnxruntime/build/Linux/Release/dist/*.whl /opt/

 FROM builder-${DEVICE} AS builder
@@ -118,7 +118,7 @@ FROM prod-${DEVICE} AS prod
 ARG DEVICE

 RUN apt-get update && \
-    apt-get install -y --no-install-recommends tini $(if ! [ "$DEVICE" = "openvino" ] && ! [ "$DEVICE" = "rocm" ]; then echo "libmimalloc2.0"; fi) $(if [ "$DEVICE" = "rocm" ]; then echo "migraphx"; fi) && \
+    apt-get install -y --no-install-recommends tini $(if ! [ "$DEVICE" = "openvino" ] && ! [ "$DEVICE" = "rocm" ]; then echo "libmimalloc2.0"; fi) && \
     apt-get autoremove -yqq && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*
diff --git a/machine-learning/app/models/constants.py b/machine-learning/app/models/constants.py
index 5824cd6c5..43088741b 100644
--- a/machine-learning/app/models/constants.py
+++ b/machine-learning/app/models/constants.py
@@ -65,7 +65,7 @@ _INSIGHTFACE_MODELS = {

 SUPPORTED_PROVIDERS = [
     "CUDAExecutionProvider",
-    "MIGraphXExecutionProvider",
+    "ROCMExecutionProvider",
     "OpenVINOExecutionProvider",
     "CPUExecutionProvider",
 ]
diff --git a/machine-learning/app/sessions/ort.py b/machine-learning/app/sessions/ort.py
index 00c7ad50a..d15f2d354 100644
--- a/machine-learning/app/sessions/ort.py
+++ b/machine-learning/app/sessions/ort.py
@@ -88,7 +88,7 @@ class OrtSession:
             match provider:
                 case "CPUExecutionProvider":
                     options = {"arena_extend_strategy": "kSameAsRequested"}
-                case "CUDAExecutionProvider":
+                case "CUDAExecutionProvider" | "ROCMExecutionProvider":
                     options = {"arena_extend_strategy": "kSameAsRequested", "device_id": settings.device_id}
                 case "OpenVINOExecutionProvider":
                     options = {

The configuration above was tested on the hardware listed below:

  • Radeon RX 6800 XT (gfx1030): everything works without tinkering.

  • Ryzen 8845HS with Radeon 780M (gfx1103): I used the HSA_OVERRIDE_GFX_VERSION=11.0.2 and HSA_OVERRIDE_GFX_VERSION=11.0.0 environment variables; to work around the crashes, I had to set HSA_USE_SVM=0.

  • Ryzen 5825U (gfx90c): I used the HSA_OVERRIDE_GFX_VERSION=9.0.0 environment variable.

When everything is set as above, I have no problems running Smart Search and Facial Detection in parallel with job concurrency set to 5 (2 on the Ryzen 5825U) for both jobs. I also tested with bigger models, and that works too.
All tests ran on the same collection: 1,068 photos (3 GiB) and 144 videos (75 GiB).


On the MIGraphX image, ML workers crashed instantly, even on the Radeon RX 6800 XT:

immich_machine_learning  | [03/10/25 20:44:56] ERROR    Worker (pid:542) was sent code 139!

In dmesg I see:

[ 2309.167940] gunicorn[27327]: segfault at 0 ip 00007f9405d7231c sp 00007f94337f9710 error 4 in libmigraphx.so.2011000.0.60304[1fb531c,7f9405a48000+1c0e000] likely on CPU 22 (core 6, socket 0)
[ 2309.167950] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc 55 41 57 41 56 53 48 83 ec 38 48 89 d3 49 89 fe 48 8d 7a 10 e8 c7 37 84 01 48 8b 30 <48> 8b 06 4c 8d 7c 24 08 4c 89 ff ff 90 a0 00 00 00 4c 89 ff 4c 89
[ 2324.523610] gunicorn[27881]: segfault at 27 ip 00007f93664586ba sp 00007f94413f49f0 error 4 in libmigraphx.so.2011000.0.60304[1d976ba,7f936634c000+1c0e000] likely on CPU 5 (core 5, socket 0)
[ 2324.523619] Code: 48 8d 47 68 c3 66 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 48 89 fb 48 8b 36 48 8b 06 <ff> 50 20 48 89 d8 5b c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[ 2339.303738] gunicorn[28525]: segfault at 20 ip 00007f9364b726ba sp 00007f942bff99f0 error 4 in libmigraphx.so.2011000.0.60304[1d976ba,7f9364a66000+1c0e000] likely on CPU 1 (core 1, socket 0)
[ 2339.303747] Code: 48 8d 47 68 c3 66 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 48 89 fb 48 8b 36 48 8b 06 <ff> 50 20 48 89 d8 5b c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

@NicholasFlamy
Member

On the MIGraphX image, ML workers crashed instantly, even on the Radeon RX 6800 XT:

Hmm, okay, do you have MIGraphX installed on the host? For me it doesn't even get there, because it errors out about the MIGraphX dependency.

@przemekbialek

przemekbialek commented Mar 10, 2025

On the MIGraphX image, ML workers crashed instantly, even on the Radeon RX 6800 XT:

Hmm, okay, do you have MIGraphX installed on the host? For me it doesn't even get there, because it errors out about the MIGraphX dependency.

No. I don't know MIGraphX, and I assumed that everything needed is in the Docker image.

I found a command to test the MIGraphX installation, and it runs in the Docker image without problems:

Running [ MIGraphX Version: 2.11.0.4b20cbc9 ]: /opt/rocm-6.3.4/bin/migraphx-driver perf --test
Compiling ...
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
output = @param:output -> float_type, {4, 3}, {3, 1}
b = @param:b -> float_type, {5, 3}, {3, 1}
a = @param:a -> float_type, {4, 5}, {5, 1}
@4 = gpu::code_object[code_object=4328,symbol_name=mlir_dot,global=128,local=128,](a,b,output) -> float_type, {4, 3}, {3, 1}


Allocating params ...
Running performance report ...
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}: 0.00036214ms, 2%
output = @param:output -> float_type, {4, 3}, {3, 1}: 0.00027652ms, 2%
b = @param:b -> float_type, {5, 3}, {3, 1}: 0.00025078ms, 2%
a = @param:a -> float_type, {4, 5}, {5, 1}: 0.00025372ms, 2%
@4 = gpu::code_object[code_object=4328,symbol_name=mlir_dot,global=128,local=128,](a,b,output) -> float_type, {4, 3}, {3, 1}: 0.0187583ms, 95%

Summary:
gpu::code_object::mlir_dot: 0.0187583ms / 1 = 0.0187583ms, 95%
@param: 0.00078102ms / 3 = 0.00026034ms, 4%
check_context::migraphx::gpu::context: 0.00036214ms / 1 = 0.00036214ms, 2%

Batch size: 1
Rate: 51131.3 inferences/sec
Total time: 0.0195575ms
Total instructions time: 0.0199014ms
Overhead time: 0.0006051ms, -0.00034396ms
Overhead: 3%, -2%
[ MIGraphX Version: 2.11.0.4b20cbc9 ] Complete: /opt/rocm-6.3.4/bin/migraphx-driver perf --test

I found that Smart Search alone works with MIGraphX, even with concurrency 5. The problem is with Face Detection.

@NicholasFlamy
Member

I found that Smart Search alone works with MIGraphX, even with concurrency 5.

Oh wow, good job getting it running!

The problem is with Face Detection.

I have problems with Face Detection with regular ROCm too.

@przemekbialek

I found that Smart Search alone works with MIGraphX, even with concurrency 5.

Oh wow, good job getting it running!

Thanks go to @mertalev. I only pulled image from this PR and tested it.

The problem is with Face Detection.

I have problems with Face Detection with regular ROCm too.

What problems do you have?

@ricklahaye

ricklahaye commented Mar 11, 2025

I found that Smart Search alone works with MIGraphX, even with concurrency 5.

Oh wow, good job getting it running!

Thanks go to @mertalev. I only pulled image from this PR and tested it.

The problem is with Face Detection.

I have problems with Face Detection with regular ROCm too.

What problems do you have?

Smart Search seems to run correctly.

Face detection fails:

immich_machine_learning  | [03/11/25 16:50:25] INFO     Setting execution providers to
immich_machine_learning  |                              ['MIGraphXExecutionProvider',
immich_machine_learning  |                              'CPUExecutionProvider'], in descending order of
immich_machine_learning  |                              preference
immich_machine_learning  | [03/11/25 16:50:25] DEBUG    Setting execution provider options to [{},
immich_machine_learning  |                              {'arena_extend_strategy': 'kSameAsRequested'}]
immich_machine_learning  | [03/11/25 16:50:25] DEBUG    Setting execution_mode to ORT_SEQUENTIAL
immich_machine_learning  | [03/11/25 16:50:25] DEBUG    Setting inter_op_num_threads to 0
immich_machine_learning  | [03/11/25 16:50:25] DEBUG    Setting intra_op_num_threads to 0
immich_machine_learning  | [03/11/25 16:50:29] ERROR    Worker (pid:9) was sent code 139!
immich_machine_learning  | [03/11/25 16:50:29] INFO     Booting worker with pid: 79
immich_machine_learning  | [03/11/25 16:50:30] DEBUG    Could not load ANN shared libraries, using ONNX:
immich_machine_learning  |                              libmali.so: cannot open shared object file: No such
immich_machine_learning  |                              file or directory
immich_machine_learning  | [03/11/25 16:50:31] INFO     Started server process [79]
immich_machine_learning  | [03/11/25 16:50:31] INFO     Waiting for application startup.

@mertalev
Contributor Author

mertalev commented Mar 11, 2025

For those saying Smart Search runs correctly with MIGraphX, can you confirm that GPU utilization is high? (It's okay for it to initially have high CPU utilization due to compilation, but it should eventually be primarily GPU utilization.)

@ricklahaye

ricklahaye commented Mar 11, 2025

For those saying smart search runs correctly with MIGraphX, can you confirm that GPU utilization is high (it's okay for it to initially have high CPU utilization due to compilation, but it should eventually be primarily GPU utilization).

I checked, and it's indeed using the CPU and not the GPU! Apologies for my previous statement saying 'it works'.
I don't see any GPU utilization at all during Smart Search.

@przemekbialek

For those saying smart search runs correctly with MIGraphX, can you confirm that GPU utilization is high (it's okay for it to initially have high CPU utilization due to compilation, but it should eventually be primarily GPU utilization).

I checked, and it's indeed using the CPU and not the GPU! Apologies for my previous statement saying 'it works'. I don't see any GPU utilization at all during Smart Search.

Same for me. Only CPU utilization on migraphx image.

@NicholasFlamy
Member

For those saying smart search runs correctly with MIGraphX, can you confirm that GPU utilization is high (it's okay for it to initially have high CPU utilization due to compilation, but it should eventually be primarily GPU utilization).

I checked, and it's indeed using the CPU and not the GPU! Apologies for my previous statement saying 'it works'. I don't see any GPU utilization at all during Smart Search.

Same for me. Only CPU utilization on migraphx image.

Yes, this was my issue. The error I had was resolved by a commit that mert added after I hit it. I forgot that the error had been fixed, and then it was only using the CPU.

@ricklahaye

I do want to add for everyone that, when trying the later commit, hardware acceleration works for Smart Search and Face Detection!

I tried the original image/PR ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm and that did not work, but later commits did work for me!

The one I used was ghcr.io/immich-app/immich-machine-learning:commit-ded3bbb033615385a2339e27171b6ab682a31df9-rocm

I did add the following environment variables: HSA_OVERRIDE_GFX_VERSION=11.0.0 and HSA_USE_SVM=0

@NicholasFlamy
Member

NicholasFlamy commented Mar 11, 2025

The one I used was ghcr.io/immich-app/immich-machine-learning:commit-ded3bbb033615385a2339e27171b6ab682a31df9-rocm

Yep, that version works, but have you tried 100 images with Smart Search and Face Detection running at the same time? I hit a deadlock doing that.

Also, what GPU and what OS are you running?

@przemekbialek

For those saying smart search runs correctly with MIGraphX, can you confirm that GPU utilization is high (it's okay for it to initially have high CPU utilization due to compilation, but it should eventually be primarily GPU utilization).

I tried to run some tests with the MIGraphX image: I ran docker exec into it, then pip install transformers psutil torch, and then ran:

python3 -m onnxruntime.transformers.benchmark -g -m bert-base-cased --provider migraphx

This command fails with the following error:

Please install onnxruntime-gpu or onnxruntime-directml package instead of onnxruntime, and use a machine with GPU for testing gpu performance.

I'm not an expert and don't know what I'm doing, but maybe this will help you. :D

Labels
changelog:feature documentation Improvements or additions to documentation 🧠machine-learning
9 participants