-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate memory R/W operations #151
Conversation
Utilize uftrace tool to analyze function tracing. Apply the following changes: --- a/Makefile
+++ b/Makefile
@@ -7,6 +7,7 @@ BIN := $(OUT)/rv32emu
CFLAGS = -std=gnu99 -O2 -Wall -Wextra
CFLAGS += -Wno-unused-label
CFLAGS += -include src/common.h
+CFLAGS += -g -pg
# Set the default stack pointer
CFLAGS += -D DEFAULT_STACK_ADDR=0xFFFFF000
@@ -80,7 +81,7 @@ endif
# For tail-call elimination, we need a specific set of build flags applied.
# FIXME: On macOS + Apple Silicon, -fno-stack-protector might have a negative impact.
-$(OUT)/emulate.o: CFLAGS += -fomit-frame-pointer -fno-stack-check -fno-stack-protector
+$(OUT)/emulate.o: CFLAGS += -fno-stack-check -fno-stack-protector
# Clear the .DEFAULT_GOAL special variable, so that the following turns
# to the first target after .DEFAULT_GOAL is not set. And then Run $ uftrace build/rv32emu build/hello.elf The reference output:
You shall show the corresponding difference based on function tracing. |
d9e1c2f
to
03e9b94
Compare
Based on the function tracing results from the uftrace tool, in the original memory design, sworiginal memory design
proposed memory design
lworiginal memory design
proposed memory design
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refine the git commit message by mentioning uftrace comparisons.
Tested under Apple M1: [ original ] [ proposed ] That is, 31% speedup. |
03e9b94
to
c065325
Compare
You shall refer the operating system by Apple to macOS rather than MacOS. It is a matter of brand identity. |
@@ -8,11 +8,8 @@ | |||
#include <stdint.h> | |||
|
|||
typedef struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some comments to describe the interaction between RISC-V core and the memory subsystem.
Always use the pair "original" vs. "proposed" while you are about to propose non-trivial changes by comprehensive comparisons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refine the messages.
Both "memory I/O" and "memory read/write" are commonly used terms in the field of computer architecture and memory management.
If you want to specifically focus on the operations of reading from and writing to memory, "memory read/write" is a more specific term. It highlights the fundamental actions of retrieving data from memory (read) and storing data in memory (write).
In this case, consider the context to use "memory read/write" (or "memory R/W" for short) which term aligns better.
In the previous memory design, we did not allocate any memory initially. As a result, we had to check whether memory was allocated before reading or writing data, leading to a deterioration in the performance of memory I/O instructions. To overcome this limitation, we have implemented a new memory design that directly maps a memory size of 2^32 bytes. With this enhancement, every memory R/W operation can directly access memory without the need for any checks. The following statistics demonstrate the significant performance improvement achieved by the new memory design for memory I/O instructions. | Environment | Metric | Old mem desgin | new mem desgin |Speedup| |------------------+----------+----------------+----------------+-------| |Core i7-8700 Linux| Dhrystone| 1151.44 DMIPS | 1374.20 DMIPS | +19.3%| | Apple M1 macOS | Dhrystone| 1246.44 DMIPS | 1638.12 DMIPS | +31.0%| Based on the function tracing results from the uftrace tool, in the original memory design, do_lw takes approximately 9 us to allocate data memory and read data. Even when it doesn't require allocating data memory, it still takes around 1.7 us to read data. However, with the proposed design, we only need about 1 us to read data without allocating any data memory. So does do_sw * original memory design [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.265 us [3659971] | rv_userdata(); 0.255 us [3659971] | memory_read_w(); 9.626 us [3659971] | } /* on_mem_read_w */ ... [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.260 us [3659971] | rv_userdata(); 0.235 us [3659971] | memory_read_w(); 1.485 us [3659971] | } /* on_mem_read_w */ * proposed memory design [3658503] | do_lw() { [3658503] | on_mem_read_w() { 0.265 us [3658503] | memory_read_w(); 1.025 us [3658503] | } /* on_mem_read_w */ TODO: inline the essential memory operation function
c065325
to
dc95ca4
Compare
@qwe661234, The commit messages have been amended. For future pull requests, it is recommended to use more professional commit messages. General statements should avoid the use of 'we'. |
In the previous memory design, memory allocation was not initially performed, which led to the necessity of checking whether memory had been allocated before reading or writing data. This situation resulted in a deterioration in the performance of memory read/write instructions. To overcome this limitation, a new memory design has been implemented, which directly maps a memory size of 2^32 bytes. With this enhancement, every memory read/write operation can directly access memory without the need for any checks. Below are Dhrystone statistics (DMIPS) that demonstrate the performance improvement achieved by the new memory design in handling memory R/W instructions. | Environment | original | proposed | Speedup | |----------------+----------+----------+---------+ | i7-8700 Linux | 1151.44 | 1374.20 | +19.3% | | Apple M1 macOS | 1246.44 | 1638.12 | +31.0% | Based on the function tracing results obtained from the uftrace tool, in the original memory design, the 'do_lw' function takes approximately 9 us to allocate data memory and read data. Even when data memory allocation is not required, it still takes around 1.7 us to read data. However, with the proposed design, reading data without allocating any data memory only takes about 1 us. The same improvement applies to the 'do_sw' function as well. * original memory R/W [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.265 us [3659971] | rv_userdata(); 0.255 us [3659971] | memory_read_w(); 9.626 us [3659971] | } /* on_mem_read_w */ ... [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.260 us [3659971] | rv_userdata(); 0.235 us [3659971] | memory_read_w(); 1.485 us [3659971] | } /* on_mem_read_w */ * proposed memory R/W [3658503] | do_lw() { [3658503] | on_mem_read_w() { 0.265 us [3658503] | memory_read_w(); 1.025 us [3658503] | } /* on_mem_read_w */ TODO: inline the essential memory operations
In the previous memory design, memory allocation was not initially performed, which led to the necessity of checking whether memory had been allocated before reading or writing data. This situation resulted in a deterioration in the performance of memory read/write instructions. To overcome this limitation, a new memory design has been implemented, which directly maps a memory size of 2^32 bytes. With this enhancement, every memory read/write operation can directly access memory without the need for any checks. Below are Dhrystone statistics (DMIPS) that demonstrate the performance improvement achieved by the new memory design in handling memory R/W instructions. | Environment | original | proposed | Speedup | |----------------+----------+----------+---------+ | i7-8700 Linux | 1151.44 | 1374.20 | +19.3% | | Apple M1 macOS | 1246.44 | 1638.12 | +31.0% | Based on the function tracing results obtained from the uftrace tool, in the original memory design, the 'do_lw' function takes approximately 9 us to allocate data memory and read data. Even when data memory allocation is not required, it still takes around 1.7 us to read data. However, with the proposed design, reading data without allocating any data memory only takes about 1 us. The same improvement applies to the 'do_sw' function as well. * original memory R/W [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.265 us [3659971] | rv_userdata(); 0.255 us [3659971] | memory_read_w(); 9.626 us [3659971] | } /* on_mem_read_w */ ... [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.260 us [3659971] | rv_userdata(); 0.235 us [3659971] | memory_read_w(); 1.485 us [3659971] | } /* on_mem_read_w */ * proposed memory R/W [3658503] | do_lw() { [3658503] | on_mem_read_w() { 0.265 us [3658503] | memory_read_w(); 1.025 us [3658503] | } /* on_mem_read_w */ TODO: inline the essential memory operations
In the previous memory design, memory allocation was not initially performed, which led to the necessity of checking whether memory had been allocated before reading or writing data. This situation resulted in a deterioration in the performance of memory read/write instructions. To overcome this limitation, a new memory design has been implemented, which directly maps a memory size of 2^32 bytes. With this enhancement, every memory read/write operation can directly access memory without the need for any checks. Below are Dhrystone statistics (DMIPS) that demonstrate the performance improvement achieved by the new memory design in handling memory R/W instructions. | Environment | original | proposed | Speedup | |----------------+----------+----------+---------+ | i7-8700 Linux | 1151.44 | 1374.20 | +19.3% | | Apple M1 macOS | 1246.44 | 1638.12 | +31.0% | Based on the function tracing results obtained from the uftrace tool, in the original memory design, the 'do_lw' function takes approximately 9 us to allocate data memory and read data. Even when data memory allocation is not required, it still takes around 1.7 us to read data. However, with the proposed design, reading data without allocating any data memory only takes about 1 us. The same improvement applies to the 'do_sw' function as well. * original memory R/W [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.265 us [3659971] | rv_userdata(); 0.255 us [3659971] | memory_read_w(); 9.626 us [3659971] | } /* on_mem_read_w */ ... [3659971] | do_lw() { [3659971] | on_mem_read_w() { 0.260 us [3659971] | rv_userdata(); 0.235 us [3659971] | memory_read_w(); 1.485 us [3659971] | } /* on_mem_read_w */ * proposed memory R/W [3658503] | do_lw() { [3658503] | on_mem_read_w() { 0.265 us [3658503] | memory_read_w(); 1.025 us [3658503] | } /* on_mem_read_w */ TODO: inline the essential memory operations
In the current memory design, we did not allocate any memory initially. As a result, we had to check whether memory was allocated before reading or writing data, leading to a deterioration in the performance of memory I/O instructions. To overcome this limitation, we have implemented a new memory design that directly maps a memory size of$2^{32}$ bytes. With this enhancement, every memory read or write operation can directly access memory without the need for any checks.
The following statistics demonstrate the performance improvements achieved by the proposed memory I/O operations.