Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate memory R/W operations #151

Merged
merged 1 commit into from
Jul 7, 2023

Conversation

qwe661234
Copy link
Collaborator

@qwe661234 qwe661234 commented Jul 6, 2023

In the current memory design, we did not allocate any memory initially. As a result, we had to check whether memory was allocated before reading or writing data, leading to a deterioration in the performance of memory I/O instructions. To overcome this limitation, we have implemented a new memory design that directly maps a memory size of $2^{32}$ bytes. With this enhancement, every memory read or write operation can directly access memory without the need for any checks.

The following statistics demonstrate the performance improvements achieved by the proposed memory I/O operations.

Metric original proposed Speedup
Dhrystone 1151.44 1374.20 +19.3%

@jserv
Copy link
Contributor

jserv commented Jul 6, 2023

Utilize uftrace tool to analyze function tracing.

Apply the following changes:

--- a/Makefile
+++ b/Makefile
@@ -7,6 +7,7 @@ BIN := $(OUT)/rv32emu
 CFLAGS = -std=gnu99 -O2 -Wall -Wextra
 CFLAGS += -Wno-unused-label
 CFLAGS += -include src/common.h
+CFLAGS += -g -pg
 
 # Set the default stack pointer
 CFLAGS += -D DEFAULT_STACK_ADDR=0xFFFFF000
@@ -80,7 +81,7 @@ endif
 
 # For tail-call elimination, we need a specific set of build flags applied.
 # FIXME: On macOS + Apple Silicon, -fno-stack-protector might have a negative impact.
-$(OUT)/emulate.o: CFLAGS += -fomit-frame-pointer -fno-stack-check -fno-stack-protector
+$(OUT)/emulate.o: CFLAGS += -fno-stack-check -fno-stack-protector
 
 # Clear the .DEFAULT_GOAL special variable, so that the following turns
 # to the first target after .DEFAULT_GOAL is not set.

And then make clean all.

Run

$ uftrace build/rv32emu build/hello.elf 

The reference output:

            [233860] |   memory_new() {
  13.068 us [233860] |     calloc();
            [233860] |     mpool_create() {
   0.503 us [233860] |       malloc();
   0.212 us [233860] |       getpagesize();
   3.541 us [233860] |       mmap();
  17.226 us [233860] |     } /* mpool_create */
  31.291 us [233860] |   } /* memory_new */

You shall show the corresponding difference based on function tracing.

@jserv jserv changed the title Improve memory I/O Improve memory I/O operations Jul 6, 2023
@qwe661234 qwe661234 force-pushed the improve_memory_io branch from d9e1c2f to 03e9b94 Compare July 7, 2023 08:20
@qwe661234
Copy link
Collaborator Author

qwe661234 commented Jul 7, 2023

Utilize uftrace tool to analyze function tracing.

Based on the function tracing results from the uftrace tool, in the original memory design, do_lw takes approximately 9 us to allocate data memory and read data. Even when it doesn't require allocating data memory, it still takes around 1.7 us to read data. However, with the proposed design, we only need about 1 us to read data without allocating any data memory. Hence, the same applies to do_sw.

sw

original memory design

            [3659971] |    do_sw() {
            [3659971] |        on_mem_write_w() {
   0.255 us [3659971] |            rv_userdata();
   0.220 us [3659971] |            memory_write_w();
   9.141 us [3659971] |        } /* on_mem_write_w */
   ...
            [3659971] |    do_sw() {
            [3659971] |        on_mem_write_w() {
   0.265 us [3659971] |            rv_userdata();
   0.225 us [3659971] |            memory_write_w();
   1.795 us [3659971] |        } /* on_mem_write_w */

proposed memory design

            [3658503] |    do_sw() {
            [3658503] |        on_mem_write_w() {
   0.245 us [3658503] |            memory_write_w();
   1.055 us [3658503] |        } /* on_mem_write_w */

lw

original memory design

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |         } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

proposed memory design

       
            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

@qwe661234 qwe661234 requested a review from jserv July 7, 2023 08:22
Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine the git commit message by mentioning uftrace comparisons.

@jserv jserv changed the title Improve memory I/O operations Consolidate memory I/O operations Jul 7, 2023
@jserv
Copy link
Contributor

jserv commented Jul 7, 2023

Tested under Apple M1:

[ original ]
Average DMIPS : 1246.44

[ proposed ]
Average DMIPS : 1638.12

That is, 31% speedup.

@qwe661234 qwe661234 force-pushed the improve_memory_io branch from 03e9b94 to c065325 Compare July 7, 2023 09:07
@qwe661234 qwe661234 requested a review from jserv July 7, 2023 09:09
@jserv
Copy link
Contributor

jserv commented Jul 7, 2023

You shall refer the operating system by Apple to macOS rather than MacOS. It is a matter of brand identity.

@@ -8,11 +8,8 @@
#include <stdint.h>

typedef struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments to describe the interaction between RISC-V core and the memory subsystem.

@jserv
Copy link
Contributor

jserv commented Jul 7, 2023

Always use the pair "original" vs. "proposed" while you are about to propose non-trivial changes by comprehensive comparisons.

@jserv jserv changed the title Consolidate memory I/O operations Consolidate memory R/W operations Jul 7, 2023
Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refine the messages.

Both "memory I/O" and "memory read/write" are commonly used terms in the field of computer architecture and memory management.

If you want to specifically focus on the operations of reading from and writing to memory, "memory read/write" is a more specific term. It highlights the fundamental actions of retrieving data from memory (read) and storing data in memory (write).

In this case, consider the context to use "memory read/write" (or "memory R/W" for short) which term aligns better.

In the previous memory design, we did not allocate any memory initially.
As a result, we had to check whether memory was allocated before reading
or writing data, leading to a deterioration in the performance of memory
I/O instructions. To overcome this limitation, we have implemented a new
memory design that directly maps a memory size of 2^32 bytes. With this
enhancement, every memory R/W operation can directly access memory without
the need for any checks.

The following statistics demonstrate the significant performance improvement
achieved by the new memory design for memory I/O instructions.

|   Environment    |  Metric  | Old mem desgin | new mem desgin |Speedup|
|------------------+----------+----------------+----------------+-------|
|Core i7-8700 Linux| Dhrystone| 1151.44 DMIPS  | 1374.20 DMIPS  | +19.3%|
|  Apple M1 macOS  | Dhrystone| 1246.44 DMIPS  | 1638.12 DMIPS  | +31.0%|

Based on the function tracing results from the uftrace tool, in the original
memory design, do_lw takes approximately 9 us to allocate data memory and
read data. Even when it doesn't require allocating data memory, it still
takes around 1.7 us to read data. However, with the proposed design, we
only need about 1 us to read data without allocating any data memory.
So does do_sw

* original memory design

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |         } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory design

            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operation function
@qwe661234 qwe661234 force-pushed the improve_memory_io branch from c065325 to dc95ca4 Compare July 7, 2023 09:50
@qwe661234 qwe661234 requested a review from jserv July 7, 2023 09:55
@jserv jserv merged commit ac3b122 into sysprog21:master Jul 7, 2023
@jserv
Copy link
Contributor

jserv commented Jul 7, 2023

@qwe661234, The commit messages have been amended. For future pull requests, it is recommended to use more professional commit messages. General statements should avoid the use of 'we'.

jserv pushed a commit that referenced this pull request Jul 7, 2023
In the previous memory design, memory allocation was not initially
performed, which led to the necessity of checking whether memory had
been allocated before reading or writing data. This situation resulted
in a deterioration in the performance of memory read/write instructions.
To overcome this limitation, a new memory design has been implemented,
which directly maps a memory size of 2^32 bytes. With this enhancement,
every memory read/write operation can directly access memory without the
need for any checks.

Below are Dhrystone statistics (DMIPS) that demonstrate the performance
improvement achieved by the new memory design in handling memory R/W
instructions.

|   Environment  | original | proposed | Speedup |
|----------------+----------+----------+---------+
| i7-8700 Linux  |  1151.44 |  1374.20 |  +19.3% |
| Apple M1 macOS |  1246.44 |  1638.12 |  +31.0% |

Based on the function tracing results obtained from the uftrace tool,
in the original memory design, the 'do_lw' function takes approximately
9 us to allocate data memory and read data. Even when data memory
allocation is not required, it still takes around 1.7 us to read data.
However, with the proposed design, reading data without allocating any
data memory only takes about 1 us. The same improvement applies to the
'do_sw' function as well.

* original memory R/W

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |         } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory R/W

            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operations
@qwe661234 qwe661234 deleted the improve_memory_io branch July 8, 2023 05:18
2011eric pushed a commit to 2011eric/rv32emu that referenced this pull request Jul 22, 2023
In the previous memory design, memory allocation was not initially
performed, which led to the necessity of checking whether memory had
been allocated before reading or writing data. This situation resulted
in a deterioration in the performance of memory read/write instructions.
To overcome this limitation, a new memory design has been implemented,
which directly maps a memory size of 2^32 bytes. With this enhancement,
every memory read/write operation can directly access memory without the
need for any checks.

Below are Dhrystone statistics (DMIPS) that demonstrate the performance
improvement achieved by the new memory design in handling memory R/W
instructions.

|   Environment  | original | proposed | Speedup |
|----------------+----------+----------+---------+
| i7-8700 Linux  |  1151.44 |  1374.20 |  +19.3% |
| Apple M1 macOS |  1246.44 |  1638.12 |  +31.0% |

Based on the function tracing results obtained from the uftrace tool,
in the original memory design, the 'do_lw' function takes approximately
9 us to allocate data memory and read data. Even when data memory
allocation is not required, it still takes around 1.7 us to read data.
However, with the proposed design, reading data without allocating any
data memory only takes about 1 us. The same improvement applies to the
'do_sw' function as well.

* original memory R/W

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |         } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory R/W

            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operations
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
In the previous memory design, memory allocation was not initially
performed, which led to the necessity of checking whether memory had
been allocated before reading or writing data. This situation resulted
in a deterioration in the performance of memory read/write instructions.
To overcome this limitation, a new memory design has been implemented,
which directly maps a memory size of 2^32 bytes. With this enhancement,
every memory read/write operation can directly access memory without the
need for any checks.

Below are Dhrystone statistics (DMIPS) that demonstrate the performance
improvement achieved by the new memory design in handling memory R/W
instructions.

|   Environment  | original | proposed | Speedup |
|----------------+----------+----------+---------+
| i7-8700 Linux  |  1151.44 |  1374.20 |  +19.3% |
| Apple M1 macOS |  1246.44 |  1638.12 |  +31.0% |

Based on the function tracing results obtained from the uftrace tool,
in the original memory design, the 'do_lw' function takes approximately
9 us to allocate data memory and read data. Even when data memory
allocation is not required, it still takes around 1.7 us to read data.
However, with the proposed design, reading data without allocating any
data memory only takes about 1 us. The same improvement applies to the
'do_sw' function as well.

* original memory R/W

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |         } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory R/W

            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants