Consolidate memory R/W operations #151

qwe661234 · 2023-07-06T13:14:30Z

The current memory design does not allocate any memory up front. As a result, every read or write must first check whether the target memory has been allocated, which degrades the performance of memory I/O instructions. To overcome this limitation, we have implemented a new memory design that directly maps $2^{32}$ bytes of memory. With this change, every memory read or write operation can access memory directly, without any checks.
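A minimal sketch of the direct-mapping idea, not the code in this pull request: it assumes a POSIX host with `mmap`, and `flat_mem_t` and the `flat_mem_*` functions are hypothetical names chosen for illustration. The point is that a 32-bit guest address indexes the mapping directly, with no allocation check on the access path.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* expose MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical type for illustration; not the rv32emu API. */
typedef struct {
    uint8_t *base; /* start of the mapped guest address space */
    size_t size;   /* number of bytes mapped */
} flat_mem_t;

/* Map `size` bytes of zero-filled anonymous memory in a single call. */
static flat_mem_t *flat_mem_new(size_t size)
{
    assert(size > 0);
    flat_mem_t *m = malloc(sizeof(*m));
    if (!m)
        return NULL;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        free(m);
        return NULL;
    }
    m->base = p;
    m->size = size;
    return m;
}

static void flat_mem_delete(flat_mem_t *m)
{
    if (!m)
        return;
    munmap(m->base, m->size);
    free(m);
}

/* No allocation check: the address indexes the mapping directly.
 * memcpy keeps unaligned accesses well-defined. A word whose last byte
 * would cross the end of the mapping is not handled in this sketch. */
static inline uint32_t flat_mem_read_w(const flat_mem_t *m, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, m->base + addr, sizeof(v));
    return v;
}

static inline void flat_mem_write_w(flat_mem_t *m, uint32_t addr, uint32_t v)
{
    memcpy(m->base + addr, &v, sizeof(v));
}
```

On a 64-bit host, calling `flat_mem_new((size_t) 1 << 32)` would cover every 32-bit guest address. Anonymous mappings are paged in by the operating system on first touch, so the emulator itself needs no per-access bookkeeping.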

The following statistics demonstrate the performance improvements achieved by the proposed memory I/O operations.

Metric	original	proposed	Speedup
Dhrystone (DMIPS)	1151.44	1374.20	+19.3%

src/elf.c

jserv · 2023-07-06T15:59:56Z

Use the uftrace tool to analyze function traces.

Apply the following changes:

--- a/Makefile
+++ b/Makefile
@@ -7,6 +7,7 @@ BIN := $(OUT)/rv32emu
 CFLAGS = -std=gnu99 -O2 -Wall -Wextra
 CFLAGS += -Wno-unused-label
 CFLAGS += -include src/common.h
+CFLAGS += -g -pg
 
 # Set the default stack pointer
 CFLAGS += -D DEFAULT_STACK_ADDR=0xFFFFF000
@@ -80,7 +81,7 @@ endif
 
 # For tail-call elimination, we need a specific set of build flags applied.
 # FIXME: On macOS + Apple Silicon, -fno-stack-protector might have a negative impact.
-$(OUT)/emulate.o: CFLAGS += -fomit-frame-pointer -fno-stack-check -fno-stack-protector
+$(OUT)/emulate.o: CFLAGS += -fno-stack-check -fno-stack-protector
 
 # Clear the .DEFAULT_GOAL special variable, so that the following turns
 # to the first target after .DEFAULT_GOAL is not set.

Then run make clean all.

Run

$ uftrace build/rv32emu build/hello.elf

The reference output:

            [233860] |   memory_new() {
  13.068 us [233860] |     calloc();
            [233860] |     mpool_create() {
   0.503 us [233860] |       malloc();
   0.212 us [233860] |       getpagesize();
   3.541 us [233860] |       mmap();
  17.226 us [233860] |     } /* mpool_create */
  31.291 us [233860] |   } /* memory_new */

Please show the corresponding difference based on function tracing.

qwe661234 · 2023-07-07T08:21:48Z

Use the uftrace tool to analyze function traces.

Based on the function tracing results from the uftrace tool, in the original memory design, do_lw takes approximately 9 us when it has to allocate data memory before reading data. Even when no allocation is required, it still takes around 1.7 us to read data. With the proposed design, reading data takes only about 1 us, because no data memory is allocated on the access path. The same applies to do_sw.
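To illustrate why the original path has both a slow case and a fast case, here is a hedged sketch of a lazily allocated, chunked memory. It is not the original rv32emu code: the 64 KiB chunk size and the `lazy_mem_*` names are assumptions made for illustration. Every access pays for the allocation check, and the first access to a chunk also pays for the allocation itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout: the 32-bit address space is split into 64 KiB chunks. */
#define CHUNK_BITS 16
#define CHUNK_SIZE (1u << CHUNK_BITS)
#define CHUNK_COUNT (1u << (32 - CHUNK_BITS))

/* Hypothetical type for illustration; not the rv32emu API. */
typedef struct {
    uint8_t *chunks[CHUNK_COUNT]; /* NULL until first touched */
} lazy_mem_t;

static lazy_mem_t *lazy_mem_new(void)
{
    return calloc(1, sizeof(lazy_mem_t)); /* no data memory allocated yet */
}

static void lazy_mem_delete(lazy_mem_t *m)
{
    if (!m)
        return;
    for (uint32_t i = 0; i < CHUNK_COUNT; i++)
        free(m->chunks[i]);
    free(m);
}

/* Return the chunk holding `addr`, allocating it on first touch. */
static uint8_t *lazy_chunk(lazy_mem_t *m, uint32_t addr)
{
    uint8_t **slot = &m->chunks[addr >> CHUNK_BITS];
    if (!*slot)                        /* check performed on every access */
        *slot = calloc(1, CHUNK_SIZE); /* slow path: first touch only */
    return *slot;
}

static uint32_t lazy_mem_read_w(lazy_mem_t *m, uint32_t addr)
{
    uint32_t off = addr & (CHUNK_SIZE - 1);
    assert(off <= CHUNK_SIZE - sizeof(uint32_t)); /* sketch: no straddling */
    uint8_t *c = lazy_chunk(m, addr);
    uint32_t v = 0;
    if (c)
        memcpy(&v, c + off, sizeof(v));
    return v;
}

static void lazy_mem_write_w(lazy_mem_t *m, uint32_t addr, uint32_t v)
{
    uint32_t off = addr & (CHUNK_SIZE - 1);
    assert(off <= CHUNK_SIZE - sizeof(uint32_t)); /* sketch: no straddling */
    uint8_t *c = lazy_chunk(m, addr);
    if (c)
        memcpy(c + off, &v, sizeof(v));
}
```

In this sketch the first `lazy_mem_read_w` on a chunk goes through `calloc`, while later reads of the same chunk only pay for the NULL check. That matches the shape of the two original timings in the traces, though the actual costs depend on the real implementation.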

sw

original memory design

            [3659971] |    do_sw() {
            [3659971] |        on_mem_write_w() {
   0.255 us [3659971] |            rv_userdata();
   0.220 us [3659971] |            memory_write_w();
   9.141 us [3659971] |        } /* on_mem_write_w */
   ...
            [3659971] |    do_sw() {
            [3659971] |        on_mem_write_w() {
   0.265 us [3659971] |            rv_userdata();
   0.225 us [3659971] |            memory_write_w();
   1.795 us [3659971] |        } /* on_mem_write_w */

proposed memory design

            [3658503] |    do_sw() {
            [3658503] |        on_mem_write_w() {
   0.245 us [3658503] |            memory_write_w();
   1.055 us [3658503] |        } /* on_mem_write_w */

lw

original memory design

            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |        } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

proposed memory design

            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

jserv

Refine the Git commit message to mention the uftrace comparisons.

src/io.c

src/io.h

jserv · 2023-07-07T08:55:38Z

Tested under Apple M1:

[ original ]
Average DMIPS : 1246.44

[ proposed ]
Average DMIPS : 1638.12

That is a 31% speedup.
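As a quick check of that figure, using the two DMIPS averages above and rounding the ratio to three decimal places:

$$\text{speedup} = \frac{\text{proposed}}{\text{original}} - 1 = \frac{1638.12}{1246.44} - 1 \approx 1.314 - 1 = 0.314$$

That is, roughly 31%.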

jserv · 2023-07-07T09:21:47Z

Refer to Apple's operating system as macOS rather than MacOS. It is a matter of brand identity.

jserv · 2023-07-07T09:24:15Z

src/io.h

@@ -8,11 +8,8 @@
 #include <stdint.h>

 typedef struct {


Add some comments to describe the interaction between the RISC-V core and the memory subsystem.

jserv · 2023-07-07T09:26:33Z

Always use the pair "original" vs. "proposed" when proposing non-trivial changes backed by comprehensive comparisons.

jserv

Refine the messages.

Both "memory I/O" and "memory read/write" are commonly used terms in the field of computer architecture and memory management.

If you want to specifically focus on the operations of reading from and writing to memory, "memory read/write" is a more specific term. It highlights the fundamental actions of retrieving data from memory (read) and storing data in memory (write).

In this case, consider the context and use "memory read/write" (or "memory R/W" for short), the term that aligns better.

In the previous memory design, we did not allocate any memory initially. As a result, we had to check whether memory was allocated before reading or writing data, leading to a deterioration in the performance of memory I/O instructions. To overcome this limitation, we have implemented a new memory design that directly maps a memory size of 2^32 bytes. With this enhancement, every memory R/W operation can directly access memory without the need for any checks.

The following statistics demonstrate the significant performance improvement achieved by the new memory design for memory I/O instructions.

| Environment        | Metric    | Old mem design | new mem design | Speedup |
|--------------------+-----------+----------------+----------------+---------|
| Core i7-8700 Linux | Dhrystone | 1151.44 DMIPS  | 1374.20 DMIPS  | +19.3%  |
| Apple M1 macOS     | Dhrystone | 1246.44 DMIPS  | 1638.12 DMIPS  | +31.0%  |

Based on the function tracing results from the uftrace tool, in the original memory design, do_lw takes approximately 9 us to allocate data memory and read data. Even when it doesn't require allocating data memory, it still takes around 1.7 us to read data. However, with the proposed design, we only need about 1 us to read data without allocating any data memory. So does do_sw.

* original memory design
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |        } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory design
            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operation function

jserv · 2023-07-07T10:10:02Z

@qwe661234, The commit messages have been amended. For future pull requests, it is recommended to use more professional commit messages. General statements should avoid the use of 'we'.

In the previous memory design, memory allocation was not initially performed, which led to the necessity of checking whether memory had been allocated before reading or writing data. This situation resulted in a deterioration in the performance of memory read/write instructions. To overcome this limitation, a new memory design has been implemented, which directly maps a memory size of 2^32 bytes. With this enhancement, every memory read/write operation can directly access memory without the need for any checks.

Below are Dhrystone statistics (DMIPS) that demonstrate the performance improvement achieved by the new memory design in handling memory R/W instructions.

| Environment    | original | proposed | Speedup |
|----------------+----------+----------+---------|
| i7-8700 Linux  | 1151.44  | 1374.20  | +19.3%  |
| Apple M1 macOS | 1246.44  | 1638.12  | +31.0%  |

Based on the function tracing results obtained from the uftrace tool, in the original memory design, the 'do_lw' function takes approximately 9 us to allocate data memory and read data. Even when data memory allocation is not required, it still takes around 1.7 us to read data. However, with the proposed design, reading data without allocating any data memory only takes about 1 us. The same improvement applies to the 'do_sw' function as well.

* original memory R/W
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.265 us [3659971] |            rv_userdata();
   0.255 us [3659971] |            memory_read_w();
   9.626 us [3659971] |        } /* on_mem_read_w */
   ...
            [3659971] |    do_lw() {
            [3659971] |        on_mem_read_w() {
   0.260 us [3659971] |            rv_userdata();
   0.235 us [3659971] |            memory_read_w();
   1.485 us [3659971] |        } /* on_mem_read_w */

* proposed memory R/W
            [3658503] |    do_lw() {
            [3658503] |        on_mem_read_w() {
   0.265 us [3658503] |            memory_read_w();
   1.025 us [3658503] |        } /* on_mem_read_w */

TODO: inline the essential memory operations

jserv reviewed Jul 6, 2023


jserv changed the title ~~Improve memory I/O~~ Improve memory I/O operations Jul 6, 2023

qwe661234 force-pushed the improve_memory_io branch from d9e1c2f to 03e9b94 Compare July 7, 2023 08:20

qwe661234 requested a review from jserv July 7, 2023 08:22

jserv requested changes Jul 7, 2023
