
Commit ed1945c

Merge commit; 2 parents: a7ade20 + 7c29e17

10 files changed: +16 -13 lines

content/english/hpc/compilation/flags.md (+1 -1)
@@ -12,7 +12,7 @@ There are 4 *and a half* main levels of optimization for speed in GCC:
 
 - `-O0` is the default one that does no optimizations (although, in a sense, it does optimize: for compilation time).
 - `-O1` (also aliased as `-O`) does a few "low-hanging fruit" optimizations, almost not affecting the compilation time.
-- `-O2` enables all optimizations that are known to have little to no negative side effects and take reasonable time to complete (this is what most projects use for production builds).
+- `-O2` enables all optimizations that are known to have little to no negative side effects and take a reasonable time to complete (this is what most projects use for production builds).
 - `-O3` does very aggressive optimization, enabling almost all *correct* optimizations implemented in GCC.
 - `-Ofast` does everything in `-O3`, plus a few more optimizations flags that may break strict standard compliance, but not in a way that would be critical for most applications (e.g., floating-point operations may be rearranged so that the result is off by a few bits in the mantissa).
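For quick reference, these levels are just flags on the compiler invocation; a sketch of what using each looks like (the file name is illustrative):

```
g++ -O0 main.cc -o main    # default: fastest compile, no optimization
g++ -O2 main.cc -o main    # the usual production setting
g++ -O3 main.cc -o main    # aggressive but standard-compliant
g++ -Ofast main.cc -o main # -O3 plus non-strict extras such as fast math
```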

content/english/hpc/compilation/situational.md (+1 -1)
@@ -96,7 +96,7 @@ The whole process is automated by modern compilers. For example, the `-fprofile-
 g++ -fprofile-generate [other flags] source.cc -o binary
 ```
 
-After we run the program — preferably on input that is as representative of real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
+After we run the program — preferably on input that is as representative of the real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
 
 ```
 g++ -fprofile-use [other flags] source.cc -o binary
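Put together, the whole profile-guided optimization cycle this hunk describes looks something like this (the input file is a placeholder for whatever representative workload you have):

```
g++ -O2 -fprofile-generate source.cc -o binary
./binary < representative_input.txt   # writes *.gcda profiling data
g++ -O2 -fprofile-use source.cc -o binary
```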

content/english/hpc/data-structures/binary-search.md (+2 -1)
@@ -1,6 +1,7 @@
 ---
 title: Binary Search
 weight: 1
+published: true
 ---
 
 <!-- mention interpolation search and radix trees? -->

@@ -184,7 +185,7 @@ int lower_bound(int x) {
 
 Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely.
 
-As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
+As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
 
 <!-- todo: update numbers -->
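For readers without the surrounding article: the loop these lines refer to is the branchless lower bound. A minimal sketch of it, assuming a global sorted array `t` of size `n` as in the article's other snippets:

```
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // ideally compiled to a cmov
        len -= half;                         // len = ceil(len / 2): rounds up
    }
    return *base;
}
```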

content/english/hpc/external-memory/_index.md (+2 -2)
@@ -19,15 +19,15 @@ When you fetch anything from memory, the request goes through an incredibly comp
 
 -->
 
-When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
+When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency.
 
 Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored:
 
 - If the data is stored in the main memory (RAM), it will take around ~100ns, or about 200 cycles, to fetch it, and then another 200 cycles to write it back.
 - If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest.
 - But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it.
 
-Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
+Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
 
 ![](img/memory-vs-compute.png)
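Numbers like these are easy to reproduce yourself: chase pointers along one big random cycle, so that every load depends on the previous one and neither caching nor prefetching can help. A self-contained sketch (sizes and constants are illustrative):

```
#include <chrono>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int main() {
    const int n = 1 << 24;            // 64 MB of ints, far larger than any cache
    std::vector<int> p(n);
    for (int i = 0; i < n; i++)
        p[i] = i;
    std::mt19937 rng(42);
    for (int i = n - 1; i > 0; i--) { // Sattolo's shuffle: one big random cycle
        int j = rng() % i;
        std::swap(p[i], p[j]);
    }
    auto start = std::chrono::steady_clock::now();
    int k = 0;
    for (int i = 0; i < n; i++)       // each load depends on the previous one
        k = p[k];
    std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    std::printf("%.1f ns per access (checksum: %d)\n", 1e9 * d.count() / n, k);
}
```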

content/english/hpc/external-memory/hierarchy.md (+1 -1)
@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.
 
 ### Non-Volatile Memory
 
-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.
 
 <!-- error correction -->

content/english/hpc/external-memory/model.md (+1 -1)
@@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat
 
 In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution.
 
-We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
+We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
 
 ### Array Scan
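For a sense of scale going into the "Array Scan" section: in this model, reading $N$ consecutive elements touches each block once, so a scan costs

$$
SCAN(N) = O\left(\left\lceil \frac{N}{B} \right\rceil\right) \text{ IOPS,}
$$

meaning that with 1MB blocks, scanning $10^9$ 4-byte integers takes only about 4000 block reads rather than $10^9$ individual element accesses.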

content/english/hpc/number-theory/modular.md (+1 -1)
@@ -100,7 +100,7 @@ $$
 $$
 \begin{aligned}
 a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)}
 \\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
 \\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
 \\\ &= a
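The practical payoff of this identity ($a^p \equiv a$, and hence $a^{p-2} \equiv a^{-1} \pmod p$ when $a$ is not divisible by the prime $p$) is computing modular inverses by exponentiation. A minimal sketch, with the modulus value being just a common example:

```
const int p = 1e9 + 7; // an example prime modulus

int binpow(int a, int n) { // a^n mod p in O(log n) multiplications
    int r = 1;
    while (n) {
        if (n & 1)
            r = (long long) r * a % p;
        a = (long long) a * a % p;
        n >>= 1;
    }
    return r;
}

int inverse(int a) { // Fermat: a^(p-2) = a^(-1) (mod p)
    return binpow(a, p - 2);
}
```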

content/english/hpc/number-theory/montgomery.md (+2 -1)
@@ -1,6 +1,7 @@
 ---
 title: Montgomery Multiplication
 weight: 4
+published: true
 ---
 
 Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.

@@ -287,6 +288,6 @@ int inverse(int _a) {
 }
 ```
 
-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
 
 **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).
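For readers of just this diff: `transform` and `reduce` move numbers into and out of Montgomery space. A condensed sketch of the 32-bit scheme (names and the modulus are illustrative, not necessarily the article's exact code):

```
typedef unsigned int u32;
typedef unsigned long long u64;

const u32 n = 1e9 + 7; // an odd example modulus; r = 2^32 is implicit

u32 nr_init() {        // n^(-1) mod 2^32 by Newton's iteration:
    u32 x = 1;         // each step doubles the number of correct bits
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;
    return x;
}
const u32 nr = nr_init();

u32 reduce(u64 x) {           // returns (x / 2^32) mod n for x < n * 2^32
    u32 q = u32(x) * nr;      // q = x * n^(-1) mod 2^32
    u64 m = (u64) q * n;      // x - m is divisible by 2^32 by construction
    u32 y = (x - m) >> 32;    // wraps around when x < m...
    return x < m ? y + n : y; // ...so add n back in that case
}

u32 transform(u32 x) {        // into Montgomery space: x * 2^32 mod n
    return (u64(x) << 32) % n;
}

u32 multiply(u32 x, u32 y) {  // product of two numbers in Montgomery space
    return reduce((u64) x * y);
}
```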

content/english/hpc/simd/reduction.md (+1 -1)
@@ -46,7 +46,7 @@ int sum_simd(v8si *a, int n) {
 }
 ```
 
-You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.
+You can use this approach for other reductions, such as for finding the minimum or the xor-sum of an array.
 
 ### Instruction-Level Parallelism
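As an example of the "other reductions" the changed sentence mentions, a minimum-reduction in the same vector-extension style might look like this (a sketch that assumes `n` is a multiple of 8 and reuses the article's `v8si` type):

```
#include <limits.h>

typedef int v8si __attribute__ ((vector_size (32)));

int min_simd(v8si *a, int n) {
    v8si m = a[0];
    for (int i = 1; i < n / 8; i++) {
        v8si mask = (a[i] < m);          // per lane: -1 if a[i] < m, else 0
        m = (a[i] & mask) | (m & ~mask); // lane-wise blend of the smaller values
    }
    int res = INT_MAX;
    for (int i = 0; i < 8; i++)          // horizontal reduction of the 8 lanes
        res = (m[i] < res ? m[i] : res);
    return res;
}
```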

content/russian/cs/string-structures/aho-corasick.md (+4 -3)
@@ -1,10 +1,11 @@
 ---
 title: Aho–Corasick Algorithm
 authors:
-- Сергей Слотин
+- Сергей Слотин
 weight: 2
 prerequisites:
-- trie
+- trie
+published: true
 ---
 
 Imagine that we work as journalists in some authoritarian state that controls the media and that, from time to time, passes laws prohibiting the mention of certain political events or the use of certain words. How can such censorship be implemented efficiently in software?

@@ -36,7 +37,7 @@ prerequisites:
 
 **Definition.** The *suffix link* $l(v)$ leads to the vertex $u \neq v$ corresponding to the longest suffix of $v$ accepted by the trie.
 
-**Definition.** The *automaton transition* $\delta(v, c)$ leads to the vertex corresponding to the minimal suffix of the string $v + c$ accepted by the trie.
+**Definition.** The *automaton transition* $\delta(v, c)$ leads to the vertex corresponding to the maximal suffix of the string $v + c$ accepted by the trie.
 
 **Observation.** If the transition already exists in the trie (we will call such transitions *direct*), the automaton transition leads to the same place.
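These two definitions translate directly into the standard BFS construction of the automaton; a compact sketch (not the article's exact code; the alphabet size and naming are illustrative):

```
#include <algorithm>
#include <queue>
#include <string>
#include <vector>

const int K = 26; // alphabet size (assumed: lowercase Latin letters)

struct Vertex {
    int next[K];   // trie edges; after build(), automaton transitions delta(v, c)
    int link = 0;  // suffix link l(v)
    bool leaf = false;
    Vertex() { std::fill(next, next + K, -1); }
};

std::vector<Vertex> t(1); // vertex 0 is the root

void add_string(const std::string &s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (t[v].next[c] == -1) {
            t[v].next[c] = t.size();
            t.emplace_back();
        }
        v = t[v].next[c];
    }
    t[v].leaf = true;
}

// BFS over the trie: direct transitions are kept, and a missing delta(v, c)
// is inherited from the suffix link: delta(v, c) = delta(l(v), c)
void build() {
    std::queue<int> q;
    for (int c = 0; c < K; c++) {
        if (t[0].next[c] == -1)
            t[0].next[c] = 0;     // no edge from the root: stay at the root
        else
            q.push(t[0].next[c]); // depth-1 vertices keep link = 0 (the root)
    }
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int c = 0; c < K; c++) {
            int u = t[v].next[c];
            if (u == -1) {
                t[v].next[c] = t[t[v].link].next[c];
            } else {
                t[u].link = t[t[v].link].next[c]; // l(v) is processed already
                q.push(u);
            }
        }
    }
}
```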
