
Dirent lookup speedup #430

Merged
merged 8 commits into from
Oct 27, 2020

Conversation

veloman-yunkan
Collaborator

@veloman-yunkan veloman-yunkan commented Sep 27, 2020

This PR addresses #410 & #424

The speed-up of dirent look-up is achieved via an auxiliary special purpose in-memory cache (new helper class zim::NarrowDown). It stores the full urls (ns+path) and indices of uniformly spaced dirents and is filled upon opening the ZIM file. Then the dirent look-up operation is performed in two stages:

  1. Narrow down the range that might contain the dirent with the given full url via a binary search in that cache.
  2. Perform binary search on dirents (with disc access as needed) in that narrower range.
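The two-stage lookup can be sketched as follows (a minimal illustration with hypothetical names, not the actual libzim API; a plain vector of strings stands in for the on-disk dirent table, and reading an element of it models one disk access):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical sketch of the two-stage lookup. 'urls' stands in for the
// sorted dirent table on disk. The cache keeps every k-th full URL with
// its index, so stage 1 runs entirely in memory.
struct CacheEntry { std::string url; std::size_t index; };

std::vector<CacheEntry> buildCache(const std::vector<std::string>& urls,
                                   std::size_t k) {
    std::vector<CacheEntry> cache;
    for (std::size_t i = 0; i < urls.size(); i += k)
        cache.push_back({urls[i], i});
    return cache;
}

// Returns the index of q among the dirent URLs, or urls.size() if absent.
std::size_t lookup(const std::vector<std::string>& urls,
                   const std::vector<CacheEntry>& cache,
                   const std::string& q) {
    // Stage 1: narrow down the range using only the in-memory cache.
    auto it = std::upper_bound(cache.begin(), cache.end(), q,
        [](const std::string& a, const CacheEntry& e) { return a < e.url; });
    const std::size_t hi = (it == cache.end()) ? urls.size() : it->index;
    const std::size_t lo = (it == cache.begin()) ? 0 : std::prev(it)->index;
    // Stage 2: ordinary binary search (with "disk" access), but only in [lo, hi).
    auto pos = std::lower_bound(urls.begin() + lo, urls.begin() + hi, q);
    return (pos != urls.begin() + hi && *pos == q)
        ? static_cast<std::size_t>(pos - urls.begin())
        : urls.size();
}
```

With 1024 cache entries, stage 2 only ever has to search about a 1/1024th slice of the dirent table.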

Since the speed-up is due to avoiding expensive IO operations, the goal is to fit as much data as possible in the in-memory cache. This is achieved as follows:

  1. A sorted vector container is used instead of a tree (std::map) data structure.
  2. Actual URLs (ns+path) are approximated with shortened versions that allow narrowing down the initial full range with almost the same efficiency, except in a few corner cases (see below for an explanation of this trick).
  3. (The shortened) URLs are not stored as std::string objects (which have an overhead of a couple of pointers). Instead they are packed sequentially as C-style strings in a single array, and 4-byte offsets into that array are used.
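The packed storage from point 3 might look roughly like this (hypothetical names; the actual zim::NarrowDown code differs in details):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the compact storage trick: instead of a std::vector<std::string>
// (one heap allocation plus pointer/size/capacity overhead per key), all keys
// are appended to a single char buffer as NUL-terminated C strings, and each
// cache entry keeps only a 4-byte offset into that buffer.
class StringPool {
public:
    uint32_t add(const std::string& s) {
        const uint32_t offset = static_cast<uint32_t>(data_.size());
        data_.insert(data_.end(), s.begin(), s.end());
        data_.push_back('\0');   // keys are read back as C strings
        return offset;
    }
    const char* get(uint32_t offset) const { return data_.data() + offset; }
private:
    std::vector<char> data_;     // all keys, back to back
};
```

A cache entry is then just a pair of 4-byte integers (key offset, dirent index) plus the key text itself, instead of a full std::string object.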

The idea of shortening the URLs is illustrated with an example:

| Dirent # | Full URL (namespace concatenated with path) |
|----------|---------------------------------------------|
| ...      | ...                                         |
| 1234     | AList_of_Climate_Change_Initiatives         |
| 1235     | AList_of_Greta_Thunberg_speeches            |
| ...      | ...                                         |

The example above shows two consecutive entries from the wikipedia_en_climate_change_nopic_2020-01.zim ZIM file (though the index values are not real). If we were to use item #1234 in our cache, the naive approach would be to store its full URL as is. However, suppose the ZIM file also contained an item with the full URL AList_of_G. It would have to reside between AList_of_Climate_Change_Initiatives and AList_of_Greta_Thunberg_speeches. So we can pretend that such an item exists and store the fictitious entry (AList_of_G, 1234.5) in our cache. When we encounter that entry during the range narrow-down step, we must round the dirent index downward if it is used as the lower bound of the range, and upward if it is used as the upper bound.
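One way to implement the rounding (a sketch with hypothetical names; storing doubled indices is just a convenient encoding of half-integer values like 1234.5):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Sketch of the rounding trick (not the actual zim::NarrowDown code).
// A half-integer index like 1234.5 is stored as the doubled value
// 2*1234+1, while a real entry for dirent i stores 2*i. Rounding down
// is then index2/2 and rounding up is (index2+1)/2.
struct Entry {
    std::string key;   // possibly shortened pseudo-URL
    uint32_t index2;   // dirent index, doubled
};

// Returns an inclusive dirent index range [first, last] that must contain
// any dirent whose full URL equals q; it is then searched with disk access.
std::pair<uint32_t, uint32_t>
narrowDown(const std::vector<Entry>& cache, const std::string& q,
           uint32_t direntCount) {
    // First cache entry whose key is strictly greater than q.
    auto it = std::upper_bound(cache.begin(), cache.end(), q,
        [](const std::string& a, const Entry& e) { return a < e.key; });
    const uint32_t last = (it == cache.end())
        ? direntCount - 1
        : (it->index2 + 1) / 2;          // 1234.5 rounds up to 1235
    const uint32_t first = (it == cache.begin())
        ? 0
        : std::prev(it)->index2 / 2;     // 1234.5 rounds down to 1234
    return {first, last};
}
```

With the fictitious entry (AList_of_G, 1234.5) stored as index2 = 2469, a query for AList_of_Greta_Thunberg_speeches gets lower bound 2469/2 = 1234, while a query for AList_of_Climate_Change_Initiatives gets upper bound (2469+1)/2 = 1235; both ranges contain the right dirent.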


Benchmark results

Since this optimization effectively adds 1024 additional (but static) dirent lookup entries, the benchmark includes two runs of the master version with different values for the ZIM_DIRENTCACHE setting - 512 (the default) and 1536 (default + 1024).

In summary, this optimization improves the performance of zimcheck more than simply increasing the dirent cache size does, and doesn't negatively impact the performance of kiwix-serve.

Data used for benchmarking:

| id | ZIM file                                      | size (MB) | article count | cluster count |
|----|-----------------------------------------------|-----------|---------------|---------------|
| 1  | wikipedia_en_climate_change_nopic_2020-01.zim | 31        | 7646          | 51            |
| 2  | wikipedia_hy_all_mini_2020-07.zim             | 560       | 509325        | 1518          |
| 3  | wikipedia_hy_all_mini_2020-08.zim             | 563       | 509611        | 1526          |

zimcheck -A benchmark

| ZIM file id | zimcheck -A runtime (master, DIRENTCACHE=512) | zimcheck -A runtime (master, DIRENTCACHE=1536) | zimcheck -A runtime (this PR, DIRENTCACHE=512) | zimcheck -A runtime (this PR, DIRENTCACHE=1536) |
|---|---|---|---|---|
| 1 | 6.183s    | 5.881s    | 5.693s    | 5.531s    |
| 2 | 6m36.615s | 6m12.248s | 5m51.896s | 5m40.461s |
| 3 | 6m31.202s | 6m16.159s | 5m55.336s | 5m42.636s |

kiwix-serve benchmark

The benchmark was performed using the benchmark_kiwix_serve script as follows:

ZIM_DIRENTCACHE=512 ./benchmark_kiwix_serve master 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim
ZIM_DIRENTCACHE=1536 ./benchmark_kiwix_serve master 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim
ZIM_DIRENTCACHE=512 ./benchmark_kiwix_serve pr430 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim

That script starts a kiwix server and fetches from it about 1000 articles using wget --recursive -l 1.

| ZIM file id | wget runtime (master, DIRENTCACHE=512) | wget runtime (master, DIRENTCACHE=1536) | wget runtime (this PR, DIRENTCACHE=512) | wget runtime (this PR, DIRENTCACHE=1536) |
|---|---|---|---|---|
| 3 | 36.621s | 35.628s | 36.210s | 36.236s |

(It was verified that the output directories created by wget from both runs are identical, with the exception of /random)

@codecov

codecov bot commented Sep 27, 2020

Codecov Report

Merging #430 into master will increase coverage by 0.75%.
The diff coverage is 98.82%.


@@            Coverage Diff             @@
##           master     #430      +/-   ##
==========================================
+ Coverage   47.58%   48.34%   +0.75%     
==========================================
  Files          74       75       +1     
  Lines        3314     3355      +41     
  Branches     1450     1468      +18     
==========================================
+ Hits         1577     1622      +45     
+ Misses       1737     1731       -6     
- Partials        0        2       +2     
| Impacted Files       | Coverage Δ |
|----------------------|------------|
| src/narrowdown.h     | 97.56% <97.56%> (ø) |
| src/dirent_lookup.h  | 98.41% <100.00%> (-1.59%) ⬇️ |
| src/fileimpl.cpp     | 82.94% <100.00%> (+1.84%) ⬆️ |
| src/dirent.cpp       | 95.45% <0.00%> (+2.27%) ⬆️ |
| src/_dirent.h        | 100.00% <0.00%> (+2.94%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6556b0...d6d5b22.

@kelson42 kelson42 requested review from mgautierfr and removed request for mgautierfr September 27, 2020 21:22
@mgautierfr
Collaborator

Before you go much further with this PR, it would be nice to split it into two PRs: one moving the existing code into the new DirentLookup (without changing the algorithm), and another one adding the "fanout" algorithm (and it seems you don't follow the description provided in #424; whichever is the best idea, we will probably discuss it here).

@veloman-yunkan
Collaborator Author

@kelson42 @mgautierfr

Please see the benchmarking results of this draft PR before deciding how to proceed from here.

@kelson42
Contributor

kelson42 commented Sep 30, 2020

@veloman-yunkan Thank you for the benchmark. If I understand properly, from a speed perspective we are better off with a big dirent cache, but what does that imply in terms of memory consumption? How much additional memory is needed to store 1024 dirent entries?

EDIT: To be complete, it would be really helpful to have your PR with 1536 Dirents in the cache.

@mgautierfr
Collaborator

Please see the benchmarking results of this draft PR before deciding how to proceed from here.

The idea of splitting the PR is still valid. The first PR should not change the performance (but improve the code organization), and the second PR will work on performance (and let us discuss the different algorithm possibilities).


I'm a bit surprised by the benchmark results.
zimcheck loops over the dirents without using the find method (binary search), while this PR (the performance part) reduces the range in which to search. So the performance gain should show up on the kiwix-serve side, not in zimcheck.

@veloman-yunkan
Collaborator Author

I'm a bit surprised by the benchmark results.
zimcheck loops over the dirents without using the find method (binary search)

zimcheck -A includes URL validation which checks that all internal links in the ZIM file are valid.

Base automatically changed from fix_for_issue407 to master September 30, 2020 12:15
@veloman-yunkan veloman-yunkan changed the base branch from master to dirent_lookup_helper September 30, 2020 12:17
@veloman-yunkan veloman-yunkan changed the base branch from dirent_lookup_helper to master September 30, 2020 12:39
@veloman-yunkan veloman-yunkan changed the base branch from master to dirent_lookup_helper September 30, 2020 12:39
@veloman-yunkan
Collaborator Author

veloman-yunkan commented Sep 30, 2020

To be complete, it would be really helpful to have your PR with 1536 Dirents in the cache.

@kelson42 Done

@kelson42
Contributor

@mgautierfr @veloman-yunkan Thank you. I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

@veloman-yunkan
Collaborator Author

If I understand properly, from a speed perspective we are better off with a big dirent cache, but what does that imply in terms of memory consumption? How much additional memory is needed to store 1024 dirent entries?

@kelson42 This PR doesn't cache the dirents. It caches only the data needed for the initial stage of the lookup. In the current prototype version the item URL/path is fully cached. This proof-of-concept is not optimized for memory usage, and there is a way to reduce it significantly. A finely tuned implementation will use a couple of dozen bytes per lookup cache entry.

@veloman-yunkan
Collaborator Author

I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

1536 is not a magic number, and it is hardly the sweet spot for a typical runtime environment. Using a larger number will likely result in even more speed-up on average.

@kelson42
Contributor

I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

1536 is not a magic number, and it is hardly the sweet spot for a typical runtime environment. Using a larger number will likely result in even more speed-up on average.

I had a bit of a feeling that the speed progression here might be asymptotic (rather than linear), and therefore we could keep a pre-defined value which works pretty well (though not optimally) everywhere. Otherwise we would be back to our discussion on #311, I believe. I will let you sort out the details with @mgautierfr. I just wanted to emphasize that (1) the improvements seem good, and (2) if the memory cost is not too big, we should not be afraid to increase the DIRENTCACHE size significantly.

@veloman-yunkan
Collaborator Author

@kelson42 This PR will greatly reduce the benefit of increasing the size of the general-purpose dirent cache. Since the dirent cache was heavily used during dirent look-up operations, we are better off with a dedicated, more optimal cache serving that operation.

@mgautierfr
Collaborator

In this PR you use a different approach from the one I describe in #424.

Your implementation uses a 1024-item map and splits the whole article range into 1024 balanced subranges.
But the lookup of the subrange implies an upper_bound call (O(log(1024)), so O(1) as 1024 is constant), which is a bit slower than direct array indexing.

My proposal splits the range into 256 subranges (not balanced), but allows subdividing a subrange into 256 sub-subranges. It is a bit more complex to construct (since we build the subranges based on the content, we have to parse the content), but the lookup is probably a bit faster, as it is a simple array indexing. It also has the advantage of storing less data (a 4-byte offset per range) compared to yours (a 4-byte offset plus the dirent path).
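For comparison, the first level of the fanout idea could be sketched roughly like this (entirely hypothetical code, not taken from #424; URLs are assumed non-empty, which holds for ZIM full URLs since they start with the namespace character):

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Rough sketch of a one-level fanout. A single pass over the sorted full
// URLs builds a 256-entry table indexed by the first byte, so a lookup
// narrows the range with one array access instead of a binary search.
struct Fanout {
    // start[b] = index of the first URL whose first byte is >= b;
    // start[256] = total URL count. The range for byte b is
    // [start[b], start[b+1]).
    std::array<uint32_t, 257> start{};

    void build(const std::vector<std::string>& sortedUrls) {
        uint32_t i = 0;
        for (unsigned b = 0; b <= 256; ++b) {
            while (i < sortedUrls.size() &&
                   static_cast<unsigned char>(sortedUrls[i][0]) < b)
                ++i;
            start[b] = i;
        }
    }

    // Half-open dirent index range that may contain q.
    std::pair<uint32_t, uint32_t> range(const std::string& q) const {
        const unsigned b = static_cast<unsigned char>(q[0]);
        return {start[b], start[b + 1]};
    }
};
```

A hot subrange could then carry its own second-level table keyed on the second byte, giving the 256 sub-subranges mentioned above.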

Your implementation seems easier (it is already implemented) and builds the cache array faster, but it is potentially slower at looking up dirents once the cache is created. My idea seems better for a long-running process. (But then again, I don't know what the real performance difference between the two ideas is.)

As you already said, your PR is a draft and not optimized for memory usage. But I would like to know your view here: is your algorithm the "final" idea, or is it just a first implementation of an array cache to get benchmark numbers, after which we decide what is best to implement?

Base automatically changed from dirent_lookup_helper to master September 30, 2020 13:41
@veloman-yunkan
Collaborator Author

@mgautierfr

Is your algorithm the "final" idea, or is it just a first implementation of an array cache to get benchmark numbers, after which we decide what is best to implement?

This is only a prototype. I definitely know how to reduce the memory consumption per cache item, so that we can afford more of them. But that preserves one drawback of this implementation: slow startup due to eagerly reading all those dirents. How serious is that issue? Since kiwix-serve can be started with a library consisting of many ZIM files, is it acceptable to initialize such a lookup data structure eagerly? Should we instead try to fill it on demand?

@mgautierfr
Collaborator

How serious is that issue?

I don't know :)
Your current implementation is probably not an issue. Filling the cache is just 1024 dirent reads.
While that takes time, it is not that much. (The current implementation, which searches for the start of namespaces with binary search, may read dozens of dirents, and my algorithm would probably need hundreds.)

Since kiwix-serve can be started with a library consisting of a lot of ZIM files, is it acceptable to initialize such a lookup data structure eagerly? Should we instead try to fill it on demand?

On kiwix-serve, I would say it is not an issue, as we are in a long-running process. The warmup time will not matter much; we will simply have to wait a bit longer before the server is up.
In a one-shot process it may matter more, but there it would affect only one zim file.

On kiwix-serve, we open all zim files only to locate the xapian databases for multi-zim search. Otherwise we would open zim files only on demand. I think we should have different strategies depending on the usage
(opening a zim file without the cache to locate the xapian database or for zimdump, opening it with the cache for a long-running process, ...). But this strategy selection will not be made here. For now, I think we can assume that the warmup time is not a real issue (as long as it is reasonable).

@kelson42
Contributor

kelson42 commented Oct 9, 2020

@veloman-yunkan Hi, any news here? We are waiting on this PR to:

  • complete the speedup project on libzim
  • make a new release of libzim and zimtools to benefit from the full speedup

@veloman-yunkan veloman-yunkan marked this pull request as ready for review October 23, 2020 13:23
@veloman-yunkan
Collaborator Author

@mgautierfr The updated description contains some useful information (I think that it makes sense to copy some of it into the code).

@kelson42
Contributor

@mgautierfr @veloman-yunkan It is important to move on with this because of the overall project(s) timeline, but also because discussing for too long has not really proven to be that beneficial in the past. So, at least for this PR, please try to stay focused on the essentials, and it would be great if we didn't have days of latency between each exchange. It would be really appreciated if this could be merged this week.

Collaborator

@mgautierfr mgautierfr left a comment


This PR is pretty good !!

We may move the PR's description into the commits (commit messages at least, maybe code), but there is nothing to say about the code.

This algorithm uses far less memory than my fanout proposal. We may have to discuss fanout later (if we need it), but for now this PR is ok.

Here are some open questions (the first one to be answered before merging):

  • The commit about non-unique URLs (see my comment on the code below)
  • Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems this cache may become useless, as we are greatly reducing the search range and there is less chance that we search in the same range again.
  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, css, ...)
  • Should we keep the namespace cache? Now we use it only to count the number of articles in each namespace. We don't use the boundaries to search content.

@veloman-yunkan
Collaborator Author

  • Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems this cache may become useless, as we are greatly reducing the search range and there is less chance that we search in the same range again.

  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, css, ...)

  • Should we keep the namespace cache? Now we use it only to count the number of articles in each namespace. We don't use the boundaries to search content.

Can we please discuss those in a separate ticket?

@veloman-yunkan
Collaborator Author

@kelson42 @mgautierfr Do you have any idea how the failure of the macOS CI builds can be fixed? For all of my recent pushes in this PR, the macOS builds failed during the Install packages step (for example, macOS (native_dyn)):

Run mstksg/get-package@v1
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error: Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-openssl failed!
Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-python2 failed!
fatal: invalid upstream 'origin/master'
fatal: invalid upstream 'origin/master'
Error: Command failed: brew update && brew install gcovr pkg-config ninja
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error: Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-openssl failed!
Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-python2 failed!
fatal: invalid upstream 'origin/master'
fatal: invalid upstream 'origin/master'

@kelson42
Contributor

@veloman-yunkan Not really. I fixed one source of the problem (already in git master since a few days ago)... but now there is another symptom.

Collaborator

@mgautierfr mgautierfr left a comment


I approve the changes.
Please rebase -i to squash the fixup! commits before merging and we are good.

@mgautierfr
Collaborator

@kelson42 @mgautierfr Do you have any idea how the failure of the Macos CI builds can be fixed?

I'm on it.
It seems that the GitHub action we use to install packages is failing. Directly using the brew command works.
I'm updating the CI in all our repositories.

@veloman-yunkan
Collaborator Author

Please rebase -i to squash the fixup! commits before merging and we are good.

@mgautierfr Done

DirentLookup now has a lookup-grid which is filled on initialization
with the full paths (namespace + path) of equally spaced dirents. The first
stage of the lookup for a given namespace+path combination happens in
that in-memory lookup-grid (without access to actual dirents), providing
the possible range of the search item. The second stage then continues
as before by reading individual dirents and doing binary search on them.

What is missing in this initial implementation is the detection of
ranges that don't contain any dirents (for example, unused namespaces).

One notable change brought by this commit is that a search for an item
in a non-existent namespace no longer returns an article index of 0.
Instead, the index of the first article from the next available
namespace is returned.

The size of the dirent lookup cache can be set via the
ZIM_DIRENTLOOKUPCACHE environment variable.

test/data/example.zim from the kiwix/kiwix-lib repository contains
two articles/items with the full URL '-/j/js_modules/jsConfigVars.js'.

Reduced memory usage by zim::NarrowDown via compact storage of string
content and avoiding std::string objects.
@kelson42 kelson42 force-pushed the dirent_lookup_speedup branch from 06121dd to d6d5b22 Compare October 27, 2020 13:59
@kelson42
Contributor

@mgautierfr @veloman-yunkan I think we can merge now.


Successfully merging this pull request may close these issues.

Use a "fanout" algorithm to speed up search for dirent
Optimization of random access to dirents
4 participants