
Dirent lookup speedup #430

Merged
merged 8 commits into from
Oct 27, 2020

Conversation

veloman-yunkan
Collaborator

@veloman-yunkan veloman-yunkan commented Sep 27, 2020

This PR addresses #410 & #424

The speed-up of dirent look-up is achieved via an auxiliary special purpose in-memory cache (new helper class zim::NarrowDown). It stores the full urls (ns+path) and indices of uniformly spaced dirents and is filled upon opening the ZIM file. Then the dirent look-up operation is performed in two stages:

  1. Narrow down the range that might contain the dirent with the given full url via a binary search in that cache.
  2. Perform binary search on dirents (with disc access as needed) in that narrower range.
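The two-stage lookup can be sketched as follows (a minimal illustration with hypothetical names, not the actual libzim API; a plain vector of strings stands in for the on-disk dirent table, and reading an element of it models one disk access):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical sketch of the two-stage lookup. 'urls' stands in for the
// sorted dirent table on disk. The cache keeps every k-th full URL with
// its index, so stage 1 runs entirely in memory.
struct CacheEntry { std::string url; std::size_t index; };

std::vector<CacheEntry> buildCache(const std::vector<std::string>& urls,
                                   std::size_t k) {
    std::vector<CacheEntry> cache;
    for (std::size_t i = 0; i < urls.size(); i += k)
        cache.push_back({urls[i], i});
    return cache;
}

// Returns the index of q among the dirent URLs, or urls.size() if absent.
std::size_t lookup(const std::vector<std::string>& urls,
                   const std::vector<CacheEntry>& cache,
                   const std::string& q) {
    // Stage 1: narrow down the range using only the in-memory cache.
    auto it = std::upper_bound(cache.begin(), cache.end(), q,
        [](const std::string& a, const CacheEntry& e) { return a < e.url; });
    const std::size_t hi = (it == cache.end()) ? urls.size() : it->index;
    const std::size_t lo = (it == cache.begin()) ? 0 : std::prev(it)->index;
    // Stage 2: ordinary binary search (with "disk" access), but only in [lo, hi).
    auto pos = std::lower_bound(urls.begin() + lo, urls.begin() + hi, q);
    return (pos != urls.begin() + hi && *pos == q)
        ? static_cast<std::size_t>(pos - urls.begin())
        : urls.size();
}
```

With 1024 cache entries, stage 2 only ever has to search about a 1/1024th slice of the dirent table.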

Since the speed-up is due to avoiding expensive IO operations, the goal is to fit as much data as possible in the in-memory cache. This is achieved as follows:

  1. A sorted vector container is used instead of a tree (std::map) data structure.
  2. Actual URLs (ns+path) are approximated with shortened versions that allow narrowing down the initial full range with almost the same efficiency, except in a few corner cases (see below for an explanation of this trick).
  3. (The shortened) URLs are not stored as std::string objects (which have an overhead of a couple of pointers). Instead they are packed sequentially as C-style strings in a single array, and 4-byte offsets into that array are used.
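The packed storage from point 3 might look roughly like this (hypothetical names; the actual zim::NarrowDown code differs in details):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the compact storage trick: instead of a std::vector<std::string>
// (one heap allocation plus pointer/size/capacity overhead per key), all keys
// are appended to a single char buffer as NUL-terminated C strings, and each
// cache entry keeps only a 4-byte offset into that buffer.
class StringPool {
public:
    uint32_t add(const std::string& s) {
        const uint32_t offset = static_cast<uint32_t>(data_.size());
        data_.insert(data_.end(), s.begin(), s.end());
        data_.push_back('\0');   // keys are read back as C strings
        return offset;
    }
    const char* get(uint32_t offset) const { return data_.data() + offset; }
private:
    std::vector<char> data_;     // all keys, back to back
};
```

A cache entry is then just a pair of 4-byte integers (key offset, dirent index) plus the key text itself, instead of a full std::string object.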

The idea of shortening the URLs is illustrated with an example:

| Dirent # | Full URL (namespace concatenated with path) |
|----------|---------------------------------------------|
| ...      | ...                                         |
| 1234     | AList_of_Climate_Change_Initiatives         |
| 1235     | AList_of_Greta_Thunberg_speeches            |
| ...      | ...                                         |

The example above shows two consecutive entries from the wikipedia_en_climate_change_nopic_2020-01.zim ZIM file (though the index values are not real). If we were to use item #1234 in our cache, the naive approach would be to store its full URL as is. However, suppose the ZIM file also contained an item with the full URL AList_of_G. It would have to reside between AList_of_Climate_Change_Initiatives and AList_of_Greta_Thunberg_speeches. So we can pretend that such an item exists and store the fictitious entry (AList_of_G, 1234.5) in our cache. When we encounter that entry during the range narrow-down step, we must round the dirent index downward if it is used as the lower bound of the range, and upward if it is used as the upper bound.
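One way to implement the rounding (a sketch with hypothetical names; storing doubled indices is just a convenient encoding of half-integer values like 1234.5):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Sketch of the rounding trick (not the actual zim::NarrowDown code).
// A half-integer index like 1234.5 is stored as the doubled value
// 2*1234+1, while a real entry for dirent i stores 2*i. Rounding down
// is then index2/2 and rounding up is (index2+1)/2.
struct Entry {
    std::string key;   // possibly shortened pseudo-URL
    uint32_t index2;   // dirent index, doubled
};

// Returns an inclusive dirent index range [first, last] that must contain
// any dirent whose full URL equals q; it is then searched with disk access.
std::pair<uint32_t, uint32_t>
narrowDown(const std::vector<Entry>& cache, const std::string& q,
           uint32_t direntCount) {
    // First cache entry whose key is strictly greater than q.
    auto it = std::upper_bound(cache.begin(), cache.end(), q,
        [](const std::string& a, const Entry& e) { return a < e.key; });
    const uint32_t last = (it == cache.end())
        ? direntCount - 1
        : (it->index2 + 1) / 2;          // 1234.5 rounds up to 1235
    const uint32_t first = (it == cache.begin())
        ? 0
        : std::prev(it)->index2 / 2;     // 1234.5 rounds down to 1234
    return {first, last};
}
```

With the fictitious entry (AList_of_G, 1234.5) stored as index2 = 2469, a query for AList_of_Greta_Thunberg_speeches gets lower bound 2469/2 = 1234, while a query for AList_of_Climate_Change_Initiatives gets upper bound (2469+1)/2 = 1235; both ranges contain the right dirent.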


Benchmark results

Since this optimization effectively adds 1024 additional (but static) dirent lookup entries, the benchmark includes two runs of the master version with different values for the ZIM_DIRENTCACHE setting - 512 (the default) and 1536 (default + 1024).

In summary, this optimization improves the performance of zimcheck more than simply increasing the dirent cache size does, and doesn't negatively impact the performance of kiwix-serve.

Data used for benchmarking:

| id | ZIM file                                      | size (MB) | article count | cluster count |
|----|-----------------------------------------------|-----------|---------------|---------------|
| 1  | wikipedia_en_climate_change_nopic_2020-01.zim | 31        | 7646          | 51            |
| 2  | wikipedia_hy_all_mini_2020-07.zim             | 560       | 509325        | 1518          |
| 3  | wikipedia_hy_all_mini_2020-08.zim             | 563       | 509611        | 1526          |

zimcheck -A benchmark

| ZIM file id | zimcheck -A runtime (master, DIRENTCACHE=512) | zimcheck -A runtime (master, DIRENTCACHE=1536) | zimcheck -A runtime (this PR, DIRENTCACHE=512) | zimcheck -A runtime (this PR, DIRENTCACHE=1536) |
|---|---|---|---|---|
| 1 | 6.183s    | 5.881s    | 5.693s    | 5.531s    |
| 2 | 6m36.615s | 6m12.248s | 5m51.896s | 5m40.461s |
| 3 | 6m31.202s | 6m16.159s | 5m55.336s | 5m42.636s |

kiwix-serve benchmark

The benchmark was performed using the benchmark_kiwix_serve script as follows:

ZIM_DIRENTCACHE=512 ./benchmark_kiwix_serve master 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim
ZIM_DIRENTCACHE=1536 ./benchmark_kiwix_serve master 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim
ZIM_DIRENTCACHE=512 ./benchmark_kiwix_serve pr430 1000 zimfiles/wikipedia_hy_all_mini_2020-08.zim

That script starts a kiwix server and fetches from it about 1000 articles using wget --recursive -l 1.

| ZIM file id | wget runtime (master, DIRENTCACHE=512) | wget runtime (master, DIRENTCACHE=1536) | wget runtime (this PR, DIRENTCACHE=512) | wget runtime (this PR, DIRENTCACHE=1536) |
|---|---|---|---|---|
| 3 | 36.621s | 35.628s | 36.210s | 36.236s |

(It was verified that the output directories created by wget from both runs are identical, with the exception of /random)

@codecov

codecov bot commented Sep 27, 2020

Codecov Report

Merging #430 into master will increase coverage by 0.75%.
The diff coverage is 98.82%.


@@            Coverage Diff             @@
##           master     #430      +/-   ##
==========================================
+ Coverage   47.58%   48.34%   +0.75%     
==========================================
  Files          74       75       +1     
  Lines        3314     3355      +41     
  Branches     1450     1468      +18     
==========================================
+ Hits         1577     1622      +45     
+ Misses       1737     1731       -6     
- Partials        0        2       +2     
| Impacted Files       | Coverage Δ |
|----------------------|------------|
| src/narrowdown.h     | 97.56% <97.56%> (ø) |
| src/dirent_lookup.h  | 98.41% <100.00%> (-1.59%) ⬇️ |
| src/fileimpl.cpp     | 82.94% <100.00%> (+1.84%) ⬆️ |
| src/dirent.cpp       | 95.45% <0.00%> (+2.27%) ⬆️ |
| src/_dirent.h        | 100.00% <0.00%> (+2.94%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6556b0...d6d5b22.

@kelson42 kelson42 requested review from mgautierfr and removed request for mgautierfr September 27, 2020 21:22
@mgautierfr
Collaborator

Before you go much further with this PR, it would be nice to split it into two PRs: one moving the existing code into the new DirentLookup (without changing the algorithm), and another one adding the "fanout" algorithm (and it seems you don't follow the description provided in #424; whichever is the best idea, we will probably discuss it here).

@veloman-yunkan
Collaborator Author

@kelson42 @mgautierfr

Please see the benchmarking results of this draft PR before deciding how to proceed from here.

@kelson42
Contributor

kelson42 commented Sep 30, 2020

@veloman-yunkan Thank you for the benchmark. If I understand properly, from a speed perspective we are better off with a big dirent cache, but what does that imply in terms of memory consumption? How much additional memory is needed to store 1024 dirent entries?

EDIT: To be complete, it would be really helpful to have your PR with 1536 Dirents in the cache.

@mgautierfr
Collaborator

Please see the benchmarking results of this draft PR before deciding how to proceed from here.

The idea of splitting the PR is still valid. The first PR should not change the performance (but improve the code organization), and the second PR will work on performance (and let us discuss the different algorithm possibilities).


I'm a bit surprised by the benchmark results.
zimcheck loops over the dirents without using the find method (binary search), while this PR (the performance part) reduces the range in which to search. So the performance gain should show up on the kiwix-serve side, not in zimcheck.

@veloman-yunkan
Collaborator Author

I'm a bit surprised by the benchmark results.
zimcheck loops over the dirents without using the find method (binary search)

zimcheck -A includes URL validation which checks that all internal links in the ZIM file are valid.

Base automatically changed from fix_for_issue407 to master September 30, 2020 12:15
@veloman-yunkan veloman-yunkan changed the base branch from master to dirent_lookup_helper September 30, 2020 12:17
@veloman-yunkan veloman-yunkan changed the base branch from dirent_lookup_helper to master September 30, 2020 12:39
@veloman-yunkan veloman-yunkan changed the base branch from master to dirent_lookup_helper September 30, 2020 12:39
@veloman-yunkan
Collaborator Author

veloman-yunkan commented Sep 30, 2020

To be complete, it would be really helpful to have your PR with 1536 Dirents in the cache.

@kelson42 Done

@kelson42
Contributor

@mgautierfr @veloman-yunkan Thank you. I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

@veloman-yunkan
Collaborator Author

If I understand properly, from a speed perspective we are better off with a big dirent cache, but what does that imply in terms of memory consumption? How much additional memory is needed to store 1024 dirent entries?

@kelson42 This PR doesn't cache the dirents. It caches only the data needed for the initial stage of the lookup. In the current prototype version the item URL/path is fully cached. This proof-of-concept is not optimized for memory usage, and there is a way to reduce it significantly. A finely tuned implementation will use a couple of dozen bytes per lookup cache entry.

@veloman-yunkan
Collaborator Author

I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

1536 is not a magic number, and it is hardly the sweet spot for a typical runtime environment. Using a larger number will likely result in even more speed-up on average.

@kelson42
Contributor

I'm still unsure about the memory impact of DIRENTCACHE=1536, but if it doesn't take too much memory, shouldn't we just use it?

1536 is not a magic number, and it is hardly the sweet spot for a typical runtime environment. Using a larger number will likely result in even more speed-up on average.

I had a bit of a feeling that the speed progression here might be asymptotic (rather than linear), and therefore we could keep a pre-defined value which works pretty well (though not optimally) everywhere. Otherwise we would be back to our discussion on #311, I believe. I will let you sort out the details with @mgautierfr. I just wanted to emphasize that (1) the improvements seem good, and (2) if the memory cost is not too big, we should not be afraid to increase the DIRENTCACHE size significantly.

@veloman-yunkan
Collaborator Author

@kelson42 This PR will greatly reduce the benefit of increasing the size of the general-purpose dirent cache. Since the dirent cache was heavily used during dirent look-up operations, we are better off with a dedicated, more optimal cache serving that operation.

@mgautierfr
Collaborator

In this PR you use a different approach from the one I describe in #424.

Your implementation uses a 1024-item map and splits the whole article range into 1024 balanced subranges.
But the lookup of the subrange implies an upper_bound call (O(log(1024)), so O(1) as 1024 is constant), which is a bit slower than direct array indexing.

My proposal splits the range into 256 subranges (not balanced), but allows subdividing a subrange into 256 sub-subranges. It is a bit more complex to construct (since we build the subranges based on the content, we have to parse the content), but the lookup is probably a bit faster, as it is a simple array indexing. It also has the advantage of storing less data (a 4-byte offset per range) compared to yours (a 4-byte offset plus the dirent path).
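For comparison, the first level of the fanout idea could be sketched roughly like this (entirely hypothetical code, not taken from #424; URLs are assumed non-empty, which holds for ZIM full URLs since they start with the namespace character):

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Rough sketch of a one-level fanout. A single pass over the sorted full
// URLs builds a 256-entry table indexed by the first byte, so a lookup
// narrows the range with one array access instead of a binary search.
struct Fanout {
    // start[b] = index of the first URL whose first byte is >= b;
    // start[256] = total URL count. The range for byte b is
    // [start[b], start[b+1]).
    std::array<uint32_t, 257> start{};

    void build(const std::vector<std::string>& sortedUrls) {
        uint32_t i = 0;
        for (unsigned b = 0; b <= 256; ++b) {
            while (i < sortedUrls.size() &&
                   static_cast<unsigned char>(sortedUrls[i][0]) < b)
                ++i;
            start[b] = i;
        }
    }

    // Half-open dirent index range that may contain q.
    std::pair<uint32_t, uint32_t> range(const std::string& q) const {
        const unsigned b = static_cast<unsigned char>(q[0]);
        return {start[b], start[b + 1]};
    }
};
```

A hot subrange could then carry its own second-level table keyed on the second byte, giving the 256 sub-subranges mentioned above.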

Your implementation seems easier (it is already implemented) and builds the cache array faster, but it is potentially slower at looking up dirents once the cache is created. My idea seems better for a long-running process. (But then again, I don't know what the real performance difference between the two ideas is.)

As you already said, your PR is a draft and not optimized for memory usage. But I would like to know your view here: is your algorithm the "final" idea, or is it just a first implementation of an array cache to get benchmark numbers, after which we decide what is best to implement?

Base automatically changed from dirent_lookup_helper to master September 30, 2020 13:41
@veloman-yunkan
Collaborator Author

@mgautierfr

Is your algorithm the "final" idea, or is it just a first implementation of an array cache to get benchmark numbers, after which we decide what is best to implement?

This is only a prototype. I definitely know how to reduce the memory consumption per cache item, so that we can afford more of them. But that preserves one drawback of this implementation: slow startup due to eagerly reading all those dirents. How serious is that issue? Since kiwix-serve can be started with a library consisting of many ZIM files, is it acceptable to initialize such a lookup data structure eagerly? Should we instead try to fill it on demand?

@mgautierfr
Collaborator

How serious is that issue?

I don't know :)
Your current implementation is probably not an issue. Filling the cache is just 1024 dirent reads.
While that takes time, it is not that much. (The current implementation, which searches for the start of namespaces with binary search, may read dozens of dirents, and my algorithm would probably need hundreds.)

Since kiwix-serve can be started with a library consisting of a lot of ZIM files, is it acceptable to initialize such a lookup data structure eagerly? Should we instead try to fill it on demand?

On kiwix-serve, I would say it is not an issue, as we are in a long-running process. The warmup time will not matter much; we will simply have to wait a bit longer before the server is up.
In a one-shot process it may matter more, but there it would affect only one zim file.

On kiwix-serve, we open all zim files only to locate the xapian databases for multi-zim search. Otherwise we would open zim files only on demand. I think we should have different strategies depending on the usage
(opening a zim file without the cache to locate the xapian database or for zimdump, opening it with the cache for a long-running process, ...). But this strategy selection will not be made here. For now, I think we can assume that the warmup time is not a real issue (as long as it is reasonable).

@kelson42
Contributor

kelson42 commented Oct 9, 2020

@veloman-yunkan Hi, any news here? We are waiting on this PR to:

  • complete the speedup project on libzim
  • make a new release of libzim and zimtools to benefit from the full speedup

@veloman-yunkan veloman-yunkan marked this pull request as ready for review October 23, 2020 13:23
@veloman-yunkan
Collaborator Author

@mgautierfr The updated description contains some useful information (I think that it makes sense to copy some of it into the code).

@kelson42
Contributor

@mgautierfr @veloman-yunkan It is important to move on with this because of the overall project(s) timeline, but also because discussing for too long has not really proven to be that beneficial in the past. So, at least for this PR, please try to stay focused on the essentials, and it would be great if we didn't have days of latency between each exchange. It would be really appreciated if this could be merged this week.

Collaborator

@mgautierfr mgautierfr left a comment


This PR is pretty good !!

We may move the PR's description into the commits (commit messages at least, maybe code), but there is nothing to say about the code.

This algorithm uses far less memory than my fanout proposal. We may have to discuss fanout later (if we need it), but for now this PR is ok.

Here are some open questions (the first one to be answered before merging):

  • The commit about non-unique URLs (see my comment on the code below)
  • Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems this cache may become useless, as we are greatly reducing the search range and there is less chance that we search in the same range again.
  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, css, ...)
  • Should we keep the namespace cache? Now we use it only to count the number of articles in each namespace. We don't use the boundaries to search content.

@veloman-yunkan
Collaborator Author

  • Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems this cache may become useless, as we are greatly reducing the search range and there is less chance that we search in the same range again.

  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, css, ...)

  • Should we keep the namespace cache? Now we use it only to count the number of articles in each namespace. We don't use the boundaries to search content.

Can we please discuss those in a separate ticket?

@veloman-yunkan
Collaborator Author

@kelson42 @mgautierfr Do you have any idea how the failure of the macOS CI builds can be fixed? For all of my recent pushes in this PR, the macOS builds failed during the Install packages step (for example, macOS (native_dyn)):

Run mstksg/get-package@v1
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error: Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-openssl failed!
Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-python2 failed!
fatal: invalid upstream 'origin/master'
fatal: invalid upstream 'origin/master'
Error: Command failed: brew update && brew install gcovr pkg-config ninja
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Error: Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-openssl failed!
Fetching /usr/local/Homebrew/Library/Taps/local/homebrew-python2 failed!
fatal: invalid upstream 'origin/master'
fatal: invalid upstream 'origin/master'

@kelson42
Contributor

@veloman-yunkan Not really. I fixed one source of the problem (already in git master since a few days ago)... but now there is another symptom.

Collaborator

@mgautierfr mgautierfr left a comment


I approve the changes.
Please rebase -i to squash the fixup! commits before merging and we are good.

@mgautierfr
Collaborator

@kelson42 @mgautierfr Do you have any idea how the failure of the Macos CI builds can be fixed?

I'm on it.
It seems that the GitHub action we use to install packages is failing. Directly using the brew command works.
I'm updating the CI in all our repositories.

@veloman-yunkan
Collaborator Author

Please rebase -i to squash the fixup! commits before merging and we are good.

@mgautierfr Done

DirentLookup now has a lookup-grid which is filled on initialization
with the full paths (namespace + path) of equally spaced dirents. The first
stage of the lookup for a given namespace+path combination happens in
that in-memory lookup-grid (without access to actual dirents), providing
the possible range of the search item. The second stage then continues
as before by reading individual dirents and doing binary search on them.

What is missing in this initial implementation is the detection of
ranges that don't contain any dirents (for example, unused namespaces).

One notable change brought by this commit is that a search for an item
in a non-existent namespace no longer returns an article index of 0.
Instead, the index of the first article from the next available
namespace is returned.

The size of the dirent lookup cache can be set via the
ZIM_DIRENTLOOKUPCACHE environment variable.

test/data/example.zim from the kiwix/kiwix-lib repository contains
two articles/items with the full URL '-/j/js_modules/jsConfigVars.js'.

Reduced memory usage by zim::NarrowDown via compact storage of string
content and avoiding std::string objects.
@kelson42 kelson42 force-pushed the dirent_lookup_speedup branch from 06121dd to d6d5b22 Compare October 27, 2020 13:59
@kelson42
Contributor

@mgautierfr @veloman-yunkan I think we can merge now.


Successfully merging this pull request may close these issues.

Use a "fanout" algorithm to speed up search for dirent
Optimization of random access to dirents
4 participants