Dirent lookup speedup #430
Conversation
0a88bd8 → 2689bdf (force-push)
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #430      +/-   ##
==========================================
+ Coverage   47.58%   48.34%   +0.75%
==========================================
  Files          74       75       +1
  Lines        3314     3355      +41
  Branches     1450     1468      +18
==========================================
+ Hits         1577     1622      +45
+ Misses       1737     1731       -6
- Partials        0        2       +2
```
Continue to review full report at Codecov.
a60a100 → e78b02f (force-push)
Before you go too far with this PR, it would be nice to split it into two PRs: one moving the existing code into the new DirentLookup (without changing the algorithm), and another one adding the "fanout" algorithm (and it seems you don't follow the description provided in #424; whatever is the best idea, we will probably discuss it here).
Please see the benchmarking results of this draft PR before deciding how to proceed from here.
@veloman-yunkan Thank you for the benchmark. If I understand properly, from a speed perspective we should better have a big dirent cache, but what does that imply in terms of memory consumption? How much additional memory is needed to store 1024 Dirent entries? EDIT: To be complete, it would be really helpful to have your PR run with 1536 Dirents in the cache.
The idea of splitting the PR is still valid. The first PR should not change the performance (but improve the code organization), and the second PR will work on performance (and let us discuss different algorithm possibilities). I'm a bit surprised by the benchmark results.
027a760 → 62249c7 (force-push)
e78b02f → 61f034c (force-push)
2689bdf → c9c3215 (force-push)
@kelson42 Done
@mgautierfr @veloman-yunkan Thank you. I'm still unsure about the memory impact of
@Kelson This PR doesn't cache the dirents. It caches only the data needed for the initial stage of the lookup. In the current prototype version the item URL/path is fully cached. This proof-of-concept is not optimized for memory usage, and there is a way to reduce it significantly. A finely tuned implementation will use a couple of dozen bytes per lookup cache entry.
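The "couple of dozen bytes per entry" estimate can be made concrete with a sketch of key shortening (the idea is elaborated later in the PR description): instead of storing a dirent's full path in the cache, store only the shortest string that still separates it from its predecessor. This is an illustration with a hypothetical function name, not libzim's actual code.

```cpp
#include <cassert>
#include <string>

// Return the shortest prefix of `next` that is strictly greater than `prev`.
// Such a string is enough to decide on which side of the cache entry any
// query path falls, while storing only a few characters per entry.
// (Illustrative sketch; not the actual zim::NarrowDown implementation.)
std::string shortestSeparator(const std::string& prev, const std::string& next) {
    std::string result;
    for (size_t i = 0; i < next.size(); ++i) {
        result += next[i];
        if (result > prev)
            return result;  // shortest prefix of next that exceeds prev
    }
    return result;  // degenerate case (next <= prev); shouldn't happen for sorted input
}
```

With the two consecutive URLs from the example in the PR description, this yields the short key `AList_of_G` instead of a 30+ character path.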
1536 is not a magic number, and it is hardly the sweet spot for a typical runtime environment. Using a larger number will likely result in even more speed-up on average.
I had the feeling the speed progression here might be asymptotic (and not linear), and therefore we could keep a pre-defined value which works pretty well (but not optimally) everywhere. Otherwise we would be back to our discussion on #311, I believe. I will let you sort out the details with @mgautierfr. Just wanted to emphasize that (1) the improvements seem good and (2) if the memory cost is not too big, we should not be afraid to significantly increase the
@kelson42 This PR greatly reduces the benefit of increasing the size of the general-purpose dirent cache. Since the dirent cache was heavily used during dirent look-up operations, we are better off with a dedicated cache serving that operation more optimally.
In this PR you use a different approach from what I described in #424. Your implementation uses a 1024-item map and splits the whole article range into 1024 balanced subranges. My proposal splits the range into 256 unbalanced subranges (but allows subdividing a subrange into 256 sub-subranges). It is a bit more complex to construct (as we build the subranges based on the content, so we have to parse the content), but the lookup is probably a bit faster as it is a simple array indexing. It also has the advantage of storing less data (a 4-byte offset per range) compared to yours (a 4-byte offset plus the dirent path). Your implementation is easier to implement (it is already implemented) and builds the cache array faster, but it is potentially slower at looking up dirents once the cache is created. My idea seems better for a long-running process. (But then again, I don't know what the real performance difference between the two ideas is.) As you already said, your PR is a draft and not optimized for memory usage. But I would like to know your opinion here: is your algorithm the "final" idea, or is it just a first implementation of an array cache to get benchmark numbers, after which we decide what is best to implement?
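The two range-narrowing schemes being compared could be sketched as follows (hypothetical types, not libzim's API): scheme A is this PR's balanced grid with an O(log N) ordered lookup; scheme B is the #424-style 256-way fanout with O(1) array indexing.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Scheme A (this PR): N balanced subranges keyed by the path of their first
// dirent; finding the subrange is an O(log N) search in an ordered map.
struct BalancedGrid {
    std::map<std::string, uint32_t> firstPathToIndex;
    uint32_t lowerBound(const std::string& path) const {
        auto it = firstPathToIndex.upper_bound(path);
        if (it == firstPathToIndex.begin()) return 0;
        return std::prev(it)->second;  // last grid entry not exceeding `path`
    }
};

// Scheme B (#424 proposal): 256 unbalanced subranges indexed directly by the
// first byte of the path; finding the subrange is a single array access.
struct Fanout256 {
    uint32_t firstIndexOfByte[257] = {};  // [256] would hold the total count
    uint32_t lowerBound(const std::string& path) const {
        const unsigned char b = path.empty() ? 0 : path[0];
        return firstIndexOfByte[b];
    }
};
```

Scheme A stores one string key per subrange; scheme B stores only a fixed table of offsets, which is why it uses less memory per range but needs a content-driven construction pass.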
This is only a prototype. I definitely know how to improve the memory consumption per cache item, so that we can afford more of them. But that preserves one drawback of this implementation: slow startup due to eagerly reading all those dirents. How serious is that issue? Since kiwix-serve can be started with a library consisting of a lot of ZIM files, is it acceptable to initialize such a lookup data structure eagerly? Should we instead try to fill it on demand?
I don't know :)
For kiwix-serve, I would say it is not an issue as we are in a long-running process. The warmup time will not be so important; we will simply have to wait a bit longer before the server is up. In kiwix-serve, we open all zim files only to locate the xapian database for multizim search. Otherwise we would open zim files only on demand. I think we should have different strategies depending on the usage.
@veloman-yunkan Hi, any news here? We are waiting on this PR to:
@mgautierfr The updated description contains some useful information (I think that it makes sense to copy some of it into the code).
@mgautierfr @veloman-yunkan It is important to move on this because of the overall project(s) timeline, but also because discussing too long has not really proven to be that beneficial in the past. So, at least for this PR, please try to focus on the essentials, and it would be great if we didn't have days of latency between each exchange. It would be really appreciated if this could be merged this week.
This PR is pretty good!!
We may move the PR's description into the commits (commit messages at least, maybe code), but there is nothing to say about the code.
This algorithm uses far less memory than my proposal of the fanout algorithm. We may have to discuss fanout later (if we need it), but for now this PR is OK.
Here are some open questions (the first one to be answered before merging):
- The commit about non-unique urls (see my comment below on the code)
- Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems that this cache may become useless, as we are greatly reducing the search range and there is less chance that we search in the same range again. We may introduce a `url` -> `dirent` cache instead of the current one (`index` -> `dirent`), as some content may be loaded several times (main pages, css, ...)
- Should we keep the namespace cache? Now we use it only to count the number of articles in each namespace. We don't use the boundaries to search content.
Can we please discuss those in a separate ticket?
@kelson42 @mgautierfr Do you have any idea how the failure of the Macos CI builds can be fixed? For all of my recent pushes in this PR the Macos builds failed during the Install packages step (for example, Macos (native_dyn)):
@veloman-yunkan Not really. I have fixed one source of the problem (in git master already since a few days ago)... but now there is another symptom.
I approve the changes.
Please `rebase -i` to squash the `fixup!` commits before merging and we are good.
I'm on it.
43701f2 → 06121dd (force-push)
@mgautierfr Done
DirentLookup now has a lookup-grid which is filled on initialization with the full paths (namespace + path) of equally spaced dirents. The first stage of the lookup for a given namespace+path combination happens in that in-memory lookup-grid (without access to actual dirents), providing the possible range of the searched item. The second stage then continues as before, by reading individual dirents and doing binary search on them.

What is missing in this initial implementation is the detection of ranges that don't contain any dirents (for example, unused namespaces).

One notable change brought by this commit is that a search for an item in a non-existent namespace no longer returns an article index of 0. Instead, the index of the first article from the next available namespace is returned.
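The two-stage lookup described in this commit message can be sketched roughly as follows. This is a simplified model where `allPaths` stands in for the on-disk dirent table (reading an element of it is the expensive IO operation in the real code); the names are illustrative, not the actual DirentLookup API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct GridEntry { std::string fullPath; size_t index; };

// Cache the full path of every step-th dirent (filled once, on initialization).
std::vector<GridEntry> buildGrid(const std::vector<std::string>& allPaths,
                                 size_t gridSize) {
    std::vector<GridEntry> grid;
    const size_t step = std::max<size_t>(1, allPaths.size() / gridSize);
    for (size_t i = 0; i < allPaths.size(); i += step)
        grid.push_back({allPaths[i], i});
    return grid;
}

size_t lookup(const std::vector<GridEntry>& grid,
              const std::vector<std::string>& allPaths,
              const std::string& path) {
    // Stage 1: narrow the range using only the in-memory grid (no IO).
    size_t lo = 0, hi = allPaths.size();
    for (const auto& e : grid) {
        if (e.fullPath <= path) lo = e.index;
        else { hi = e.index; break; }
    }
    // Stage 2: binary search over the (much smaller) dirent range, as before.
    auto it = std::lower_bound(allPaths.begin() + lo, allPaths.begin() + hi, path);
    return it - allPaths.begin();
}
```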
The size of the dirent lookup cache can be set via the ZIM_DIRENTLOOKUPCACHE environment variable.
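Reading such an environment-variable override typically looks like the following sketch. The default value of 1024 and the parsing details are assumptions for illustration, not necessarily libzim's actual behavior.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Return the dirent lookup cache size, honoring the ZIM_DIRENTLOOKUPCACHE
// environment variable when it holds a positive integer, and falling back to
// a default otherwise. (Sketch; the real parsing code may differ.)
unsigned direntLookupCacheSize(unsigned defaultSize = 1024) {
    const char* env = std::getenv("ZIM_DIRENTLOOKUPCACHE");
    if (!env) return defaultSize;
    char* end = nullptr;
    const unsigned long v = std::strtoul(env, &end, 10);
    if (end == env || *end != '\0' || v == 0)
        return defaultSize;  // ignore empty, non-numeric or zero values
    return static_cast<unsigned>(v);
}
```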
test/data/example.zim from the kiwix/kiwix-lib repository contains two articles/items with full URL '-/j/js_modules/jsConfigVars.js'.
Reduced memory usage by zim::NarrowDown via compact storage of string content and avoiding std::string objects.
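The compact storage scheme described in this commit message could look roughly like this: all keys live back to back in one buffer as NUL-terminated C strings, addressed by 4-byte offsets, instead of a container of `std::string` objects (each of which costs a heap allocation plus several pointers of overhead). An illustrative sketch with hypothetical names, not the actual `zim::NarrowDown` code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

class CompactStrings {
    std::vector<char> data;         // all strings, back to back, NUL-terminated
    std::vector<uint32_t> offsets;  // start of each string within `data`
public:
    void add(const std::string& s) {
        offsets.push_back(static_cast<uint32_t>(data.size()));
        // copy the characters plus the terminating NUL
        data.insert(data.end(), s.c_str(), s.c_str() + s.size() + 1);
    }
    const char* get(size_t i) const { return data.data() + offsets[i]; }
    size_t count() const { return offsets.size(); }
};
```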
06121dd → d6d5b22 (force-push)
@mgautierfr @veloman-yunkan I think we can merge now.
This PR addresses #410 & #424
The speed-up of dirent look-up is achieved via an auxiliary special-purpose in-memory cache (new helper class `zim::NarrowDown`). It stores the full urls (ns+path) and indices of uniformly spaced dirents and is filled upon opening the ZIM file. Then the dirent look-up operation is performed in two stages:

1. The range of dirents containing the target url is narrowed down using only the in-memory cache (no dirents are read).
2. Binary search proceeds as before, but within that much smaller range of dirents.

Since the speed-up is due to avoiding expensive IO operations, the goal is to fit as much data as possible in the in-memory cache. This is achieved as follows:

- The entries are stored in a plain sorted array rather than an ordered (`std::map`) data structure.
- The urls are not stored as individual `std::string` objects (which have an overhead of a couple of pointers). Instead they are packed sequentially as C-style strings in a single array, and 4-byte offsets into that array are used.

The idea of shortening the URLs is illustrated on an example:

#1234: AList_of_Climate_Change_Initiatives
#1235: AList_of_Greta_Thunberg_speeches

The example above illustrates two consecutive entries from the wikipedia_en_climate_change_nopic_2020-01.zim ZIM file (though the index values are not real). If we were to use the item #1234 in our cache, the naive approach would be to store its full URL as is. However, let's imagine that the ZIM file also contained an item with full URL AList_of_G. It would have to reside between AList_of_Climate_Change_Initiatives and AList_of_Greta_Thunberg_speeches. So we can pretend that such an item exists and store in our cache the fictitious entry (AList_of_G, 1234.5). When we arrive at that entry during the range narrow-down step, we must round the dirent index downward if it is going to be used as the lower bound of the range, and upward if it is going to be used as the upper bound.

Benchmark results
Since this optimization effectively adds 1024 additional (but static) dirent lookup entries, the benchmark includes two runs of the `master` version with different values for the `ZIM_DIRENTCACHE` setting: 512 (the default) and 1536 (default + 1024).

In summary, this optimization improves the performance of `zimcheck` more than simply increasing the dirent cache size does, and doesn't negatively impact the performance of `kiwix-serve`.

Data used for benchmarking

`zimcheck -A` benchmark

`kiwix-serve` benchmark

The benchmark was performed using the `benchmark_kiwix_serve` script as follows:

That script starts a kiwix server and fetches from it about 1000 articles using `wget --recursive -l 1`. (It was verified that the output directories created by wget from both runs are identical, with the exception of `/random`.)