|
| 1 | +mcdb - fast, reliable, simple code to create and read constant databases |
| 2 | + |
| 3 | +mcdb enhancements to cdb (on which mcdb is based; see References section below) |
| 4 | +- updated to C99 and POSIX.1-2001 (not available/portable when djb wrote cdb) |
| 5 | +- optimized for mmap access to constant db (and avoid double buffering) |
| 6 | +- redesigned for use in threaded programs (thread-safe interface available) |
| 7 | +- convenience routines to check for updated constant db and to refresh mmap |
| 8 | +- support cdb > 4 GB with 64-bit program (required to mmap() mcdb > 4 GB) |
| 9 | +- 64-bit safe (for use in 64-bit programs) |
| 10 | + |
| 11 | +Advantages over external database |
| 12 | +- performance: better; avoids context switch to external database process |
| 13 | +Advantages over specialized hash map |
| 14 | +- generic, reusable |
| 15 | +- maintained (created and verified) externally from process (less overhead) |
| 16 | +- shared across processes (though shared-memory could be used for hash map) |
| 17 | +- read-only (though memory pages could also be marked read-only for hash map) |
| 18 | +Disadvantages to specialized hash map |
| 19 | +- performance: slightly slower than specialized hash map |
| 20 | +Disadvantages to djb cdb |
| 21 | +- mmap requires address space be available into which to mmap the const db |
| 22 | + (i.e. large const db might fail to mmap into 32-bit process) |
| 23 | +- mmap page alignment requirements and use of address space limits const db |
| 24 | + max size when created by 32-bit process. Sizes approaching 4 GB may fail. |
| 25 | +- arbitrary limit on each key or data: INT_MAX - 8 bytes (almost 2 GB) |
| 26 | + (djb cdb doc states there is no limit besides cdb fitting into 4 GB) |
| 27 | + (writev() on some platforms in 32-bit exe might also have 2 GB limit) |
| 28 | + |
| 29 | +Incompatibilities with djb cdb |
| 30 | +- padding added at the end of key,value data to 16-byte align hash tables |
| 31 | + (incompatible with djb cdbdump) |
| 32 | +- initial table and hash tables have 8-byte values instead of 4-byte values |
| 33 | + in order to support cdb > 4 GB. cdb uses 24 bytes per record plus 2048, |
| 34 | + whereas mcdb uses 24 bytes per record plus 4096 when data section < 4 GB, |
| 35 | + and mcdb uses 40 bytes per record plus 4096 when data section >= 4 GB. |
| 36 | +- packing of integral lengths into char strings is done big-endian for |
| 37 | + performance in packing/unpacking integer data in 4-byte (or better) |
| 38 | + aligned addresses. (incompatible with all djb cdb* tools and cdb's) |
| 39 | + (djb cdb documents all 32-bit quantities stored in little-endian form) |
| 40 | + Memory load latency is limiting factor, not the x86 assembly instruction |
| 41 | + to convert uint32_t to and from big-endian (when data is 4-byte aligned). |
| 42 | + |
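The big-endian packing described above can be sketched as follows. This is a minimal illustration, not the code mcdb ships; the names pack_be32 and unpack_be32 are hypothetical. It stores a 32-bit value most-significant byte first, the reverse of the little-endian layout documented for djb cdb.

```c
#include <stdint.h>

/* Hypothetical helper: store v into buf[0..3], most-significant byte first. */
static void pack_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)(v);
}

/* Hypothetical helper: read a big-endian 32-bit value from buf[0..3]. */
static uint32_t unpack_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24)
         | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)
         |  (uint32_t)buf[3];
}
```

The byte-wise form is portable to any host byte order. On 4-byte aligned data a compiler may reduce it to a single load plus a byte swap, which fits the point above that memory load latency dominates the cost.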
| 43 | +Limitations |
| 44 | +- 2 billion keys |
| 45 | + As long as djb hash is 32-bit, mcdb_make.c limits number of hash keys to |
| 46 | + 2 billion. cdb handles hash collisions, but there is a small expense each |
| 47 | + collision. As the key space becomes denser within the 2 billion, there is |
| 48 | + greater chance of collisions. Input strings also affect this probability, |
| 49 | + as do the sizes of the hash tables. |
| 50 | +- process must mmap() entire mcdb |
| 51 | + Each mcdb is mmap()d in its entirety into the address space. For 32-bit |
| 52 | + programs that means there is a 4 GB limit on size of mcdb, minus address |
| 53 | + space used by the program (including stack, heap, shared libraries, shmat |
| 54 | + and other mmaps, etc). Compile and link 64-bit to remove this limitation. |
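For reference, the 32-bit hash mentioned under Limitations can be sketched as below, following the hash and table/slot selection documented for djb cdb. That mcdb selects tables and slots in exactly this way is an assumption here, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hash as documented for djb cdb: start at 5381, then h = (h * 33) ^ c. */
static uint32_t djb_hash(const void *key, size_t len)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 5381;
    while (len--)
        h = ((h << 5) + h) ^ *p++;
    return h;
}

/* The low 8 bits of the hash pick one of 256 hash tables. */
static uint32_t djb_table(uint32_t h)
{
    return h & 255;
}

/* The remaining bits, modulo the table length, pick the starting slot.
 * nslots must be nonzero. Collisions are resolved by linear probing. */
static uint32_t djb_slot(uint32_t h, uint32_t nslots)
{
    return (h >> 8) % nslots;
}
```

Because the hash has only 32 bits, distinct keys share hash values more often as the number of keys grows, and each such collision costs an extra probe and key comparison.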
| 55 | + |
| 56 | + |
| 57 | +References |
| 58 | +---------- |
| 59 | + Dan Bernstein's (djb) reference implementation of cdb (public domain) |
| 60 | + http://cr.yp.to/cdb.html |
| 61 | + |
| 62 | + |
| 63 | + |
| 64 | +There is plenty more information which will eventually (hopefully) be here. |
| 65 | +In the meantime, here are some snippets, and more will come after the |
| 66 | +initial release, as time allows. |
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | +Technical Asides |
| 71 | +---------------- |
| 72 | + |
| 73 | + |
| 74 | +mcdbctl creates databases with limited permissions (0400) |
| 75 | +--------------------------------------------------------- |
| 76 | +mcdbctl creates new databases with limited permissions (0400), important for |
| 77 | +security when creating mcdb for sensitive data, such as /etc/shadow data. |
| 78 | +mcdbctl recreates existing databases and preserves the permission modes that |
| 79 | +were applied to the previous mcdb (replaced by the recreated mcdb). |
| 80 | + |
| 81 | +mcdb performant design tidbits |
| 82 | +------------------------------ |
| 83 | +Data locality is one of the primary keys to performance: djb cdb's linearly |
| 84 | +probed open hash tables minimize arbitrary jumps around memory. |
| 85 | +Endian transformations are absorbed in the noise of memory load latency. |
| 86 | + |
| 87 | +An immediate performance boost can be seen when using mcdb for nsswitch |
| 88 | +databases on machines that run Apache with suexec, which was the reason behind |
| 89 | +using cdb, and now mcdb, for passwd and group databases. See INSTALL file. |
| 90 | + |
| 91 | +smallest mcdb |
| 92 | +------------- |
| 93 | +Empty mcdb with no keys is 4K |
| 94 | +$ echo | ./mcdbctl make empty.mcdb - |
| 95 | + |
| 96 | +mcdb supports empty keys and values |
| 97 | +----------------------------------- |
| 98 | +$ echo -e "+0,0:->\n" | ./mcdbctl make empty.mcdb - |
| 99 | +In practice, empty values can be useful when only the existence of a key |
| 100 | +matters. However, more than one empty key is not generically useful |
| 101 | +(or a use case is not immediately apparent to me). |
| 102 | + |
| 103 | +mcdb supports one-char tag prefix on keys |
| 104 | +----------------------------------------- |
| 105 | +One method of storing multiple types of constant data in a single mcdb is |
| 106 | +to prefix the key with a single character that indicates the type of data in |
| 107 | +the key. This feature is used by nss_mcdb for various nsswitch.conf databases. |
| 108 | +For example, passwd databases can be queried by name or by uid. Both are stored |
| 109 | +in the same mcdb database, with a different character prefixing the keys for |
| 110 | +names versus the keys for uids. This prefix tag character feature of mcdb |
| 111 | +allows for tagged-key queries without the extra work of having to create a |
| 112 | +temporary buffer to set the tag followed by a copy of the key. |
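One way to see why no temporary buffer is needed: the djb hash consumes bytes one at a time, so the tag character can be fed in first and the key bytes afterwards, giving the same hash as hashing tag and key concatenated. The sketch below shows only that idea. The function names are hypothetical and this is not presented as mcdb's actual lookup code.

```c
#include <stddef.h>
#include <stdint.h>

/* Continue a djb-style hash from state h over len bytes of buf. */
static uint32_t hash_update(uint32_t h, const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    while (len--)
        h = ((h << 5) + h) ^ *p++;
    return h;
}

/* Hash "tag + key" without building the concatenated string. */
static uint32_t hash_tagged(unsigned char tag, const void *key, size_t klen)
{
    uint32_t h = hash_update(5381, &tag, 1); /* seed the hash with the tag */
    return hash_update(h, key, klen);        /* then hash the key in place */
}
```

A lookup would also have to compare the stored key in two parts, first the tag byte and then the remaining key bytes, for the same reason.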
| 113 | + |
| 114 | +mcdb creation speed |
| 115 | +------------------- |
| 116 | +mcdb creation supports the same input format as cdb. There is an overhead to |
| 117 | +formatting the input as ASCII numbers, as well as parsing and translating the |
| 118 | +ASCII back to native numbers. (The cost adds up with lots of keys.) Another |
| 119 | +method to create mcdb which is even faster is for a program to link with and |
| 120 | +call mcdb_make_* routines directly, as is done in nss_mcdbctl. |
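As an illustration of the ASCII formatting overhead, the sketch below writes one record in the cdb input format, "+klen,dlen:key->data" followed by a newline. The function name format_record is hypothetical. A complete input stream ends with one additional empty line after the last record.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "+klen,dlen:key->data\n" into out and
 * NUL-terminate it. Returns the record length (excluding the NUL),
 * or -1 if out is too small. */
static int format_record(char *out, size_t outsz,
                         const char *key, size_t klen,
                         const char *data, size_t dlen)
{
    int n = snprintf(out, outsz, "+%zu,%zu:", klen, dlen);
    size_t pos;
    if (n < 0)
        return -1;
    pos = (size_t)n;
    /* key + "->" + data + newline + terminating NUL must still fit */
    if (pos + klen + 2 + dlen + 1 + 1 > outsz)
        return -1;
    memcpy(out + pos, key, klen);   pos += klen;
    memcpy(out + pos, "->", 2);     pos += 2;
    memcpy(out + pos, data, dlen);  pos += dlen;
    out[pos++] = '\n';
    out[pos] = '\0';
    return (int)pos;
}
```

Every record pays for the integer-to-text conversion here and the text-to-integer parse on the reading side, which is the cost that calling the mcdb_make_* routines directly avoids.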
| 121 | + |
| 122 | +mcdb in memory only |
| 123 | +------------------- |
| 124 | +Creating an mcdb in memory without filesystem backing is possible. This might |
| 125 | +be useful for testing, but in practice, a hash function customized to the |
| 126 | +specific purpose at hand would be faster. Setting the fd to -1 in the call to |
| 127 | +mcdb_make_start() is how to tell mcdb routines the mcdb is not backed by the |
| 128 | +filesystem. Then, caller must create mmap large enough for all keys and data |
| 129 | +(filling in nkeys (num keys), total_klen (key len), and total_dlen (data len)). |
| 130 | + struct mcdb_make m; |
| 131 | + size_t msz; |
| 132 | + mcdb_make_start(&m, -1, malloc, free); |
| 133 | + /* preallocate mcdb mmap to proper full size; (msz+15) & ~15 for alignment */ |
| 134 | + msz = (MCDB_HEADER_SZ + nkeys*8 + total_klen + total_dlen + 15) & ~15; |
| 135 | + msz+= (msz < UINT_MAX ? nkeys*16 : nkeys*32); |
| 136 | + m.fsz = m.msz = (msz = (msz + ~m.pgalign) & m.pgalign); /*align to page size*/ |
| 137 | + m.map = (char *)mmap(0, msz, PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); |
| 138 | + if (m.map == MAP_FAILED) return false; |
| 139 | + /* ... loop and mcdb_make_add(), then mcdb_make_finish() */ |
| 140 | + /* ... when finished using the mmap, then munmap(m.map, m.msz) */ |
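The size arithmetic in the fragment above can be pulled into a standalone function for checking by hand. This is a sketch under stated assumptions: SKETCH_HEADER_SZ stands in for MCDB_HEADER_SZ and is assumed to be 4096 (matching the 4096-byte overhead and 4K empty mcdb noted elsewhere in this file), pagesz is assumed to be a power of two, and m.pgalign is assumed to be the mask ~(pagesz - 1). The function name is hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_HEADER_SZ 4096 /* assumption: stands in for MCDB_HEADER_SZ */

/* Hypothetical helper: bytes to mmap for an in-memory mcdb holding nkeys
 * records with total_klen key bytes and total_dlen data bytes. */
static size_t sketch_mcdb_mem_size(size_t nkeys, size_t total_klen,
                                   size_t total_dlen, size_t pagesz)
{
    /* header + 8 bytes of lengths per record + keys + data, 16-byte aligned */
    size_t msz = (SKETCH_HEADER_SZ + nkeys * 8 + total_klen + total_dlen + 15)
               & ~(size_t)15;
    /* hash table space per record: 16 bytes below 4 GB, 32 bytes otherwise */
    msz += (msz < UINT_MAX ? nkeys * 16 : nkeys * 32);
    /* round up to a whole number of pages */
    return (msz + pagesz - 1) & ~(pagesz - 1);
}
```

For one record with a 3-byte key and 5-byte data this gives 4096 + 8 + 8 = 4112 bytes before the hash tables, 4128 after, and 8192 once rounded up to 4096-byte pages.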
| 141 | + |
| 142 | +compiler intrinsics/builtins not (yet) tested on all platforms |
| 143 | +-------------------------------------------------------------- |
| 144 | +The compiler intrinsics/builtins in code_attributes.h have been tested on i686 |
| 145 | +and x86_64 platforms, but have not (yet) been tested on all other platforms. |
| 146 | +They were written from vendor documentation available on the internet, but I do |
| 147 | +not have private access to all those platforms in order to test. Please send |
| 148 | +me advice, suggestions, and corrections. |