There is a nuance in the behaviour of the umem port compared with umem on Solaris.

On Linux, umem will not return memory to the OS until one of its own
allocations fails; on that failure, umem_reap() is called automatically to
release memory back to the OS. If your code runs for a long time on Linux
and mixes umem with other memory allocators (e.g. malloc()), it will need
to call umem_reap() periodically.

This is not necessary on Solaris, where malloc() is replaced with umem
calls, so umem_reap() ends up being called automatically.
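
For example, a long-running Linux program that mixes libc and umem allocations
might reap on an iteration count or a timer. The sketch below uses only the
public interfaces mentioned above (umem_alloc(), umem_free(), umem_reap());
the 1000-iteration reap interval is an arbitrary placeholder.

      #include <stdlib.h>
      #include <string.h>
      #include <umem.h>

      /*
       * Long-running loop mixing malloc() with umem allocations.  Because the
       * Linux port only reaps when one of its own allocations fails, we reap
       * explicitly every so often.
       */
      void
      service_loop(void)
      {
          unsigned long iter = 0;

          for (;;) {
              void *a = umem_alloc(512, UMEM_DEFAULT);   /* umem buffer */
              void *b = malloc(4096);                    /* libc buffer */

              if (a != NULL)
                  memset(a, 0, 512);
              if (b != NULL)
                  memset(b, 0, 4096);

              if (a != NULL)
                  umem_free(a, 512);
              free(b);

              if (++iter % 1000 == 0)
                  umem_reap();    /* hand cached memory back to the OS */
          }
      }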





 based on usr/src/uts/common/os/kmem.c r1.64 from 2001/12/18

 The slab allocator, as described in the following two papers:

      Jeff Bonwick,
      The Slab Allocator: An Object-Caching Kernel Memory Allocator.
      Proceedings of the Summer 1994 Usenix Conference.
      Available as /shared/sac/PSARC/1994/028/materials/kmem.pdf.

      Jeff Bonwick and Jonathan Adams,
      Magazines and vmem: Extending the Slab Allocator to Many CPUs and
      Arbitrary Resources.
      Proceedings of the 2001 Usenix Conference.
      Available as /shared/sac/PSARC/2000/550/materials/vmem.pdf.

 1. Overview
 umem is very close to kmem in implementation.  There are four major
 areas of divergence:

      * Initialization

      * CPU handling

      * umem_update()

      * KM_SLEEP v.s. UMEM_NOFAIL

 2. Initialization
 kmem is initialized early on in boot, and knows that no one will call
 into it before it is ready.  umem does not have these luxuries. Instead,
 initialization is divided into two phases:

      * library initialization, and

      * first use

 umem's full initialization happens at the time of the first allocation
 request (via malloc() and friends, umem_alloc(), or umem_zalloc()),
 or the first call to umem_cache_create().

 umem_free() and umem_cache_alloc() do not require special handling,
 since the only way to get valid arguments for them is to successfully
 call a function from the first group.
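
 As an illustration (not part of the original theory statement), the sketch
 below shows why this holds: the buffer handed to umem_free() and the cache
 handed to umem_cache_alloc() can only come from a successful call to one of
 the initializing functions.  The cache name and foo_t structure are invented
 for the example.

      #include <umem.h>

      typedef struct foo { int f_value; } foo_t;

      int
      example(void)
      {
          /* "first group": each of these triggers umem_init() on first use */
          void *raw = umem_zalloc(128, UMEM_DEFAULT);
          if (raw != NULL)
              umem_free(raw, 128);      /* argument came from umem_zalloc() */

          umem_cache_t *cp = umem_cache_create("foo_cache", sizeof (foo_t),
              0, NULL, NULL, NULL, NULL, NULL, 0);
          if (cp == NULL)
              return (-1);

          /* "second group": only reachable with a cache from umem_cache_create() */
          foo_t *fp = umem_cache_alloc(cp, UMEM_DEFAULT);
          if (fp != NULL)
              umem_cache_free(cp, fp);

          umem_cache_destroy(cp);
          return (0);
      }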

 2.1. Library Initialization: umem_startup()
 umem_startup() is libumem.so's .init section.  It calls pthread_atfork()
 to install the handlers necessary for umem's Fork1-Safety.  Because of
 race condition issues, all other pre-umem_init() initialization is done
 statically (i.e. by the dynamic linker).

 For standalone use, umem_startup() returns everything to its initial state.

 2.2. First use: umem_init()
 The first time any memory allocation function is used, we have to
 create the backing caches and vmem arenas which are needed for it.
 umem_init() is the central point for that task.  When it completes,
 umem_ready is either UMEM_READY (all set) or UMEM_READY_INIT_FAILED (unable
 to initialize, probably due to lack of memory).

 There are four different paths from which umem_init() is called:

      * from umem_alloc() or umem_zalloc(), with 0 < size < UMEM_MAXBUF,

      * from umem_alloc() or umem_zalloc(), with size > UMEM_MAXBUF,

      * from umem_cache_create(), and

      * from memalign(), with align > UMEM_ALIGN.

 The last three just check if umem is initialized, and call umem_init()
 if it is not.  For performance reasons, the first case is more complicated.

 2.2.1. umem_alloc()/umem_zalloc(), with 0 < size < UMEM_MAXBUF
 In this case, umem_cache_alloc(&umem_null_cache, ...) is called.
 There is special case code which causes any allocation on
 &umem_null_cache to fail by returning (NULL), regardless of the
 flags argument.

 So umem_cache_alloc() returns NULL, and umem_alloc()/umem_zalloc() call
 umem_alloc_retry().  umem_alloc_retry() sees that the allocation
 was against &umem_null_cache, and calls umem_init().

 If initialization is successful, umem_alloc_retry() returns 1, which
 causes umem_alloc()/umem_zalloc() to start over, which causes it to load
 the (now valid) cache pointer from umem_alloc_table.
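
 The bootstrap described above can be modelled in miniature as follows.  This
 is a self-contained sketch, not the real umem code: null_cache, alloc_table,
 cache_alloc() and alloc_retry() are local stand-ins for &umem_null_cache,
 umem_alloc_table, umem_cache_alloc() and umem_alloc_retry(), and they only
 mimic the fail-then-initialize-then-retry behaviour.

      #include <stddef.h>
      #include <stdlib.h>

      /* Stand-ins for umem_cache_t, &umem_null_cache and umem_alloc_table. */
      typedef struct cache { int c_ready; } cache_t;

      static cache_t null_cache = { 0 };
      static cache_t real_cache = { 1 };
      static cache_t *alloc_table = &null_cache;    /* pre-init state */

      static void *
      cache_alloc(cache_t *cp, size_t size)
      {
          if (!cp->c_ready)
              return (NULL);              /* the null cache always fails */
          return (malloc(size));
      }

      /* Plays the role of umem_alloc_retry(): "init", then ask caller to retry. */
      static int
      alloc_retry(cache_t *cp)
      {
          if (cp != &null_cache)
              return (0);                 /* a real failure; give up */
          alloc_table = &real_cache;      /* "umem_init()" succeeded */
          return (1);
      }

      void *
      model_alloc(size_t size)
      {
          void *buf;

          do {
              buf = cache_alloc(alloc_table, size);
          } while (buf == NULL && alloc_retry(alloc_table));

          return (buf);
      }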

 2.2.2. Dealing with race conditions
 There are a couple race conditions resulting from the initialization
 code that we have to guard against:

      * In umem_cache_create(), there is a special UMC_INTERNAL cflag
      that is passed for caches created during initialization.  It
      is illegal for a user to try to create a UMC_INTERNAL cache.
      This allows initialization to proceed, but any other
      umem_cache_create()s will block by calling umem_init().

      * Since umem_null_cache has a 1-element cache_cpu, its cache_cpu_mask
      is always zero.  umem_cache_alloc uses cp->cache_cpu_mask to
      mask the cpu number.  This prevents a race between grabbing a
      cache pointer out of umem_alloc_table and growing the cpu array.

 3. CPU handling
 kmem uses the CPU's sequence number to determine which "cpu cache" to
 use for an allocation.  Currently, there is no way to get the sequence
 number in userspace.

 umem keeps track of cpu information in umem_cpus, an array of umem_max_ncpus
 umem_cpu_t structures.  CURCPU() is a "hint" function, which we then mask
 with either umem_cpu_mask or cp->cache_cpu_mask to find the actual "cpu" id.
 The mechanics of this are all in the CPU(mask) macro.

 Currently, umem uses _lwp_self() as its hint.
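
 A minimal model of the hint-and-mask scheme (a sketch, not the real CPU(mask)
 macro): the per-"cpu" array is sized to a power of two, so any hint, however
 stale, can be masked down to a valid index.  The names below are invented;
 the hint is simply passed in rather than taken from _lwp_self().

      #include <stddef.h>

      #define NCPUS_POW2  8                   /* must be a power of two */

      typedef struct model_cpu {
          int mc_id;
          /* per-"cpu" magazine state would live here */
      } model_cpu_t;

      static model_cpu_t model_cpus[NCPUS_POW2];
      static const unsigned int model_cpu_mask = NCPUS_POW2 - 1;

      /*
       * The hint only needs to be cheap and roughly thread-local; masking
       * keeps it in range even when it exceeds the number of slots.
       */
      static model_cpu_t *
      model_curcpu(unsigned int hint)
      {
          return (&model_cpus[hint & model_cpu_mask]);
      }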

 4. The update thread
 kmem uses a task queue, kmem_taskq, to do periodic maintenance on
 every kmem cache.  vmem has a periodic timeout for hash table resizing.
 The kmem_taskq also provides a separate context for kmem_cache_reap()'s
 to be done in, avoiding issues of the context of kmem_reap() callers.

 Instead, umem has the concept of "updates", which are asynchronous requests
 for work attached to single caches.  All caches with pending work are
 on a doubly linked list rooted at the umem_null_cache.  All update state
 is protected by the umem_update_lock mutex, and the umem_update_cv is used
 for notification between threads.

 4.1. Cache states with regards to updates
 A given cache is in one of three states:

 Inactive         cache_uflags is zero, cache_u{next,prev} are NULL

 Work Requested   cache_uflags is non-zero (but UMU_ACTIVE is not set),
                  cache_u{next,prev} link the cache onto the global
                  update list

 Active           cache_uflags has UMU_ACTIVE set, cache_u{next,prev}
                  are NULL, and either umem_update_thr or
                  umem_st_update_thr are actively doing work on the cache.

 An update can be added to any cache in any state -- if the cache is
 Inactive, it transitions to being Work Requested.  If the cache is
 Active, the worker will notice the new update and act on it before
 transitioning the cache to the Inactive state.

 If a cache is in the Active state, UMU_NOTIFY can be set, which asks
 the worker to broadcast the umem_update_cv when it has finished.

 4.2. Update interface
 umem_add_update() adds an update to a particular cache.
 umem_updateall() adds an update to all caches.
 umem_remove_updates() returns a cache to the Inactive state.

 umem_process_updates() processes all caches in the Work Requested state.
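
 The "doubly linked list rooted at the umem_null_cache" can be pictured with a
 small self-contained model: a sentinel node whose links form the Work
 Requested list, plus per-cache update flags.  Everything below (mcache_t,
 MU_ACTIVE, model_add_update(), model_remove_updates()) is invented to
 illustrate the state transitions; it is not the real umem code, and the
 locking done under umem_update_lock is omitted.

      #include <stddef.h>

      #define MU_ACTIVE   0x1         /* stand-in for UMU_ACTIVE */

      typedef struct mcache {
          unsigned int   mc_uflags;             /* pending work bits */
          struct mcache *mc_unext, *mc_uprev;   /* Work Requested list links */
      } mcache_t;

      /* Sentinel playing the role of umem_null_cache as the list root. */
      static mcache_t update_root = { 0, &update_root, &update_root };

      /* Inactive -> Work Requested (or just add flags if already queued/active). */
      static void
      model_add_update(mcache_t *cp, unsigned int flags)
      {
          cp->mc_uflags |= flags;
          if (cp->mc_unext == NULL && !(cp->mc_uflags & MU_ACTIVE)) {
              cp->mc_unext = &update_root;
              cp->mc_uprev = update_root.mc_uprev;
              update_root.mc_uprev->mc_unext = cp;
              update_root.mc_uprev = cp;
          }
      }

      /* Work Requested (or Active) -> Inactive. */
      static void
      model_remove_updates(mcache_t *cp)
      {
          if (cp->mc_unext != NULL) {
              cp->mc_uprev->mc_unext = cp->mc_unext;
              cp->mc_unext->mc_uprev = cp->mc_uprev;
              cp->mc_unext = cp->mc_uprev = NULL;
          }
          cp->mc_uflags = 0;
      }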

 4.3. Reaping
 When umem_reap() is called (at the time of heap growth), it schedules
 UMU_REAP updates on every cache.  It then checks to see if the update
 thread exists (umem_update_thr != 0).  If it does, it broadcasts
 the umem_update_cv to wake the update thread up, and returns.

 If the update thread does not exist (umem_update_thr == 0), and the
 program currently has multiple threads, umem_reap() attempts to create
 a new update thread.

 If the process is not multithreaded, or the creation fails, umem_reap()
 calls umem_st_update() to do an inline update.

 4.4. The update thread
 The update thread spends most of its time in cond_timedwait() on the
 umem_update_cv.  It wakes up under two conditions:

      * The timedwait times out, in which case it needs to run a global
      update, or

      * someone cond_broadcast(3THR)s the umem_update_cv, in which case
      it needs to check if there are any caches in the Work Requested
      state.

 When it is time for another global update, umem calls umem_cache_update()
 on every cache, then calls vmem_update(), which tunes the vmem structures.
 umem_cache_update() can request further work using umem_add_update().

 After any work from the global update completes, the update timer is
 reset to umem_reap_interval seconds in the future.  This makes the
 updates self-throttling.

 Reaps are similarly self-throttling.  After a UMU_REAP update has
 been scheduled on all caches, umem_reap() sets a flag and wakes up the
 update thread.  The update thread notices the flag, and resets the
 reap state.
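
 The wake-up protocol amounts to the self-contained sketch below: a worker
 sits in pthread_cond_timedwait() (the Linux analogue of cond_timedwait()) and
 either times out, triggering a global update, or is woken because work or
 reaps were scheduled.  The flags, names and 15-second interval are
 placeholders for the real update state, and the cache/vmem work itself is
 elided.

      #include <errno.h>
      #include <pthread.h>
      #include <time.h>

      static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  update_cv   = PTHREAD_COND_INITIALIZER;
      static int work_requested;      /* caches on the Work Requested list */
      static int reap_requested;      /* a reap has been scheduled */
      static unsigned int reap_interval = 15;     /* seconds; arbitrary */

      static void *
      model_update_thread(void *arg)
      {
          (void) arg;
          (void) pthread_mutex_lock(&update_lock);
          for (;;) {
              struct timespec deadline;
              int rc = 0;

              clock_gettime(CLOCK_REALTIME, &deadline);
              deadline.tv_sec += reap_interval;

              while (!work_requested && !reap_requested && rc != ETIMEDOUT)
                  rc = pthread_cond_timedwait(&update_cv, &update_lock,
                      &deadline);

              if (rc == ETIMEDOUT) {
                  /* global update: umem_cache_update() on every cache,
                   * then vmem_update() (both elided in this model) */
              }
              /* process Work Requested caches, reset reap state, and loop;
               * the real code drops the lock while touching the caches */
              work_requested = 0;
              reap_requested = 0;
          }
          /* NOTREACHED */
          return (NULL);
      }

      /* What umem_reap() does once the worker exists: schedule, then wake it. */
      static void
      model_reap(void)
      {
          (void) pthread_mutex_lock(&update_lock);
          reap_requested = 1;
          (void) pthread_cond_broadcast(&update_cv);
          (void) pthread_mutex_unlock(&update_lock);
      }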

 4.5. Inline updates
 If the update thread is not running, umem_st_update() is used instead.  It
 immediately does a global update (as above), then calls
 umem_process_updates() to process both the reaps that umem_reap() added and
 any work generated by the global update.  Afterwards, it resets the reap
 state.

 While the umem_st_update() is running, umem_st_update_thr holds the thread
 id of the thread performing the update.

 4.6. Updates and fork1()
 umem has fork1() pre- and post-handlers which lock up (and release) every
 mutex in every cache.  They also lock up the umem_update_lock.  Since
 fork1() only copies over a single lwp, other threads (including the update
 thread) could have been actively using a cache in the parent.  This
 can lead to inconsistencies in the child process.

 Because we locked all of the mutexes, the only possible inconsistencies are:

      * a umem_cache_alloc() could leak its buffer.

      * a caller of umem_depot_alloc() could leak a magazine, and all the
      buffers contained in it.

      * a cache could be in the Active update state.  In the child, there
      would be no thread actually working on it.

      * a umem_hash_rescale() could leak the new hash table.

      * a umem_magazine_resize() could be in progress.

      * a umem_reap() could be in progress.

 The memory leaks we can't do anything about.  umem_release_child() resets
 the update state and moves any caches in the Active state to the Work
 Requested state.  This might cause some updates to be re-run, but UMU_REAP
 and UMU_HASH_RESCALE are effectively idempotent, and the worst that can
 happen from umem_magazine_resize() is resizing the magazine twice in close
 succession.

 Much of the cleanup in umem_release_child() is skipped if
 umem_st_update_thr == thr_self().  This is so that applications which call
 fork1() from a cache callback do not break.  Needless to say, any such
 application is tremendously broken.
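
 The Fork1-Safety handlers registered by umem_startup() (section 2.1) follow
 the standard pthread_atfork() pattern sketched below.  This is a generic
 illustration, not umem's actual handler code: a single lock stands in for
 "every mutex in every cache" plus the umem_update_lock, and the child-side
 handler stands in for umem_release_child()'s extra cleanup.

      #include <pthread.h>

      static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;

      /* prepare: take every lock so no allocator state is mid-update at fork */
      static void
      model_lockup(void)
      {
          (void) pthread_mutex_lock(&model_lock);
      }

      /* parent: simply drop the locks again */
      static void
      model_release(void)
      {
          (void) pthread_mutex_unlock(&model_lock);
      }

      /*
       * child: drop the locks and repair state (the real umem_release_child()
       * also moves Active caches back to Work Requested, as described above)
       */
      static void
      model_release_child(void)
      {
          (void) pthread_mutex_unlock(&model_lock);
      }

      /* registered once from library initialization (cf. umem_startup()) */
      static void
      model_install_fork_handlers(void)
      {
          (void) pthread_atfork(model_lockup, model_release,
              model_release_child);
      }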

 5. KM_SLEEP v.s. UMEM_NOFAIL
 Allocations against kmem and vmem have two basic modes:  SLEEP and
 NOSLEEP.  A sleeping allocation will go to sleep (waiting for
 more memory) instead of failing (returning NULL).

 SLEEP allocations presume an extremely multithreaded model, with
 a lot of allocation and deallocation activity.  umem cannot presume
 that its clients have any particular type of behavior.  Instead,
 it provides two types of allocations:

      * UMEM_DEFAULT, equivalent to KM_NOSLEEP (i.e. return NULL on
      failure)

      * UMEM_NOFAIL, which, on failure, calls an optional callback
      (registered with umem_nofail_callback()).

 The callback is invoked with no locks held, and can do an arbitrary
 amount of work.  It then has a choice between:

      * Returning UMEM_CALLBACK_RETRY, which will cause the allocation
      to be restarted.

      * Returning UMEM_CALLBACK_EXIT(status), which will cause exit(2)
      to be invoked with status.  If multiple threads attempt to do
      this simultaneously, only one will call exit(2).

      * Doing some kind of non-local exit (thr_exit(3thr), longjmp(3C),
      etc.)

 The default callback returns UMEM_CALLBACK_EXIT(255).
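
 For example, a client might register a callback that retries a few times and
 then gives up.  The umem_nofail_callback() interface and the UMEM_CALLBACK_*
 return values are the ones described above; the retry limit, message, and
 function names are made up for the sketch.

      #include <stdio.h>
      #include <umem.h>

      /* Called with no locks held whenever a UMEM_NOFAIL allocation fails. */
      static int
      low_memory_callback(void)
      {
          static int attempts;

          if (attempts++ < 3) {
              fprintf(stderr, "umem: low memory, retrying\n");
              return (UMEM_CALLBACK_RETRY);   /* restart the allocation */
          }
          return (UMEM_CALLBACK_EXIT(255));   /* same status as the default */
      }

      void
      install_nofail_handler(void)
      {
          umem_nofail_callback(low_memory_callback);
      }

      /* later: an allocation that never returns NULL to its caller */
      /* buf = umem_alloc(size, UMEM_NOFAIL); */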

 To have these callbacks without risk of state corruption (in the case of
 a non-local exit), we have to ensure that the callbacks get invoked
 close to the original allocation, with no inconsistent state or held
 locks.  The following steps are taken:

      * All invocations of vmem are VM_NOSLEEP.

      * All constructor callbacks (which can themselves do allocations)
      are passed UMEM_DEFAULT as their required allocation argument.  This
      way, the constructor will fail, allowing the highest-level allocation
      to invoke the nofail callback.

      If a constructor callback _does_ do a UMEM_NOFAIL allocation, and
      the nofail callback does a non-local exit, we will leak the
      partially-constructed buffer.
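
 The constructor contract this relies on is sketched below: a constructor
 receives the allocation flags it must use for its own allocations and reports
 failure by returning non-zero, so a failed UMEM_DEFAULT sub-allocation
 propagates up to the top-level allocation (where any UMEM_NOFAIL handling
 then runs).  The node_t structure and cache are invented for the example.

      #include <umem.h>

      typedef struct node {
          void   *n_payload;
          size_t  n_size;
      } node_t;

      /* Constructor: use the flags we are given; return non-zero on failure. */
      static int
      node_constructor(void *buf, void *arg, int flags)
      {
          node_t *np = buf;

          (void) arg;
          np->n_size = 64;
          /* umem passes UMEM_DEFAULT here, per the rule described above */
          np->n_payload = umem_alloc(np->n_size, flags);
          return (np->n_payload == NULL ? -1 : 0);
      }

      static void
      node_destructor(void *buf, void *arg)
      {
          node_t *np = buf;

          (void) arg;
          umem_free(np->n_payload, np->n_size);
      }

      static umem_cache_t *
      make_node_cache(void)
      {
          return (umem_cache_create("node_cache", sizeof (node_t), 0,
              node_constructor, node_destructor, NULL, NULL, NULL, 0));
      }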
