Idea Factory: Moved to https://znvy.substack.com

Friday, September 30, 2011

Cacheismo Memory Allocation

Cacheismo uses two different kind of memory allocators. One is called chunkpool which is used to store the items and another is called fallocator which is used for request processing. Both have different design goals. chunkpool is designed to be reasonably fast but very compact. It tries to minimize memory wastage. fallocator on the other hand doesn't care about memory wastage but tries to be very fast. Here are the details.

fallocator
fallocator is used when we are allocating memory temporarily and it is certain that it will be freed very soon. For example when reading data from socket and parsing the request, many small strings are generated but these are released as soon as the request is over, which could be a matter of micro seconds. Cacheismo creates a fallocator per connection. fallocator not only allocates memory but also keeps track of memory. Thus if because of bad code we miss freeing some memory, it will be reclaimed when the fallocator is destroyed, essentially when the connection is closed.

fallocator internally allocates a big buffer (4KB) and keeps track of how much of the buffer is used. Malloc simply returns buffer+used at the pointer and increments used by size of malloc. If the current buffer cannot handle the request, new buffer is allocated and memory given from new buffer. Used buffers are kept in a linked list. On free, fallocator decreases the refcount on the buffer which was used to allocate that memory. If the refcount becomes zero, the buffer is freed from the used buffers list. The benefit of this refcount based memory management is that it keeps memory usage to minimum. So if you are doing some malloc/frees in a loop, they will not exhaust the memory because refcount mechanism will ensure that they are released. If the malloc size is bigger than 4KB, fallocator defaults to malloc but keeps track of the buffer.

Moreover the 4KB buffer are cached to avoid going to malloc every time. This scheme of things give us O(1) malloc and O(1) free in normal case (when not going to system malloc). It avoids fragmentation of the memory but could result in more memory being used than required at any point in time. It also gives us some protection against memory leaks.

chunkpool
chunkpool is used for storing the cacheitems and the goal here is to maximize memory utilization. chunkpool imposes two constrains on the application. One, it never allocates more than 4KB of contiguous memory. That is, if the application needs 8KB of memory, it can at max only get 2 4KB blocks and not a single 8KB block. Applications needs to be designed around this limitation. The second one is not a constraint, but an advice. Apart from malloc, chunkpool exposes another function called relaxedMalloc which can return memory smaller than what application asked for. Thus if application asks for 4KB of memory, it might return 256 bytes or 1KB if it doesn't have a 4KB block. If the returned memory is too small, application can free it and deal with out of memory condition or if possible use large number of small buffers instead of small number of large buffers.

Because of these restrictions, fragmentation of memory doesn't makes the system unusable, but only adds more memory management overhead. Internally we have slabs of size 16 + n16 upto 4096 bytes. Initially all memory is allocated to the 4KB slab. If applications asks for say 1KB of memory we split the 4Kb buffer into two buffer on of 3Kb and another of 1Kb. The 1 KB buffer goes to the application and the 3KB is stored in the 3Kb slab. Cacheismo chunkpool uses skiplist to keep track of non empty slabs. At runtime if we found that the slab of asked malloc size is empty, it quickly find the next not empty slab, splits the buffer and returns memory to the application. On free, chunkpool tries to merge the free buffers to create bigger buffers. One limitation that I will fix later is that chunkpool memory uses the first four bytes for memory management. The implication is that any free buffer can only merge with logical next buffer and not logical previous buffer. This would require storing buffer size also at the end of the buffer and not just at the start. To circumvent this problem, chunkpool requires an extra GC call where it tries to merge 8MB of pages at a time. This is called once every second. This is triggered only if at least 12% of memory is free and average chunk size is less than 256 bytes other wise it is a no-op. Even though chunkpool is capable of addressing upto 64GB of memory, it is advisable not to go beyond say 8GB.

chunkpool is a very efficient allocation manager for cache data, when compared with slab allocator. It saves user the trouble of first thinking about what slab sizes should be used and maintains high HIT Rate when object sizes are changed because of changes in application code. Cacheismo is not faster than memcached when comparing raw TPS, but when you take HIT Rate into account, it saves you lots of db accesses and increases overall application performance. You can download it from here.

Wednesday, September 28, 2011

Cacheismo Cluster Update

I have started work on cacheismo cluster. I should have something working in next couple of days. The interface for lua scripts to get values from the cluster would need small change. Instead of only specifying the key, the caller will need to specify both dataKey as well as the virtualKey when looking up for a value. Cacheismo will use the dataKey to decide the server to which the virtualKey would be sent so that we don't end up in a endless loop. For dataKey's lookup is anyways simpler.

The cacheismo cluster specification (set of servers in the cluster) would be exposed via special virtual keys, so that it can be updated just like other objects. Each of the servers will monitor all other servers in the cluster and update its consistent hashing table so that requests are routed appropriately. This places a restriction on the number of servers we can add in a cluster. Maybe at some point what we will need is neste or multi-level consistent hashing where a special node will stand for multiple other nodes, something like a router which connects one network with another.

UPDATE: Cacheismo Cluster Update 2

Monday, September 26, 2011

Cacheismo Next Steps

Most important limitation of current approach is that cacheismo cannot be used in clustered mode. The reason being virtual key hashing might choose a server different from where the actual key is stored.
hash("set:new:myKey") is not equal to hash("set$myKey") is not equal to hash("set:count:myKey")

I did some experiments with lua threads and they look pretty cheap to create. The current plan is to make cacheismo nodes aware of each other and add client side support to cacheismo server. Instead of just looking up the key in local server memory, cacheismo will also do the hashing at server side on the actual key and fetch the results from the responsible cacheismo server. This would double the latency but ensure that any virtual key can be thrown at any server and it will fetch the results.

For example consider something like "set:intersection:mySet1:mySet2:mySet3".
Lets assume that mySet1, mySet2 and mySet3 are stored on different servers.
Trivial implementation would be to get mySet1, mySet2 and mySet3 to server executing the request and do the processing. Another possibility is to get sizes of mySet1, mySet2 and mySet3 and execute two parallel "set:intersection:smallest:other1" and "set:intersection:smallest:other2", followed by local intersection over the result.

Since cacheismo is scriptable, it is always possible to change this logic. I will hide the non-blocking details using lua co-routines so writing such objects will not require "knowing" the asynchronous IO aspects, just some friendly function call like getValueForKey.

It might make it possible to do some in memory map-reduce with recursive calls for virtual keys running parallel stacks spread over the network, executing at the same time.

UPDATE: Cacheismo Cluster Update

Saturday, September 24, 2011

Introducing Cacheismo



Cacheismo is scriptable object cache which can be used as a replacement for memcached. It supports memcache protocol (tcp and ascii only). It   automatically manages slab sizes, no user configuration is required.    Slabs are intelligently managed and they reconfigure themselves as per  the usage. PUT never ever fails. Details follow:     
- Efficient use of memory 

  Slab allocator used in memcached would waste memory if slab sizes
  and object sizes are not in sync. For example when using slab sizes
  of 64/128/256/512/1024 bytes object with size 513 is stored in 1024
  bytes slab, wasting almost half the memory. Cacheismo tries to 
  maintain memory used by object as close as possible to the object size

- Reclaimable Memory 

  Every byte is reclaimable in cacheismo. PUT never fails. You will
  never have slab full issues. More on memory management.

- True LRU support. Dependable cache

  Memcached will evict items from the slab in which object will be        placed. Cacheismo  will always use the LRU, irrespective of the size.   This ensures that the most  important data is always in the cache. No   surprises. If something is not in the cache, it either expired or was   least used.

- Server Side Scriptable objects

  The best part of using cacheismo is that is it fully scriptable
  in lua. Instead of cluttering your code with workarounds for 
  memcached read/write semantics, you can move that logic inside 
  cacheismo. 
  See Use cases for cacheismo 
      Rate Limiting Quota With Cacheismo
      Using Cacheismo Cluster For Real-time Analytics

  Sample objects map, set, quota and sliding window counters
  written in lua can be found in scripts directory. You could create 
  your own objects and place them in the scripts directory. 
  
  The interface for accessing scriptable objects is implemented via
  memcached get requests. For example: 
  get set:new:mykey     
  would create a new set object refered via myKey
  get set:put:myKey:a   
  would put key a in the set myKey
  get set:count:myKey   
  would return number of elements in the set
  get set:union:myKey1:myKey2   
  would return union of sets myKey1 and myKey2
  
  See scripts/set.lua for other functions.  
  
  I call this approach virtual keys. Reason for using it ...
  You can use existing memcached libraries to access cacheismo. 
  
Building - cacheismo depends on libevent and lua (luagit can 
also be used). Complete implementation is in about 5000 lines 
of code and fairly organised. 



Blue line is (memcachedTPS-cacheismoTPS)/memcachedTPS. This           essentially measures how much faster is memcached compared to         cacheismo. As you can see at max it is 20% faster.

Black line is (cacheismoHitRATE-memcachedHitRATE)/cacheismoHitRATE.   This essentially means how much memory efficient is cacheismo compared to memcached. For the first half of the test case we don't see any   difference. But in the second half cacheismo is at max about 80%      efficient. The reason for this behavior is that memcached can't       reassign slabs once they are used whereas cacheismo continues to      adjust with changing object sizes. Over time memcached commits memory to some slabs and hence its hit rate decreases over time.

Get source from GitHub

Google Group

UPDATE: Cacheismo Next Steps

Friday, September 23, 2011

What to do when your code suddenly slows down by 60%?

before running the tests multiple times to confirm that this indeed is the case...
before looking into the recent changes...
before looking into the test code...
before checking if swap is being used...
before checking if logging is enabled...
before doubting the kernel upgrade...
before using strace -c to find out unusual system calls...
before using valgrind --tool=callgrind to profile the code...

The first thing to do is to check cat /proc/cpuinfo and check in cpu Mz is same as advertised cpu speed. Laptops usually run in power saving mode. Not sure what is wrong with 64 bit Ubuntu 11.04, running on Lenovo Thinkpad T410. Even with power plugged in, it decided to run at 933Mz instead of 2.53 Gz.
Thanks to google for this link.