TL;DR: a DEFLATE decompressor in 3K of RAM
For a Pebble app I've been writing, I need to send images from the phone to the watch and cache them in persistent storage on the watch. Since the persistent storage is very limited (and the Bluetooth connection is relatively slow) I need these images to be as small as possible, so my original plan was to use the PNG format and gbitmap_create_from_png_data(). However, I discovered that this function is not supported on the earlier firmware used by the Pebble Classic. Since PNGs are essentially DEFLATE-compressed bitmaps, my next approach was to compress the bitmap data manually. This meant that I needed a decompressor implementation (an "inflater") on the watch.
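For illustration, the compression side can be a single zlib call on the phone (or at build time). This is only a sketch of the idea rather than code from the app; compress2() is zlib's one-shot API, and the function and buffer names here are made up:

```c
#include <zlib.h>

/* Sketch: compress raw bitmap bytes before sending them to the watch.
 * Returns the compressed length, or 0 on failure. */
static unsigned long pack_bitmap(const unsigned char *bitmap,
                                 unsigned long bitmap_len,
                                 unsigned char *out, unsigned long out_cap)
{
    uLongf out_len = out_cap;
    /* Z_BEST_COMPRESSION trades CPU for the smallest output, which is
     * what we want for a slow link and tiny storage. */
    if (compress2(out, &out_len, bitmap, bitmap_len,
                  Z_BEST_COMPRESSION) != Z_OK)
        return 0;
    return out_len;
}
```

Note that compress2() emits zlib-wrapped DEFLATE rather than a raw stream; as mentioned at the end, handling the zlib wrapper on the watch costs a little extra code.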
The constraint
The major constraint for Pebble watchapps is memory. On the Pebble Classic, apps have 24K of RAM available for the compiled code (.text), global and static variables (.data and .bss), and the heap (malloc()). There is an additional 2K for the stack (local variables). The decompressor therefore needed both small code size and small variable usage. I discovered tinf, which seemed to fit the bill, and tried to get it working.
Initially, trying to decompress something simply crashed the app. It took some debug prints to determine that the code in tinf_uncompress() wasn't even being executed, and I realised that it was exceeding the 2K stack limit. I changed the TINF_DATA struct to be allocated on the heap to get past this. At this stage it was using 1.2K of .text, 1.4K of .bss, 1K of stack, and 1.2K of heap (total 4.8K). I set about optimising the implementation for memory usage.
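The change itself was small. Roughly, inside tinf_uncompress() (a sketch using tinf's names; the error handling and elided body are illustrative):

```c
#include <stdlib.h>
#include "tinf.h"

int tinf_uncompress(void *dest, unsigned int *destLen,
                    const void *source, unsigned int sourceLen)
{
    /* Was: TINF_DATA d;  -- about 1.2K, too large for the 2K stack. */
    TINF_DATA *d = malloc(sizeof(TINF_DATA));
    if (d == NULL) return TINF_DATA_ERROR;

    /* ... decompress using d instead of a stack-allocated struct ... */

    free(d);
    return TINF_OK;
}
```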
Huffman trees
Huffman coding is a method of representing frequently used symbols with fewer bits. It uses a tree (otherwise referred to as a dictionary) to convert symbols to bits and vice versa. DEFLATE can use Huffman coding in two modes: dynamic and fixed. In dynamic mode, the compressor constructs an optimal tree based on the data being compressed. This results in the smallest representation of the actual input data; however, it has to include the computed tree in the output in order for a decompressor to know how to decode the data. In some cases the space used to serialise the tree negates the improvement in the input representation. In that case the compressor can use fixed mode instead, where it uses a static tree defined by the DEFLATE spec. Since the decompressor knows what this static tree is, it doesn't need to be serialised in the output.
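Concretely, every DEFLATE block announces which mode it uses in a three-bit header, and the decompressor dispatches on it. A sketch, where read_bits() is a hypothetical helper that consumes bits LSB-first from the input:

```c
/* Sketch: the 3-bit header that starts every DEFLATE block (RFC 1951). */
unsigned int bfinal = read_bits(in, 1); /* 1 = last block in the stream */
unsigned int btype  = read_bits(in, 2); /* how this block is compressed */

switch (btype) {
case 0: /* stored: raw bytes follow, no compression */
    break;
case 1: /* fixed: static Huffman trees defined by the spec */
    break;
case 2: /* dynamic: serialised Huffman trees precede the data */
    break;
}
```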
The original tinf implementation builds this fixed tree in tinf_init() and caches it in global variables. Whenever it encounters a block using the fixed tree, it has the tree immediately available. This makes sense when you have memory to spare, but in this case we can make another tradeoff: instead, we can store the fixed tree in the same space used for the dynamic tree, and rebuild it every time it is needed. This saves 1.2K of .bss at the expense of some additional CPU usage.
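In code, the fixed-Huffman arm of the block dispatch ends up looking roughly like this (names follow tinf's source, but treat the details as illustrative):

```c
case 1:
    /* Was: res = tinf_inflate_block_data(d, &sltree, &sdtree);
     * where the static trees lived in 1.2K of .bss, built once by
     * tinf_init(). Instead, rebuild them into the per-call storage
     * normally used for dynamic trees, each time they're needed. */
    tinf_build_fixed_trees(&d->ltree, &d->dtree);
    res = tinf_inflate_block_data(d, &d->ltree, &d->dtree);
    break;
```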
The dynamic trees are themselves serialised using Huffman encoding (yo dawg). tinf_decode_trees() needs to first build the code tree used to deserialise the dynamic tree, which the original implementation loads into a local variable on the stack. There is an intermediate step between the code tree and the dynamic tree, however (the bit-length array), so we can build the code tree in the space reserved for the dynamic tree instead of using a new local variable. This saves 0.6K of stack.
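A sketch of the trick, again with tinf-style names for illustration; it works because the code tree's lifetime ends exactly where the dynamic trees' lifetimes begin:

```c
/* Was: TINF_TREE code_tree;  -- a ~0.6K stack local in the original.
 * Instead, build the code tree in d->ltree, which isn't needed yet. */
tinf_build_tree(&d->ltree, lengths, 19);

/* ... decode the hlit + hdist code lengths into `lengths`,
 *     reading them via the code tree in d->ltree ... */

/* Now overwrite the borrowed space with the real dynamic trees. */
tinf_build_tree(&d->ltree, lengths, hlit);
tinf_build_tree(&d->dtree, lengths + hlit, hdist);
```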
The result
With the stack saving I was able to move the heap allocation back to the stack. (Since the stack memory can't be used for anything else, it's effectively free; using it leaves the non-stack memory available for other things.) The end result is 1.2K of .text, 0.2K of .bss and 1.6K of stack (total 3.0K), with only 1.4K counting against the 24K limit. That stack usage is pretty tight though (trying to use app_log() inside tinf causes a crash) and depends on the caller itself using little stack. My modified implementation will therefore allocate 1.2K on the heap by default, unless you define TINF_NO_MALLOC. Using zlib or gzip adds 0.4K of .text. You can find the code on Bitbucket.
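For completeness, a sketch of calling it from a watchapp; the uncompressed size is assumed to be known up front (e.g. from the image's metadata), and IMAGE_SIZE is a made-up constant:

```c
#include <stdbool.h>
#include <stdint.h>
#include "tinf.h"

#define IMAGE_SIZE 1024 /* hypothetical: known uncompressed length */

/* Sketch: one-shot decompression of a cached image. */
static bool unpack_image(const uint8_t *src, unsigned int src_len,
                         uint8_t *dest)
{
    unsigned int dest_len = IMAGE_SIZE;
    return tinf_uncompress(dest, &dest_len, src, src_len) == TINF_OK;
}
```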