Embedding Blobs in Binaries

I was recently looking for a way to ship an image with an executable without referring to the image as an external file. Whether that is a good idea in general is debatable, but that's another story.

Since I found very little information on the subject and had to piece it together from various sources, I am concluding the task with this comprehensive write-up. The mechanisms described here apply to including any resource, not just images, in an executable C program.

Object Files

There are various ways to embed generic fixed data within an application. Let's start with a basic example that should be familiar:

const char* mystring = "hello world";

Every application binary is essentially a collection of sections stored in a file. On the modern PC architecture, a running application works with three major memory regions: stack, data and code; the binary supplies the code and the data. Without going into details, fixed data such as constant text strings resides in the read-only part of the data segment of the application binary, and the C compiler automatically creates a memory region for it.

When you compile the above line with gcc -c -o mytext.o mytext.c and inspect the object file with objdump -t mytext.o, you'll find the data sections in the object file:

00000000 l    d  .text        00000000 .text
00000000 l    d  .data        00000000 .data
00000000 l    d  .rodata      00000000 .rodata
00000000 g     O .data        00000004 mystring

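which basically says: "mystring" is a global (g) object (O) of 4 bytes that can be found at offset 0x00 of the .data section in the object file. Note that this entry describes the pointer variable itself; the string literal it points to is stored in .rodata. For a detailed description of all the values, see man objdump.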
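To make the difference between a pointer and the bytes it points to concrete, here is a minimal sketch that contrasts the two spellings. The names hello_ptr, hello_array and the two helper functions are made up for illustration and are not part of the example above.

```c
#include <stddef.h>
#include <string.h>

/* A pointer variable: the pointer itself is ordinary (writable) data;
 * the characters it points to are a string literal, which compilers
 * typically place in a read-only section. */
const char *hello_ptr = "hello world";

/* A constant array: the symbol names the bytes themselves, so the
 * whole object can be placed in read-only data. */
const char hello_array[] = "hello world";

/* Size of the array object: 11 characters plus the terminating NUL. */
size_t hello_array_size(void)
{
    return sizeof hello_array;
}

/* Both spellings give access to the same text. */
int hello_texts_match(void)
{
    return strcmp(hello_ptr, hello_array) == 0;
}
```

Compiling this with gcc -c and running objdump -t on the result should show two separate symbols, and on a typical GNU/Linux toolchain I would expect the array to be listed under .rodata rather than .data.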

C Include method

Now one solution would be to simply write the data in C and have the compiler take care of everything by using

const unsigned char myimage[] = "BIG BLOCK OF RESOURCE DATA ENCODED AS STRING";

To encode the data of a given file, the xxd tool that comes with the vim text editor can be used: xxd -i binary_file prints the contents of the given binary file in C include-file style, as a complete static array definition named after the input file:

# cat /tmp/example
Hello World
# xxd -i /tmp/example
unsigned char _tmp_example[] = {
  0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int _tmp_example_len = 12;

Using this variant produces portable C code and just works(TM). On the downside, the source representation takes at least five times the size of the original data ("0xNN," for every original data byte), and the source code needs to be regenerated every time the original data changes.
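As a rough sketch of how the generated array might be used, the following pastes the xxd output from above into a source file and adds a small helper that writes the bytes to a stream. The helper name write_embedded_example is hypothetical; in a real project the generated definitions would more likely live in their own header that is pulled in with #include.

```c
#include <stdio.h>

/* Output of `xxd -i /tmp/example`, pasted verbatim. */
unsigned char _tmp_example[] = {
  0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x0a
};
unsigned int _tmp_example_len = 12;

/* Hypothetical helper: write the embedded bytes to a stream.
 * Returns the number of bytes written. */
size_t write_embedded_example(FILE *out)
{
    return fwrite(_tmp_example, 1, _tmp_example_len, out);
}
```

Calling write_embedded_example(stdout) reproduces the original file contents; no file access is needed at runtime because the data is part of the binary.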

While this approach is useful in some cases, we can do better than that.

Binary linking

The GNU linker can create an object file with a custom .data section directly from any input file. The magic flags are -r, to make the output object file relocatable, and -b binary, to treat the input as raw binary data rather than as an object file.

# ld -r -b binary -o example.o example.jpg

The resulting object file example.o can be linked with any application by gcc itself, e.g. gcc -o myapp myapp.c example.o. The last missing link is accessing the data in example.o from the C code in myapp.c. Have a look at the symbol table of the object file and compare it with the above output for the mystring object.

# objdump -t example.o

SYMBOL TABLE:
00000000 l    d  .data  00000000 .data
00004fc0 g       .data  00000000 _binary_example_jpg_end
00004fc0 g       *ABS*  00000000 _binary_example_jpg_size
00000000 g       .data  00000000 _binary_example_jpg_start

The "[..] g .data [...] <name>" part should ring a bell. This data section can be referenced from the C code simply by using:

extern const unsigned char _binary_example_jpg_start[];

and the length can be calculated from

extern const unsigned char _binary_example_jpg_start[];
extern const unsigned char _binary_example_jpg_end[];
size_t len = _binary_example_jpg_end - _binary_example_jpg_start;

or simply by taking the address of _binary_example_jpg_size, an absolute (*ABS*) symbol whose value is the size itself.

The linker will resolve the extern reference when linking the application and the application will use the data just like a fixed blob in the source-code.
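Here is a minimal sketch of how myapp.c might consume the linked blob. It assumes the symbol names shown in the objdump output above. The helper functions and the HAVE_EXAMPLE_O macro are hypothetical names introduced only so that the file also compiles when example.o is not linked in.

```c
#include <stddef.h>

/* Length of a blob delimited by [start, end), as provided by the
 * _start/_end symbol pair that ld generates. */
size_t blob_length(const unsigned char *start, const unsigned char *end)
{
    return (size_t)(end - start);
}

/* Returns 1 if the blob begins with the JPEG start-of-image marker
 * (0xFF 0xD8), 0 otherwise. */
int blob_looks_like_jpeg(const unsigned char *start, const unsigned char *end)
{
    return blob_length(start, end) >= 2
        && start[0] == 0xFF && start[1] == 0xD8;
}

#ifdef HAVE_EXAMPLE_O /* hypothetical macro: define it when linking example.o */
extern const unsigned char _binary_example_jpg_start[];
extern const unsigned char _binary_example_jpg_end[];

/* Sanity-check the embedded image before handing it to a decoder. */
int embedded_example_is_jpeg(void)
{
    return blob_looks_like_jpeg(_binary_example_jpg_start,
                                _binary_example_jpg_end);
}
#endif
```

With the object file from above, this would be built with something like gcc -DHAVE_EXAMPLE_O -o myapp myapp.c example.o. Passing the start and end pointers to ordinary functions keeps the linker-specific symbol names in one place.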

So far so good.

Architecture dependent binary linking

Now for the tricky part: the GNU linker behaves differently depending on platform and architecture. The implementations of interest to me are GNU/Linux, OSX and mingw (cross-compiling Windows binaries on a GNU/Linux host).

The mingw cross-compiler behaves almost exactly like GNU ld on Linux, with one minor difference: the symbol name as referenced from C does not include the leading underscore: _binary_example_jpg_start vs. binary_example_jpg_start. Fine, there goes some of the elegance of the solution, but that case is easily handled with an #ifdef.

However, Mac/OSX is different. The ld shipped with Xcode comes from llvm version 2.7svn and does not support the -b input-format feature. Furthermore, universal executables on OSX may comprise binaries for various architectures, with the .data section format being different for each architecture. The alignment of the data may differ between 32-bit and 64-bit architectures, and the endianness may differ as well. Thus the data section needs to be created during compilation instead of at the linking stage.

On OSX, ld's binary linking feature has been moved into Apple's customized gcc and is available via the -sectcreate option:

gcc -sectcreate __DATA __example_jpg example.jpg -o myapp myapp.c

To create a universal build for Intel architectures, add -arch i386 -arch x86_64 to the above command line. objdump is a GNU tool and is not available on OSX; you can inspect the data section using otool -s __DATA __example_jpg /path/to/executable instead. See man otool for details.

Due to the nature of OSX binaries, referencing the data section in the C code is not possible with a simple extern unsigned char. The linker does not know which architecture will be used and cannot provide an address. The Mach-O binary format used by OSX needs to be inspected at runtime, when the architecture is known, and the relevant data mapped after the application has started. Apple provides an API for doing that, defined in the mach-o/getsect.h header file. If you have Xcode installed, you can read the documentation with man getsectbyname.

Resolving the section can only be done at runtime, after the data section has been relocated, by calling getsectbyname(). However, there is a trick to make this implicit: the meta-variable _section$ is recognized by the gcc compiler on OSX. It produces the same result as getsectbyname()->addr. Short of reading the actual code, information about OSX linker internals is not easy to come by. getsectbyname() actually opens the executable file and searches for the relevant data section while the application is running. _section$ may or may not already be resolved at link time for a given architecture 1).
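For illustration, here is a hedged sketch of a size lookup via getsectbyname() that also handles the case where the section is missing. The helper name embedded_section_size is hypothetical, and the non-Apple branch is only a stub so the file compiles elsewhere. As far as I can tell from the SDK headers, getsectbyname() returns a struct section_64 pointer on 64-bit builds and a struct section pointer on 32-bit builds, which is why the declaration is switched on __LP64__.

```c
#include <stddef.h>

#ifdef __APPLE__
#include <mach-o/getsect.h>
#endif

/* Hypothetical helper: size in bytes of a section of the running
 * executable, or 0 if the section does not exist. On non-Apple
 * platforms this stub always returns 0. */
size_t embedded_section_size(const char *segname, const char *sectname)
{
#ifdef __APPLE__
#ifdef __LP64__
    const struct section_64 *sect = getsectbyname(segname, sectname);
#else
    const struct section *sect = getsectbyname(segname, sectname);
#endif
    /* getsectbyname() returns NULL when the section is not found. */
    return sect ? (size_t)sect->size : 0;
#else
    (void)segname;
    (void)sectname;
    return 0;
#endif
}
```

For the -sectcreate example above, the call would be embedded_section_size("__DATA", "__example_jpg"). Checking for NULL avoids a crash when the binary was built without the section.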

Architecture Abstraction -- working solution

Long story short, one can use a macro abstraction that works cross-platform. LDVAR() and LDLEN() are defined to access the data and its size; EXTLD() is used for the actual external declaration of the symbol:

#ifdef __APPLE__
#include <mach-o/getsect.h>

#define EXTLD(NAME) \
  extern const unsigned char _section$__DATA__ ## NAME [];
#define LDVAR(NAME) _section$__DATA__ ## NAME
#define LDLEN(NAME) (getsectbyname("__DATA", "__" #NAME)->size)

#elif (defined __WIN32__)  /* mingw */

#define EXTLD(NAME) \
  extern const unsigned char binary_ ## NAME ## _start[]; \
  extern const unsigned char binary_ ## NAME ## _end[];
#define LDVAR(NAME) \
  binary_ ## NAME ## _start
#define LDLEN(NAME) \
  ((binary_ ## NAME ## _end) - (binary_ ## NAME ## _start))

#else /* gnu/linux ld */

#define EXTLD(NAME) \
  extern const unsigned char _binary_ ## NAME ## _start[]; \
  extern const unsigned char _binary_ ## NAME ## _end[];
#define LDVAR(NAME) \
  _binary_ ## NAME ## _start
#define LDLEN(NAME) \
  ((_binary_ ## NAME ## _end) - (_binary_ ## NAME ## _start))
#endif

Example usage in a C Program:

// define the external variable
EXTLD(example_jpg)

void some_function() {
  // access the data
  size_t length = LDLEN(example_jpg);
  const uint8_t *data = LDVAR(example_jpg);
}

As a final note, some care must be taken when choosing the variable identifier.

ld uses the filename to generate the symbol names. Characters in the filename that are not valid in a C identifier are transformed to underscores, e.g. ld -r -b binary -o example.o ../images/example.jpg will create symbols prefixed with _binary____images_example_jpg. The ../ as well as the slash and the dot are transformed to underscores.
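The transformation described above can be sketched in a few lines of C, which may be handy for predicting the NAME argument to pass to EXTLD(). The function ld_symbol_stem is a hypothetical helper, not part of ld, and it assumes that every character other than an ASCII letter or digit is replaced by an underscore.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the transformation described above:
 * copy `path` into `out`, replacing every character that is not a
 * letter or digit with '_'. Returns 0 on success, or -1 if `out`
 * (`outsize` bytes, including the terminating NUL) is too small. */
int ld_symbol_stem(const char *path, char *out, size_t outsize)
{
    size_t len = strlen(path);
    size_t i;

    if (len >= outsize)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = isalnum((unsigned char)path[i]) ? path[i] : '_';
    out[len] = '\0';
    return 0;
}
```

For the path ../images/example.jpg this yields ___images_example_jpg, which ld then wraps as _binary____images_example_jpg_start, _end and _size.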

This is not an issue on OSX, where the identifier is specified explicitly with the -sectcreate option. However, identifiers on OSX are limited to 16 characters.

So in order to use the above approach cross-platform, the path to the file passed to ld must be shorter than 16 characters, and the same identifier needs to be specified on the OSX compile command.

A complete project that uses this approach to include a JPEG image file and a JavaScript text file is harvid. It also outlines how to use a cross-platform Makefile for creating the object files and adding the relevant flags to the OSX gcc command.

1) Should you know more about this or have any pointers to documentation, please contact me.
 
wiki/embedding_resources_in_executables.txt · Last modified: 02.09.2014 20:43 by rgareus