Strange Corners of C: Entering The Twilight Zone of the C Compiler
Posted on Mon 10 February 2025 in Programming
C is a fascinating programming language, simple enough to learn in a few days, yet powerful enough to build the world's most complex systems. With its lightweight runtime, C runs everywhere, from microwave ovens to spacecraft and everything in between. Sometimes, I can’t help but wonder: Is the universe itself written in C?
For as long as I can remember, C has been my goto language (see what I did there?). I first learned C to write games for DOS (remember those far pointers?). Over the years, I have programmed in countless environments using C. It is the language I’d like to think I’ve mastered. Yet, time and again, I encounter some obscure trick or unexpected behavior that shifts my understanding of the language.
IOCCC
One of my favorite pastimes is browsing entries from the International Obfuscated C Code Contest (IOCCC). This competition has produced some of the strangest, most brilliant C programs ever written—programs that are deliberately hard to read, but often contain deep insights into the language.
Recently, I stumbled upon an entry from 1984 by Sjoerd Mullender: mullender.c.
At first glance, the code looks bizarre:
short main[] = {
277, 04735, -4129, 25, 0, 477, 1019, 0xbef, 0, 12800,
-113, 21119, 0x52d7, -1006, -7151, 0, 0x4bc, 020004,
14880, 10541, 2056, 04010, 4548, 3044, -6716, 0x9,
4407, 6, 5568, 1, -30460, 0, 0x9, 5570, 512, -30419,
0x7e82, 0760, 6, 0, 4, 02400, 15, 0, 4, 1280, 4, 0,
4, 0, 0, 0, 0x8, 0, 4, 0, ',', 0, 12, 0, 4, 0, '#',
0, 020, 0, 4, 0, 30, 0, 026, 0, 0x6176, 120, 25712,
'p', 072163, 'r', 29303, 29801, 'e'
};
I thought, "Hey, isn't main() supposed to be a function?" Apparently not. By the way, the code above will only work on a VAX-11 or a PDP-11, which I don't have access to. I have read that it prints ":-)" across the screen until it is forced to stop.
Understanding The Trick
Instead of reproducing it, let’s break down the underlying trick. This entry leverages the way compilers, assemblers, and linkers handle data vs. executable code in low-level architectures. Fun fact, the IOCCC later updated its rules to prohibit machine-dependent code after 1984—likely due to programs like this.
Like most compiled languages, C doesn’t directly produce an executable. Instead, it undergoes multiple stages:
Preprocessing → Compilation → Assembly → Linking
Each stage has a specific role:
- Preprocessing: Handles #include files, macros, and conditional compilation.
- Compilation: Translates C into assembly.
- Assembly: Converts assembly into machine code.
- Linking: Resolves symbols and produces an executable.
Inline Assembly
All basic stuff. However, compilers have evolved to add hooks into the assembler and linker. For example, gcc and clang allow inline assembly, enabling developers to bypass the compiler’s usual code generation. It's a way to tell the compiler, "Trust me, I know what I’m doing. Just insert this directly into the assembly output." For example, we can write a simple program that returns 42 to the shell as its exit status. FYI, I run Arch Linux on an x86-64 target, BTW.
static int foo(void) { asm volatile("mov $42, %%rax" ::: "rax", "memory"); }
int main() { return foo(); }
This moves 42 directly into the return register (rax on x86-64).
If we want to inspect the assembly output, we can instruct the compiler to stop after the compilation stage using the -S flag (on gcc and clang). But why not place the inline assembly directly inside main()? According to the C standard (since C99), if main() lacks an explicit return statement, it implicitly returns 0. This means our carefully crafted assembly code would execute, but its result would immediately be overwritten with zero and wouldn't make it back to the shell.
Take a look at lines 13–17 (below) in the output—the compiler has inserted our assembly code exactly as written. And don't bother trying to outsmart the compiler’s optimization settings; at this level, you're venturing into the territory of undefined behavior.
$ gcc -S -o - main-1.c | nl
1 .file "main-1.c"
2 .text
3 .type foo, @function
4 foo:
5 .LFB0:
6 .cfi_startproc
7 endbr64
8 pushq %rbp
9 .cfi_def_cfa_offset 16
10 .cfi_offset 6, -16
11 movq %rsp, %rbp
12 .cfi_def_cfa_register 6
13 #APP
14 # 1 "main-1.c" 1
15 mov $42, %rax
16 # 0 "" 2
17 #NO_APP
18 nop
19 popq %rbp
20 .cfi_def_cfa 7, 8
21 ret
22 .cfi_endproc
23 .LFE0:
<snip>
"Inline Linking"
We can also bypass the assembler and go directly to the linker. But before doing that, we need to determine the exact machine code for a simple int main() { return 42; } program. Instead of manually inspecting assembly, we can cheat a little by using binutils.
$ cat - << __EOF__ | gcc -O2 -o - -x c - | objdump -S -
int main() { return 42; }
__EOF__
-: file format elf64-x86-64
<snip> ... skipping unnecessary details ...
0000000000001040 <main>:
1040: f3 0f 1e fa endbr64
1044: b8 2a 00 00 00 mov $0x2a,%eax
1049: c3 ret
104a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
<snip> ... skipping unnecessary details ...
We can ignore endbr64, as it's not relevant for what we want to do. I also don't know why it's there (something something security, I guess). The essential machine code for return 42; is simply moving 42 (0x2a) into the EAX register, followed by a ret instruction.
Additionally, we can disregard the remaining instructions, as they are inserted for alignment purposes. As of this writing, the Linux kernel does not enforce function boundary alignment, making them unnecessary for our purposes.
This leaves us with the following minimal machine code:
"\xb8\x2a\0\0\0" /* mov $0x2a, %eax */
"\xc3" /* ret */
Let's try executing our handcrafted machine code:
$ cat - << __EOF__ | gcc -x c - && ./a.out; echo $?
const char main[] = "\xb8\x2a\0\0\0" "\xc3";
__EOF__
[1] 119959 segmentation fault (core dumped) ./a.out
139
Wait... what? It compiled just fine, but why did it crash with a segmentation fault?
Well, at least we have "segments" to "fault" on... right, embedded devs?
The issue is that the compiler and linker, by default, place const char main[] in the .rodata (read-only data) section. This section is mapped as non-executable by the Linux kernel’s memory protection mechanisms. So, when the program tries to execute code stored there, Linux treats it as a memory access violation and immediately kills the process with a segmentation fault.
We can ask for our machine code to be placed in the executable code section instead of .rodata. On Linux, this means placing it in the .text section, which is designated for executable instructions. We can achieve this directly in the compiler using the __attribute__ keyword, available in both GCC and Clang:
$ cat - << __EOF__ | gcc -x c - && ./a.out; echo $?
__attribute__((section(".text")))
const char main[] = "\xb8\x2a\0\0\0" "\xc3";
__EOF__
/tmp/ccvXEHQs.s: Assembler messages:
/tmp/ccvXEHQs.s:4: Warning: ignoring changed section attributes for .text
42
Hey, there's our hand-crafted 42! 🎉
Despite the assembler warning about changed section attributes, the execution works because we successfully placed our custom machine code in an executable memory section.
More Abusive C
With this trick, we can do all sorts of things. For instance, here’s a “function” that isn’t visible outside its scope—essentially a primitive, hardcoded lambda in machine code:
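int main() {
    /* Signature of the function encoded in the bytes below */
    typedef int (*f)(int, int);
    __attribute__((section(".text")))
    static const char func_impl[] =
        "\x8d\x04\x37" /* lea (%rdi,%rsi,1), %eax */
        "\xc3";        /* ret */
    return ((f)(func_impl))(40, 2);
}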
However, this approach comes with several limitations. For example:
- We can’t call other C functions because we’d need their exact memory addresses.
- We can’t access stack memory, so all variables must be stored in registers.
- We could have variables, but they’d effectively behave like static variables, meaning no thread safety.
That said, these restrictions won’t stop me from writing a "Hello, world!\n" program using raw machine code. (Oh, and yes, I use Arch Linux on x86-64, BTW. 😏)
__attribute__((section(".text")))
const char main[] =
/* mov eax, 1 */
"\xb8\x01\0\0\0"
/* mov edi, 1 */
"\xbf\x01\0\0\0"
/* lea rsi, [rip+0xa] (64-bit!) */
"\x48\x8d\x35\x0a\0\0\0"
/* mov edx, 0x13 */
"\xba\x13\0\0\0"
/* syscall */
"\x0f\x05"
/* xor eax,eax */
"\x31\xc0"
/* ret */
"\xc3"
/* UTF-8 💩 + space + "Hello, world!\n" = 19 bytes total */
"\xf0\x9f\x92\xa9 "
"Hello, world!\n"
"Who says main() has to be a function?";
Because why not include a hand-crafted poop emoji 💩 in our binary as well? 🥳
But Wait! There's More! 🔥
We can take things even further by adding some "security" and a factory pattern to our madness. (Isn’t it nice? 🤓)
#include <stddef.h> /* NULL */

__attribute__((section(".text")))
const char fun_impl[] =
    "\x8eGlsbVdoZEV3cGw="
    "R2xhZEkgQ2FuIEg="
    "\x8d\x04\x37\xc3" /* lea (%rdi,%rsi,1), %eax; ret */
    "ZWxwIHlvdSBvdXQ="
    "VGhpcyBpcyBBU0N="
    "SUkgQXJ0ISAgICA=";
typedef int (*fun_t)(int, int);
enum fun_factory_id {
    ADD_FUNCTION = 0,
    NOT_IMPLEMENTED,
    INVALID = -1,
};
fun_t fun_factory(int x, int license_key) {
    switch (x) {
    case ADD_FUNCTION: {
        /* Read the byte as unsigned char: plain char is signed on x86-64, and a
           sign-extended 0x8e would turn the offset into a huge value. */
        unsigned offset = (unsigned char)fun_impl[ADD_FUNCTION] ^ license_key;
        return (fun_t)(fun_impl + offset);
    }
    default:
        break;
    }
    return NULL;
}
#define OUR_LICENSE_KEY (0xAE)
int main() {
    fun_t add_fun = fun_factory(ADD_FUNCTION, OUR_LICENSE_KEY);
    return add_fun(40, 2);
}
What’s Happening Here?
- We've embedded obfuscated data (a mix of actual machine code and encoded gibberish) in a static text section.
- Instead of directly exposing our "add function," we use a factory function that requires a license key to retrieve it.
- The license key is used to XOR the function offset, adding a basic layer of obfuscation (though let’s be honest, a decent debugger will expose this instantly).
- Our main() function calls the fun factory, retrieves the function pointer, and executes it.
This almost looks like a primitive software licensing mechanism, right? (Totally not a sketchy DRM system. 🙃)
Is It Secure?
Absolutely not. 🔥
- The obfuscation is weak and easy to reverse-engineer.
- XOR-based security is laughably easy to crack.
- Since function pointers are directly manipulated, this could easily lead to unexpected behavior or crashes.
But hey, it’s fun to pretend we’re doing serious security engineering while writing machine code inside a C string. 😆
Conclusion
I think we have played for long enough. But what's the point? Are these tricks ever used in the wild? Some embedded targets have common functions programmed into ROM during fabrication. There are two ways to resolve such a function's address: write a linker script, or do this gross thing:
/* vendor_specific_function.c */
#define SOME_ROM_ADDRESS (0xBADC0DE) /* placeholder address */
void (*vendor_specific_function)(void) = (void (*)(void))SOME_ROM_ADDRESS;
I hope you had fun exploring some of the strange corners of the C programming language. We’ve seen how C lets you bypass almost every safety mechanism, execute hand-written machine code, and even mess with memory in ways that feel like cheating.
But as always, with great power comes greater segmentation faults—if your environment even has segments to fault on, right?