The other day I was playing around with CVE-2013-2028 along with my peer Hong Hu when we came across something odd: CVE-2013-2028 is only exploitable on 64-bit GNU/Linux when ASLR is enabled. After confirming this observation multiple times, we were left very surprised. How could ASLR possibly worsen the security of an application? Driven by curiosity, we decided to find the root cause of this result. Ultimately, we had to go all the way to the Linux kernel code to find our answer. What we found was a kernel quirk that can't really be called a bug from the kernel's perspective, but does go against the expectations of the user. So without further ado, allow me to share how ASLR can enable the exploitation of applications.
For those unfamiliar with CVE-2013-2028, all you need to know is that it is an exploitable vulnerability in older versions of nginx: a stack buffer overflow that can be triggered by specially crafted HTTP requests. The bug occurs because a user-supplied integer that is intended to be an unsigned value is temporarily cast to a signed value. If an attacker passes a sufficiently large value, the worker process handling the request will copy too much data from its network socket into a fixed-size buffer, smashing the stack. For the curious reader, a more in-depth analysis is available here and a repository for reproducing it is available here.
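The signed/unsigned confusion described above can be sketched in a few lines of C. This is a minimal illustration, not nginx's actual code: the names `DISCARD_BUFFER_SIZE`, `min_i64`, `buggy_recv_size`, and `checked_recv_size` are hypothetical, and the sketch assumes a 64-bit `size_t` and the usual two's-complement conversion between `uint64_t` and `int64_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of the fixed stack buffer the worker reads into. */
#define DISCARD_BUFFER_SIZE 4096

/* Signed minimum, standing in for a macro such as ((a) < (b) ? (a) : (b)). */
static int64_t min_i64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/*
 * Buggy pattern: the remaining length is held in a signed type, so a huge
 * attacker-supplied value is negative here. The signed comparison picks the
 * negative number as the "minimum", and the cast back to size_t turns it
 * into an enormous unsigned length.
 */
size_t buggy_recv_size(int64_t remaining)
{
    return (size_t)min_i64(remaining, DISCARD_BUFFER_SIZE);
}

/*
 * One possible guard: refuse negative lengths before taking the minimum.
 * Returns 0 and stores the size on success, -1 on a negative length.
 */
int checked_recv_size(int64_t remaining, size_t *out)
{
    if (remaining < 0)
        return -1;
    *out = (size_t)min_i64(remaining, DISCARD_BUFFER_SIZE);
    return 0;
}
```

Under these assumptions, passing `(int64_t)0xaaaaaaaaaaaaaaaaULL` to `buggy_recv_size()` yields a length far larger than the 4096-byte buffer, while ordinary lengths are bounded by the buffer size as intended.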
So why is this bug only exploitable when ASLR is turned on? We can find the user space answer with a simple strace. If we make a chunked HTTP request and claim the total size is going to be 0xaaaaaaaaaaaaaaaa, nginx's worker will make a recvfrom() system call for 0xaaaaaaaaaaaaaab0 bytes from the network socket. When ASLR is turned on, the Linux kernel will copy our request (which is not actually 0xaaaaaaaaaaaaaaaa bytes long) into the worker's buffer, smashing the stack. However, when ASLR is turned off, the kernel will return -EFAULT and the worker will safely report the error and close the session.
We could stop here, but Hong and I were not satisfied. Why is the kernel returning -EFAULT when ASLR is disabled but not when it is enabled? The space allocated for the stack is the same in both cases, so that can't be the problem. The only obvious difference is ASLR moves the stack's address range to randomize it. When ASLR is disabled, the stack's highest address is placed at the boundary between user and kernel space, which is 0x7fffffffffff in Linux kernels compiled for x86_64. However, 0xaaaaaaaaaaaaaab0 is such a large number it shouldn't matter where the stack is placed. It's not going to fit into the memory segment and it's going to cross the boundary. So what's really happening in the kernel when it handles a recvfrom() system call?
Taking a look at Linux's implementation of recvfrom(), we see the following code:
SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
                unsigned int, flags, struct sockaddr __user *, addr,
                int __user *, addr_len)
{
    struct socket *sock;
    struct iovec iov;
    struct msghdr msg;
    struct sockaddr_storage address;
    int err, err2;
    int fput_needed;

    err = import_single_range(READ, ubuf, size, &iov, &msg.msg_iter);
    if (unlikely(err))
        return err;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        goto out;

    msg.msg_control = NULL;
    msg.msg_controllen = 0;
    /* Save some cycles and don't copy the address if not needed */
    msg.msg_name = addr ? (struct sockaddr *)&address : NULL;
    /* We assume all kernel code knows the size of sockaddr_storage */
    msg.msg_namelen = 0;
    msg.msg_iocb = NULL;
    if (sock->file->f_flags & O_NONBLOCK)
        flags |= MSG_DONTWAIT;
    err = sock_recvmsg(sock, &msg, flags);

    if (err >= 0 && addr != NULL) {
        err2 = move_addr_to_user(&address,
                                 msg.msg_namelen, addr, addr_len);
        if (err2 < 0)
            err = err2;
    }

    fput_light(sock->file, fput_needed);
out:
    return err;
}
This code performs two relevant checks. The first occurs in:
err = import_single_range(READ, ubuf, size, &iov, &msg.msg_iter);
And the second occurs in:
err2 = move_addr_to_user(&address, msg.msg_namelen, addr, addr_len);
However, we can rule out move_addr_to_user() because it only copies the peer's socket address back to user space. It is passed msg.msg_namelen, the length of that address, which has nothing to do with the requested size and is the same in our attack regardless of ASLR. This leaves import_single_range(), which is implemented as follows:
int import_single_range(int rw, void __user *buf, size_t len,
                        struct iovec *iov, struct iov_iter *i)
{
    if (len > MAX_RW_COUNT)
        len = MAX_RW_COUNT;
    if (unlikely(!access_ok(!rw, buf, len)))
        return -EFAULT;

    iov->iov_base = buf;
    iov->iov_len = len;
    iov_iter_init(i, rw, iov, 1, len);
    return 0;
}
EXPORT_SYMBOL(import_single_range);
In this function, a sanity check is performed via access_ok() to make sure the number of bytes requested by the caller cannot cause a write that would cross into kernel space. But as we pointed out before, the value nginx's worker is passing here is 0xaaaaaaaaaaaaaab0, which should easily cross the boundary regardless of ASLR. The type size_t is defined as an unsigned 64-bit integer in our case, so access_ok() should be passed 0xaaaaaaaaaaaaaab0, right?
Actually, if we look more closely, we can see the following lines enforce a limit on len:
    if (len > MAX_RW_COUNT)
        len = MAX_RW_COUNT;
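If we look up MAX_RW_COUNT, we can see it equals (INT_MAX & PAGE_MASK), which with 4 KiB pages is 0x7ffff000, a value that fits in 32 bits. So in other words, even though recvfrom() allows 64-bit unsigned integer lengths on x86_64, import_single_range() clamps them to just under 2 GiB! The lower bits are not kept; any larger length simply becomes MAX_RW_COUNT. With ASLR disabled, the stack sits so close to the top of user space that even this clamped length runs past the boundary, so access_ok() fails. With ASLR enabled, the stack is usually relocated far enough below the boundary that the clamped range fits, which allows our attack to pass the access_ok() check and smash nginx's stack.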
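The interaction between the length limit and the access_ok() check can be modeled in user space. The sketch below is an approximation, not kernel code: the `model_` names are hypothetical, the page size is assumed to be 4 KiB, the user-space limit is assumed to be 0x00007ffffffff000 (x86_64 with 4-level paging), and the two buffer addresses used in the usage note are made-up examples of where a stack buffer might sit with ASLR off and on.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Assumed 4 KiB pages. */
#define MODEL_PAGE_SIZE    4096ULL
#define MODEL_PAGE_MASK    (~(MODEL_PAGE_SIZE - 1))

/* Mirrors (INT_MAX & PAGE_MASK); evaluates to 0x7ffff000 under the assumption above. */
#define MODEL_MAX_RW_COUNT ((uint64_t)INT_MAX & MODEL_PAGE_MASK)

/* Assumed top of user space on x86_64 with 4-level paging. */
#define MODEL_USER_LIMIT   0x00007ffffffff000ULL

/* Nonzero if [addr, addr + len) stays within the modeled user-space limit. */
int model_access_ok(uint64_t addr, uint64_t len)
{
    /* Written to avoid overflow in addr + len. */
    return len <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - len;
}

/* Models the limit-then-check order seen in import_single_range(). */
int model_import_single_range(uint64_t buf, uint64_t len)
{
    if (len > MODEL_MAX_RW_COUNT)
        len = MODEL_MAX_RW_COUNT;      /* cap first ... */
    if (!model_access_ok(buf, len))    /* ... then check the capped range */
        return -EFAULT;
    return 0;
}
```

With a hypothetical ASLR-off buffer at 0x00007fffffffd000, `model_import_single_range()` returns -EFAULT for a length of 0xaaaaaaaaaaaaaab0, because even 0x7ffff000 bytes would run past the limit. With a hypothetical ASLR-on buffer at 0x00007ffc12345000, the same call returns 0, even though `model_access_ok()` on the uncapped length would have rejected it.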
Technically, this isn't a bug from the kernel's perspective because import_single_range() also calls iov_iter_init() with the clamped length. This means recvfrom() can only receive up to the clamped length worth of bytes from the socket and therefore passing the clamped value to access_ok() is safe.
That said, it's a really odd way of implementing this system call. From the caller's perspective, it's not made clear that even though it can pass a 64-bit length, anything above MAX_RW_COUNT is silently reduced to that limit. Also, recvfrom() treats the length as a 64-bit value all the way through its own logic, so it's not immediately obvious that the length is being capped inside import_single_range(). Additionally, as Hong and I discovered, there is a security consequence to this choice. Performing the access_ok() check on the capped length allows network attacks that rely on integer overflow and underflow to succeed where they would otherwise more likely be blocked by the kernel due to a failed system call. We find this to be an interesting consequence since it results from seemingly unrelated design decisions. It is hard to recommend that the Linux kernel developers revise import_single_range() given that the real problem is a bug in nginx and not the Linux kernel itself, but we find this discovery fascinating regardless.