{ heap corruption }

  • Heap Corruption Follow-up: Size, Alignment, and Crashing Collections

    Alignment and buffers and tails, oh my!

    Posted on by Greg Heo

    Tags:

    After a massive multi-day effort, our team found one of those rare bugs that turned out to be an issue deep in the compiler. My colleague Agnes took the lead to track down the issue and wrote up her findings about her adventures.

    Now that the issue is fixed and the hard work is done, I can step in and do a little post-mortem. šŸ”Ž What was the problem? What was the fix? What did we learn?

    Smashing the Stack

    The classic article ā€œSmashing the Stack for Fun and Profitā€ from 1996 outlines a technique to execute arbitrary (and possibly malicious) code via a program that doesnā€™t do proper bounds checking. In the 20 years since, operating systems and programmers have become much more clever about stopping these exploits.

    Sometimes though weā€™re our own worst enemies ā€” reading past the last element of an array or an off-by-one error that writes past the end of a buffer is as easy to do as it was 20 years ago.

    In our case, we were getting heap corruption errors, meaning something was accessing memory it shouldnā€™t be accessing. But Swift is a safe language, isnā€™t it? How was this possible?

    To get to the root cause, we first need to understand two concepts: memory layout, and tail allocation.

    Memory Layout

    When instances are laid out in memory, there are several numbers to consider:

    Size is the number of bytes to read so you have all the useful information.

    Stride is the distance between two values. If you have a contiguous array of items, the stride is the number of bytes to advance to reach the next value.

    Alignment helps determine where in memory your data can start. Depending on the type and the CPU, you could have requirements such as ā€œdata must start at an even memory addressā€ or ā€œdata must start at a multiple of 8ā€.

    Size, stride, and alignment diagram

    If you want a longer discussion and more examples of memory layout, check out my article about Size, Stride, and Alignment in Swift.

    Tail Allocation

    Do you remember tagged pointers from the Objective-C days? The idea was: why spin up an entire object just to hold something like an NSNumber containing a boolean? A boolean is a single bit, after all.

    Pointer to an object that stores a Boolean value

    So on a 64-bit system, a pointer to an object takes 64-bits. But forget the pointer, we could store entire values with room to spare in 64 bits. Booleans, integers, even short strings could fit.

    What a tagged pointer with a Boolean value might look like

    Itā€™s a size optimization ā€” although we could in theory address 264 bytes of memory (thatā€™s 18 exabytes, or a 18 million terabytes), thatā€™s overkill, isnā€™t it? Why not use a few bits for flags, leave enough space to address memory, and then you can support features like tagged pointers.

    Similarly, when you allocate something like an array or dictionary in Swift you get a little extra space at the end.

    Sometimes this is due to alignment: since the next thing in memory needs to spaced out a bit more, why not use that blank space ā€œbetweenā€ size and stride? Other times, the system might reserve a little extra space for expansion so you donā€™t need to do an expensive re-allocation and copy later. This extra space is the tail allocation.

    Thanks to Karoly Lorentey for explaining tail allocation to me via Twitter.

    Size, Alignment, and Crashing Collections

    Now, on with the story.

    Letā€™s say you want an array in Swift. The runtime goes ahead and allocates some space, and also adds on some tail-allocated space. Hereā€™s the method that does the work:

    1
    2
    3
    4
    llvm::Value *irgen::appendSizeForTailAllocatedArrays(
    IRGenFunction &IGF,
    llvm::Value *size,
    TailArraysRef TailArrays)

    Weā€™re passing in an llvm::Value for the size, and the function returns another llvm::Value representing the new total size. For example, you ask for a 44-byte array and the function returns a value of 48 to round it up to a multiple of 8. Thatā€™s four extra bytes of tail-allocated storage!

    In practice, the system might add enough space for a few extra elements rather than just the ā€œrounded upā€ space due to alignment.

    Big Data

    So, letā€™s say we have a array. We allocate enough space for some values:

    Space for a array

    Then the compiler helpfully adds on tail-allocated storage, enough to hold an additional value:

    Tail-allocated space

    Now hereā€™s the kicker: in our app, weā€™re storing values of double3 type. A double3 has a size and alignment of 32 bytes. But a 32-byte type will certainly fit in the tail-allocated space:

    Looks like there's enough room

    But it may not be aligned properly. At runtime, if the array decides to use its tail-allocated space it will also make sure to write that value to a 32-byte alignment boundary:

    Buffer overflow in tail-allocated space

    The tail-allocated size is sufficient, but the system didnā€™t take alignment into account. The alignment boundary we need is not at the start of the tail allocation.

    The result? A buffer overflow. Corrupted heap. šŸ’„

    Alignment All the Way Down

    What were the problems and solution here? Two things:

    1. What gets stored in the tail-allocated space will affect the alignment of the entire type.

      In the method declaration above, you saw how you pass in a size and get back a size. If you look at the commit that fixes the issue, the method now returns both a size and an alignment as a pair (aka a two-element tuple).

      This change recognizes both size and alignment of the tail-allocated space when determining size and alignment of the overall space.

    2. On Apple platforms, heap allocations are aligned on 16-byte boundaries.

      That means your types that are 1, 2, 4, 8, and 16 bytes wide are also aligned. But a double3 needs to be 32-byte aligned. This second commit updates the runtime to call AlignedAlloc() rather than malloc() to ensure proper alignment.

    A big round of thanks to Erik Eckstein, Arnold Schwaighofer, Mike Ash, and Jordan Rose for their work and reviews on the two pull requests.

    Final Lessons

    Iā€™m not a compiler engineer and I donā€™t know all the details about this bug, but Iā€™m a firm believer in the value of reading code ā€” especially code you donā€™t understand yet.

    Agnes wrote a great summary of her lessons learned while tracking down this bug; what things can I add after digging a bit more into this problem?

    1. Keep code distance short.

      Tail allocation is a relatively new feature. It sort of bolts on top of the usual allocation flow, and thereā€™s more code to follow and to understand to trace the thing end to end.

      Sometimes this is a problem in my own code. Do classes have high levels of cohesion? Are related things close together, or separated by many frameworks and modules? How many files are there to go through and how big is the stack trace to understand some area of the code?

    2. Learn how to trace values back in time.

      Agnes covered the magic of going back in time (via TestFlight builds and version control) as a debugging tool.

      Within your codebase, can you pick a variable and see its changes as it moves through your program? If the alignment is 1 at the beginning and 16 by the end, why? And how? Why isnā€™t it the correct value of 32? If the buffer looks correct before adding data and then has data written past the end after initialization, why? And how?

      Learning how to use breakpoints, watchpoints, conditional breakpoints, etc. is invaluable here as you trace values over time.

    Although it was tough work to track down this bug, I had a fun time doing a little digging after the fact on the cause and fix. Iā€™m super impressed at how quickly it was fixed in the compiler and now look forward to the months ahead of a completely bug-free system. šŸ˜‰


    Thanks to my colleagues Agnes and Alexis for their help reviewing this article. All remaining errors are of course intentional and meant to test your attention. šŸ˜œ

    If you have questions or feedback for me, you can get in touch via Twitter where Iā€™m @gregheo. If youā€™re interested in working with us ā€” weā€™re hiring, and looking for an expert iOS developer. You could be working on the future of AR on iOS and writing about it on this fine company engineering blog!

Topology makes custom eyeglasses and sunglasses, perfectly sculpted to fit one person at a time. Our app combines video capture, 3D rendering, and Core Motion (among other things) to create a premium experience for our users. Sound interesting? Weā€™re hiring, please get in touch!

Ā© 2017ā€“2018 - Topology Eyewear
g src="/engineering-blog/images/heap4.png">

The folks at Apple might be unaware of the bug, and there might be lots of people out there crashing for the same reason. They need your help to know if thereā€™s something like this going on.

Iā€™m lucky to have a friend and fellow Hungarian on the Swift standard library team, Karoly Lorentey, who reached out to me about the issue once I started tweeting about my progress. They were very excited about my findings, and once I submitted the bug report Erik Eckstein submitted a fix almost right away! The fix was shipped in Swift 4.1 and Xcode 9.3 beta 4.

I can confirm that the version of our app that used to crash before does not crash when built with the new Xcode beta. šŸŽŠ

Karoly was also kind enough to explain what the problem was, so letā€™s briefly go through the cause of all the heap corruption craziness.

The actual problem

The actual problem is not restricted to ā€œoldā€ devices. āŒ
Every platform, and every device is affected. āœ…

The actual problem is not Dictionary nor double3 specific. āŒ
All collection types are affected, when storing types with alignments greater than 16 bytes. āœ…

When their elements had unusually wide alignments, storage for the standard libraryā€™s collection types was not guaranteed to be always allocated with correct alignment. If the start of the storage did not fall on a suitable address, Dictionary rounded it up to the closest alignment boundary. This offset ensured correct alignment, but it also meant that the last Dictionary element may have ended up partially outside of the allocated buffer ā€” leading to a form of buffer overflow. Some innocuous combination of OS/language/device parameters probably caused this issue to trigger more frequently ā€” which is probably why it became noticeable on particular devices running iOS 11.

What is alignment?

Alignment defines the amount of padding needed to make data line up on ā€œevenā€ memory addresses. Processors are most efficient with memory access if your data is aligned properly; some can still work with unaligned addresses but there can be a performance hit. ARM is more strict for instance, unaligned accesses (when not allowed) will cause an alignment fault.

The fact is, that not many of the Swift types have alignments greater than 16 bytes. The reason why this issue occurred in our app is because we are using simd types extensively. simd types tend to have wide alignments.

Even though we tried debugging the problem with the Address Sanitizer and Instruments memory inspection, we never caught it. I wonder why. šŸ¤·šŸ»ā€ā™€ļø

Lessons learned

Identifying and fixing bugs is a big part of a software engineerā€™s job. We all know how to do it, but sometimes itā€™s nice to review some of the big-picture steps:

In the end, I had a simple project that exercised the bug along with copious notes, more than enough to file a bug report with Apple.

I learned that a methodical approach and a little brute force can wear down any problem.

And finally, I learned that sometimes your CEO can be right. Sorry Eric. šŸ˜‰


If you have any questions, comments, or feedback regarding this article, you can find me on Twitter as @vasarhelyia. All very welcome. DMs are open, too. šŸ’Œ
Thanks to my lovely team for polishing this article! šŸ’œ

There is a second part to this article coming up explaining what the problem was in more detail and how the Swift team at Apple solved it. šŸ¤“ Stay tuned!