Solving a Mysterious Heap Corruption Crash

Or, how to track down a subtle Swift compiler bug

Posted by Agnes Vasarhelyi

A while back, we noticed an increase in crashes in our app. The crashes were marked as heap corruption, which makes them hard to debug: the location given in the stack trace (if any) can be far away, in both code and time, from where the problem actually lies.

After a long investigation down many paths, it turned out to be an issue in Swift itself. After I tweeted about it, a lot of people asked me for more details, so here I am, sharing the story of the mighty heap corruption issue.

Are you ready for a tale of woe, frustration, and ultimately, redemption? Are you curious about how other iOS teams — ours, in this case — investigate and track down bugs? Read on.

First, let’s have a look at what heap corruption is

Heap corruption occurs when dynamically allocated memory is not handled properly. Typical causes are reading or writing outside the bounds of allocated memory, or freeing the same memory twice. Since the result (e.g. a hard crash) can happen much later, when the program next touches the damaged piece of memory, the root cause of the issue can remain hidden from your eyes.
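To make the "it fails later, somewhere else" part concrete, here is a toy sketch in Swift. It does not touch real memory and is not how a real allocator works; ToyHeap and its members are hypothetical names made up for this illustration. Two "allocations" sit next to each other in one byte array, and a write one byte past the end of the first silently damages the second.

```swift
// A toy model of a heap: one flat byte array holding two adjacent "allocations".
// All names here are hypothetical and exist only for illustration.
struct ToyHeap {
    var bytes = [UInt8](repeating: 0, count: 16)
    let first = 0..<8    // allocation A occupies bytes 0...7
    let second = 8..<16  // allocation B occupies bytes 8...15

    // An unchecked write relative to allocation A, like a raw pointer store.
    mutating func writeToFirst(offset: Int, value: UInt8) {
        bytes[first.lowerBound + offset] = value
    }

    // B counts as intact only while it still holds the pattern its owner stored.
    func secondIsIntact(expected: UInt8) -> Bool {
        bytes[second].allSatisfy { $0 == expected }
    }
}

var heap = ToyHeap()
for i in heap.second { heap.bytes[i] = 0xBB }  // B's owner fills its memory

// One byte past the end of A. Nothing fails at this point.
heap.writeToFirst(offset: 8, value: 0xFF)

// Only later, when B's owner looks at its data, does the damage show up.
let intact = heap.secondIsIntact(expected: 0xBB)
print("B intact:", intact)
```

The bad write and the place where the damage is noticed are unrelated pieces of code, which is why the stack trace of a heap corruption crash says so little about the cause.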

Gathering signal with crash reporting

It all started with reports from Crashlytics about an increasing number of heap corruption issues. The content of the issues was not helpful, because where it crashes has little relation to where the real problem is.

As the number of these issues grew, we got more and more nervous. Crash-free user sessions went down from almost 100% to 96% in a few months. 😨

The increase in crashes lined up with a Crashlytics SDK update, so I started by asking if anything might have changed on their end:

Perhaps it was improved issue tracking? Maybe they enabled something that now sends us all the exceptions thrown from system libraries?

They were quick to respond and the answer was a definitive no. The problem was indeed our problem, and it had nothing to do with any change at Crashlytics. ✅

Try to reproduce the issue

By looking at the device types I realized we only had crashes on iOS 11 and only on older devices — iPhone 6, 6 Plus, 5S, SE, latest iPod touch 6 and iPad Mini.

Unfortunately, our older test devices in the office were all on iOS 10! We probably hadn’t tested the app on iOS 11 on any of these devices, ever.

Lining up the crashes with analytics events, it looked like the app crashed once people opened our Try On view, the AR view where you try on your Topology glasses. This seemed reasonable, as that screen is a heavy one full of SceneKit and Metal, allocating significantly more memory than other parts of the app.

We were able to reproduce this crash ourselves, which was a great first step. Now we could start investigating which part of the code was the problem.

Go back in time

The classic way to track down the source of a bug is to bisect. Try a version of your app from last month, see if it crashes, and then try another one either before or after that. Eventually, with enough tries, you can narrow it down to an exact commit.
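Bisecting is a binary search over your history. A minimal sketch of the idea in Swift, assuming the builds are ordered oldest to newest and that once a build is bad, every later build is bad too. The build names and the firstBadBuild helper are hypothetical; in practice "isBad" means installing that build and trying to crash it by hand.

```swift
// Returns the index of the first build for which `isBad` returns true,
// or nil if no build is bad. Assumes all good builds come before all bad ones.
func firstBadBuild<Build>(in builds: [Build], isBad: (Build) -> Bool) -> Int? {
    var low = 0
    var high = builds.count
    while low < high {
        let mid = low + (high - low) / 2
        if isBad(builds[mid]) {
            high = mid       // the first bad build is at mid or earlier
        } else {
            low = mid + 1    // the first bad build is after mid
        }
    }
    return low < builds.count ? low : nil
}

// Hypothetical history: everything from 1.3 onwards crashes.
let builds = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
let knownBad: Set<String> = ["1.3", "1.4", "1.5"]

var checks = 0
let firstBad = firstBadBuild(in: builds) { build in
    checks += 1              // each check stands for one manual test run
    return knownBad.contains(build)
}
print("first bad index:", firstBad as Any, "after", checks, "checks")
```

With six builds this takes three checks instead of six. If the very first build already turns out bad, as happened here, the search tells you the cause predates your history.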

I found a version of our app from eight months ago, before iOS 11 was introduced, and ran it on an affected iOS 11 device. It still crashed. ✅

The eight-month-old app was fine running on iOS 10 but not on iOS 11. Conclusion: something changed in iOS 11 to trigger the crash. Our working hypothesis was that iOS 11 uses more memory than iOS 10, and the increased memory pressure causes the app to crash on older devices.

Challenge every assumption

The team had a suggestion for me to validate the hypothesis: run the app on an iOS 11 iPhone 7, and get it to crash.

If memory pressure was the issue, I should have been able to reproduce the crash there by mallocing big chunks of memory and then entering our Try On view. No crash.
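For the curious, artificial memory pressure can look roughly like the sketch below. This is not the exact code we ran; the function names and sizes are hypothetical, and the numbers are kept small for illustration. Writing to the chunks matters, because memory that is allocated but never touched may not become resident.

```swift
import Foundation

// Allocates up to `count` chunks of `chunkSize` bytes and writes to all of them
// so the memory is actually used, not just reserved. The caller owns the result.
func applyMemoryPressure(chunkSize: Int, count: Int) -> [UnsafeMutableRawPointer] {
    var chunks: [UnsafeMutableRawPointer] = []
    for _ in 0..<count {
        guard let chunk = malloc(chunkSize) else { break }  // stop if allocation fails
        memset(chunk, 0xA5, chunkSize)
        chunks.append(chunk)
    }
    return chunks
}

// Frees everything allocated by applyMemoryPressure.
func releaseMemoryPressure(_ chunks: [UnsafeMutableRawPointer]) {
    chunks.forEach { free($0) }
}

let held = applyMemoryPressure(chunkSize: 1 << 20, count: 8)  // 8 MiB, illustrative
print("holding", held.count, "chunks")
// ... open the suspect screen here while the memory is held ...
releaseMemoryPressure(held)
```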

Our hypothesis was incorrect, but at least we had scratched off a strong possibility. ✅

Analyze and slice the code

At this point, we tried to think about how this problem could happen in the first place.

We knew there was something in the app that mishandled memory in a way that corrupted the heap. Swift is mostly memory safe, so unless we were doing something exotic, we should be safe.

However, we did have some exotic code in there to examine for issues:

1. Incorrect pointer manipulation (like double freeing)

We reviewed all our code doing raw pointer handling. We have some C++ code and some low-level graphics code, both being good candidates for incorrect pointer usage, but all turned out fine. ✅

2. Thread data races

There’s a great tool called the Thread Sanitizer built into Xcode that helps you find data race issues in your app. Unfortunately, it only runs on the Simulator, and much of our app uses features that are unsupported there. The parts that do run in the Simulator worked just fine and didn’t trigger any Thread Sanitizer warnings.

We manually tracked down every piece of concurrent code in our app, marking them as safe, or to be inspected. All turned out to be safe. ✅

Now what? 😳

Interlude

“What if it’s a bug in iOS 11?” - Eric, our CEO

Blaming the platform or the system frameworks is easy to do, but it’s such an unlikely occurrence. I just smiled at our CEO, saying “I don’t think so”, not knowing yet that he was very close to the truth.

Even if we immediately jumped to the conclusion of a Swift bug, we still needed to find a reproducible case to file a bug report.

“When in doubt, use brute force” — Ken Thompson

Finally, I went to a tried and tested method: brute force. Also known as “ripping the app apart”. 🔪

Remove third-party code

I removed every third-party dependency, to rule out the possibility that the problem was in someone else’s code. Luckily we’re very strict about not adding many third-party libraries, and the ones we do have mostly support easily isolatable code, like SSZipArchive.

Still crashing. ✅

Next, I removed our own framework of commonly used components.

Still crashing. ✅

Move suspicious pieces to an empty project

I was pretty suspicious about the heavyweight AR view, so I pulled it out into its own project.

Still crashing, great! ✅

I had a wild idea and removed the whole AR view.

Still crashing every second run. ✅ 😳

Confusion was now at its peak level. 🤯

If you get stuck, choose a new angle

At this point my most solid theory of blaming the AR view was ruined, so I had to try another angle.

The code was fairly slim at this point - a few thousand lines of parsing 3D models into all kinds of data structures. Nothing concurrent, everything running synchronously. I wanted to try and look at the crash site again. Even though I knew the cause of the heap corruption could be elsewhere, seeing the stack trace in the same piece of code every time made me want to look closer there.

The pattern I started to see was that there was always a Dictionary involved, and there was always a simd type such as double3 in the dictionary.

At this point, I was ready to give up after a week of this tiring hunt. 🏹🐞

But what if... what if it’s really a Swift bug? 🙀

I opened up my MacBook again and tried something crazy.

Ten minutes later:

VICTORY! 🎊

The Aftermath

Finding the problem was a significant chunk of work, but that’s not the end of the journey. There are a few things to do after you finally figure out what’s going on.

Workaround

Because hey, your users are still out there, crashing. 😬

For us, the workaround was to replace all double3 values with float3. After stress-testing dictionaries with float3 instead of double3 values, the app seemed to be stable.
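The change itself is mechanical. Here is a minimal sketch of what such a conversion might look like, written with the SIMD3<Double> and SIMD3<Float> spellings from newer Swift versions standing in for double3 and float3. The dictionary contents and the narrowed helper are hypothetical, not our actual model code.

```swift
// Hypothetical model data keyed by a vertex name.
let positions: [String: SIMD3<Double>] = [
    "nose": SIMD3(0.0, 1.5, 2.25),
    "ear": SIMD3(-7.5, 0.5, 0.0),
]

// Narrows each component to Float. Float keeps roughly 7 significant decimal
// digits, so this trades precision for the narrower type.
func narrowed(_ v: SIMD3<Double>) -> SIMD3<Float> {
    SIMD3<Float>(Float(v.x), Float(v.y), Float(v.z))
}

let floatPositions: [String: SIMD3<Float>] = positions.mapValues(narrowed)
print(floatPositions.count, "positions converted")
```

Whether the lost precision is acceptable depends on the data; for us it was.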

We still had no idea why double3 in dictionaries was a problem, but we wanted to submit a quick fix to the App Store. This part of the story has a happy ending: ever since the change landed we’re back at a 100% crash-free session rate. 😎

File a radar

The folks at Apple might be unaware of the bug, and there might be lots of people out there crashing for the same reason. They need your help to know if there’s something like this going on.

I’m lucky to have a friend and fellow Hungarian on the Swift standard library team, Karoly Lorentey, who reached out to me about the issue once I started tweeting about my progress. They were very excited about my findings, and once I submitted the bug report Erik Eckstein submitted a fix almost right away! The fix was shipped in Swift 4.1 and Xcode 9.3 beta 4.

I can confirm that the version of our app that used to crash before does not crash when built with the new Xcode beta. 🎊

Karoly was also kind enough to explain what the problem was, so let’s briefly go through the cause of all the heap corruption craziness.

The actual problem

The actual problem is not restricted to “old” devices. ❌
Every platform, and every device is affected. ✅

The actual problem is neither Dictionary- nor double3-specific. ❌
All collection types are affected, when storing types with alignments greater than 16 bytes. ✅

When their elements had unusually wide alignments, storage for the standard library’s collection types was not guaranteed to be always allocated with correct alignment. If the start of the storage did not fall on a suitable address, Dictionary rounded it up to the closest alignment boundary. This offset ensured correct alignment, but it also meant that the last Dictionary element may have ended up partially outside of the allocated buffer — leading to a form of buffer overflow. Some innocuous combination of OS/language/device parameters probably caused this issue to trigger more frequently — which is probably why it became noticeable on particular devices running iOS 11.
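The arithmetic behind that overflow is easy to work through. The sketch below uses made-up addresses and sizes (four elements, 32-byte stride and alignment, a buffer that happens to start on a 16-byte boundary); it only does integer math and does not reproduce the standard library's actual allocation code.

```swift
// Rounds `address` up to the next multiple of `alignment` (a power of two).
func alignUp(_ address: Int, to alignment: Int) -> Int {
    (address + alignment - 1) & ~(alignment - 1)
}

// Illustrative numbers only.
let elementStride = 32
let elementAlignment = 32
let elementCount = 4

let bufferStart = 0x1010  // 16-byte aligned, but not 32-byte aligned
let bufferEnd = bufferStart + elementCount * elementStride  // 0x1090

// Storage is shifted forward to the next 32-byte boundary...
let firstElement = alignUp(bufferStart, to: elementAlignment)  // 0x1020

// ...so the last element now ends past the end of the allocation.
let lastElementEnd = firstElement + elementCount * elementStride  // 0x10A0
let overflow = lastElementEnd - bufferEnd

print("bytes written past the buffer:", overflow)
```

In this example the last element sticks out of the buffer by 16 bytes, exactly the amount the start was shifted. If the allocator happens to return a 32-byte-aligned address, nothing is shifted and nothing overflows, which fits the crash showing up only some of the time.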

What is alignment?

Alignment is the requirement that a value start at a memory address that is a multiple of some power of two, with padding inserted where needed to make that happen. Processors access memory most efficiently when data is properly aligned; some can still work with unaligned addresses, but there can be a performance hit. ARM, for instance, is stricter: unaligned accesses (when not allowed) will cause an alignment fault.

The fact is that not many Swift types have alignments greater than 16 bytes. This issue occurred in our app because we use simd types extensively, and simd types tend to have wide alignments.
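You can ask Swift about any type's layout with MemoryLayout. A small sketch, assuming a 64-bit platform; the Mixed struct is hypothetical, and SIMD3<Double> is the newer standard library spelling standing in for double3. The exact alignment reported for simd types depends on the Swift version and platform, so the sketch prints it rather than assuming a number.

```swift
// A struct whose second field forces padding after the first.
struct Mixed {
    var flag: Int8     // 1 byte
    var value: Double  // 8 bytes, must start on a multiple of 8
}

print("Int8 alignment:  ", MemoryLayout<Int8>.alignment)
print("Double alignment:", MemoryLayout<Double>.alignment)

// 1 byte of data + 7 bytes of padding + 8 bytes of data.
print("Mixed size:      ", MemoryLayout<Mixed>.size)
print("Mixed stride:    ", MemoryLayout<Mixed>.stride)
print("Mixed alignment: ", MemoryLayout<Mixed>.alignment)

// simd vectors are aligned more widely than their scalar components.
print("SIMD3<Double> alignment:", MemoryLayout<SIMD3<Double>>.alignment)
print("SIMD3<Double> stride:   ", MemoryLayout<SIMD3<Double>>.stride)
```

The stride is the distance between consecutive elements in a collection's storage, so it is the number that matters when a buffer of elements is sized and placed.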

Even though we tried debugging the problem with the Address Sanitizer and Instruments memory inspection, we never caught it. I wonder why. 🤷🏻‍♀️

Lessons learned

Identifying and fixing bugs is a big part of a software engineer’s job. We all know how to do it, but sometimes it’s nice to review some of the big-picture steps:

  • Make sure you’re gathering the correct signal.
  • Find a consistent reproduction case.
  • Bisect, or find another way to go back in time to trace the problem to its origin.
  • Analyze and slice the code, and always challenge your assumptions.
  • Work your way down to the smallest and simplest bit of problem code.
  • Document everything!

In the end, I had a simple project that exercised the bug along with copious notes, more than enough to file a bug report with Apple.

I learned that a methodical approach and a little brute force can wear down any problem.

And finally, I learned that sometimes your CEO can be right. Sorry Eric. 😉


If you have any questions, comments, or feedback regarding this article, you can find me on Twitter as @vasarhelyia. All very welcome. DMs are open, too. 💌
Thanks to my lovely team for polishing this article! 💜

There is a second part to this article coming up explaining what the problem was in more detail and how the Swift team at Apple solved it. 🤓 Stay tuned!

Topology makes custom eyeglasses and sunglasses, perfectly sculpted to fit one person at a time. Our app combines video capture, 3D rendering, and Core Motion (among other things) to create a premium experience for our users. Sound interesting? We’re hiring, please get in touch!

© 2017–2018 - Topology Eyewear