Solving problems with very large Java heaps
Users depend on our sites to be up and running at all times. Many critical services must stay up in order to deliver that uptime, and those services need to respond to changes in user behavior and load. This is the story of one of our hotel content services, which feeds large amounts of hotel content to systems throughout Expedia: how it temporarily ran into problems, and how we fixed it.
This content service is responsible for serving content such as amenity descriptions, hotel descriptions, image metadata, etc. in 40+ languages for more than 230K properties (and counting). The result is a dataset of almost a terabyte of data that keeps growing every day.
The current version of these services uses a distributed computing grid based on a piece of commercial software. It has served us well, and while no technology is perfect, we were satisfied with the performance of our grid.
Erroneous assumptions can be disastrous. — Peter Drucker
Back in January we began to see occasional spikes in the amount of data being processed in the grid, which were translating into a small number of response failures. We initially thought that these spikes were related to traffic and started working on solutions for periodic traffic bursts.
What we didn't know at the time was that we were not experiencing traffic bursts. Our incorrect assumption was based on a misinterpretation of the internal metrics and transaction logs, which pointed to requests piling up rather than a spike in received requests. We started designing a solution to similar traffic-spike problems on a newer REST API. We also investigated alternatives to our commercial solution, such as Hazelcast in the cloud on AWS. For many months, despite these spikes, the system ran fine and we had no major issues.
In late summer (which tends to be a very high traffic time for travel companies) we started seeing significant failures. Following our theory of "increased traffic" we added more clusters to support the load, and we did in fact see some temporary gain in stability. However, the root cause was never cleanly identified, and we started experiencing recurrent issues with both new and old clusters. The typical symptom was that one node would go down, followed by another a few minutes later, and then another, until the whole cluster finally crashed.
Working with the vendor we were able to identify some full garbage collection (GC) pauses that were impacting the system. These stop-the-world GC pauses would last from 2 to 5 minutes and the data nodes experiencing these pauses would get excluded from the distributed system, therefore putting the whole system in jeopardy. The most disturbing part was that the same dataset was running without failures on the previous version of the vendor’s product. Our new goal was to eliminate these blocking GC pauses. The priority became to reproduce the pauses in the lab environment.
Unfortunately a lab reproduction turned out to be elusive. Despite being able to generate high traffic volumes, the stress environment only partially reproduced the read and write patterns from production. In the production system the content is in a constant state of flux, with high and varying levels of load, as well as large amounts of new content arriving from real-time updates and larger batch processes. These processes cause even more writes to the data grid.
Meanwhile, production failures kept happening with different symptoms, which made the investigation even harder to narrow down. The issue could not be reproduced in the lab, and it was not reproducing consistently in production either. We iterated through several different garbage collection parameters without much success. Production was working, but with periods of slow responses and degradation that we weren't remotely happy with.
The break in the investigation came by throwing out assumptions already made (as it almost always does in these situations) and doing a very deep dive of all the metrics. We uncovered that the full GC pauses were being initiated not from a lack of total available memory, but by high heap fragmentation. We were using large JVM heaps (110GB) on each of the data nodes and the Concurrent Mark Sweep garbage collector to reduce the GC pauses and footprints. However, that GC strategy does not compact the heap while collecting garbage and because of the high diversity of content per lodging property the heap was getting more and more fragmented over time.
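As a rough illustration (not our actual production settings), a data node configured this way would have been launched with flags along these lines; `data-node.jar` is a placeholder name:

```
# Illustrative only: a large fixed-size heap with the CMS collector,
# plus basic GC logging (JDK 7/8 era flags).
java -Xms110g -Xmx110g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -jar data-node.jar
```

Because CMS frees memory back onto free lists rather than compacting, a heap like this can report plenty of free space overall while still failing large allocations.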
What is heap fragmentation?
Heap fragmentation occurs as objects are allocated and deallocated. Programs work much more efficiently with contiguous blocks of memory, so that is what allocators hand out when something asks for memory. Small requests break up that contiguous memory, and over time the remaining contiguous chunks of free memory become smaller and smaller. Eventually the Java VM must trigger a full garbage collection, during which it compacts the heap to make more contiguous memory available.
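The effect can be sketched with a toy model (purely illustrative; the JVM's real allocator is far more sophisticated): a heap of 1,000 slots that is half free can still be unable to satisfy any allocation larger than 10 slots.

```java
/**
 * Toy model of heap fragmentation: the "heap" is an array of slots.
 * Filling it with 10-slot allocations and then freeing every other
 * block leaves 50% of the heap free, yet the largest contiguous free
 * run is only 10 slots -- nothing bigger fits without compaction.
 */
public class FragmentationDemo {

    static int countFree(boolean[] used) {
        int free = 0;
        for (boolean u : used) if (!u) free++;
        return free;
    }

    static int largestRun(boolean[] used) {
        int best = 0, run = 0;
        for (boolean u : used) {
            run = u ? 0 : run + 1;
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        boolean[] used = new boolean[1000];
        // Fill the heap completely with 10-slot allocations.
        for (int i = 0; i < 1000; i++) used[i] = true;
        // Free every other 10-slot block, as short-lived objects die.
        for (int block = 0; block < 100; block += 2) {
            for (int i = block * 10; i < block * 10 + 10; i++) used[i] = false;
        }
        System.out.println("Total free slots:       " + countFree(used));  // 500
        System.out.println("Largest contiguous run: " + largestRun(used)); // 10
    }
}
```

Half the memory is free, but a request for 11 contiguous slots fails; this is exactly the situation that forces a compacting full GC.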
Java object allocation on the heap and fragmentation issues
Now we understood the problem: it had been around for a while, and only now were we feeling its full impact. We had finally reached a heap usage level where full GC pauses were required to make room for newer objects on the heap. Another reason we hadn't noticed it earlier is that basic GC logging was not enough to track this fragmentation issue. It took the addition of an undocumented flag to see the problem:
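A likely candidate for that flag (stated here as an assumption, since the original setting is not reproduced) is the CMS free list statistics option:

```
-XX:PrintFLSStatistics=1
```

With CMS, this prints free list space statistics alongside each GC log entry, including the maximum free chunk size. Total free space staying high while the maximum chunk size steadily shrinks is the signature of fragmentation.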
One very obvious question is why a distributed system could not handle one or more node failures. After all, redundancy and failover are among the reasons we chose a distributed grid solution in the first place. In our case, all nodes were running close to maximum capacity, so each time one node failed, the remaining nodes became the primary distributors for the data residing on the failed node, increasing their own heap usage (which was already suffering from the same fragmentation issue) and ultimately leading these nodes to fail as well. Effectively we had a capacity issue related to data storage, not traffic load.
As a stop-gap we reduced the total number of copies in the system from 3 to 2, which gave us enough redundancy in case of node failure while saving a lot of memory, and increased the heap size by 10GB. The longer-term solution was a redesign using more instances of smaller 64GB JVMs per server. This spreads the load more evenly and reduces the duration of full GC pauses. We also updated the cache eviction threshold so that the distributed grid nodes start pushing data to disk instead of keeping it in RAM once they reach 80% of heap utilization.
It’s okay to make mistakes but you should try not to make the same mistakes twice. Through this experience, we learned some valuable lessons:
- Keep heap usage in any application running a large heap (64GB+) below 60% at all times. We had been running "hot" heap usage for the past year and were bitten hard by it. Running above 50% heap usage should automatically trigger capacity planning or a review of the existing data model to bring usage back down. In our case, organic data growth, as well as extensive content generation for A/B testing, caused considerable data volume growth over the past 18 months.
- Avoid full GC pauses at all costs in any distributed cluster, as the node running a full GC is likely to be removed from the cluster, and the version we were using had trouble re-joining. The G1 collector is also better suited to the type of large JVMs we run, and the plan going forward is to run all the data grid nodes with it: it is designed to handle large heaps with predictable GC pauses and compaction, without sacrificing throughput.
- Review all GC settings each time you upgrade to a new major Java version (6 to 7, 7 to 8, etc.).
- Review all distributed cluster/caching parameter settings (e.g. the number of buckets) each time you scale up cluster capacity or upgrade to a newer major version of the product. We had a very old setting that was preventing the distributed system from properly adjusting to the departure of a node.
- Log the right information in the transaction log files. Traffic bursts were claimed to be the main problem, and we had to make a code fix to add the request receive time to the logs before we could confirm that the issues were not related to traffic spikes.
- Whenever possible, capture live traffic patterns to be able to reproduce the same traffic in lab environments for troubleshooting.
- Maintain a more aggressive roadmap for upgrading older versions of Java and any third-party products.
- Finally, last but not least, we learned to address the symptoms by proactively increasing capacity horizontally and vertically, even before knowing the root cause of the problem.
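The first lesson can also be enforced programmatically. As a minimal sketch (the `HeapWatch` class and its threshold are our own illustration, not a vendor feature), the JMX `MemoryMXBean` exposes the numbers needed to alert on heap usage:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatch {
    // Threshold from the lesson above: large heaps should stay below 60%.
    static final double WARN_RATIO = 0.60;

    // Fraction of the maximum heap currently in use.
    static double ratio(MemoryUsage heap) {
        return (double) heap.getUsed() / heap.getMax();
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // Note: getMax() can return -1 if no maximum is defined; for the
        // heap it is normally set by -Xmx.
        MemoryUsage heap = memory.getHeapMemoryUsage();
        double r = ratio(heap);
        System.out.printf("Heap used: %.1f%% of max%n", r * 100);
        if (r > WARN_RATIO) {
            System.out.println("WARNING: heap above 60%, start capacity planning");
        }
    }
}
```

In practice this check would feed a metrics pipeline rather than stdout, but the same ratio is the number to alarm on.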
For the Content Systems teams,
Patrick Bradley, Philippe Deschenes and Rolland Mewanou.