Torchlight III Launch Weekend Postmortem

Echtra Games and Perfect World Entertainment’s hack ‘n’ slash action-RPG Torchlight III launched into early access a few weeks ago. The launch was plagued by an assortment of issues and errors that led to a less-than-ideal experience. As a result, we can now check out this lengthy developer blog post that looks back at Torchlight III’s opening weekend and everything that went wrong.

Here are a few snippets to give you a taste of what to expect:

Sometimes the best way to look to the future of a project is to understand its past. While next week’s State of the Game will take a deeper dive into the upcoming updates, ongoing issues, and other community feedback, this State of the Game is for those who enjoy a good technical read and want a postmortem on our Early Access weekend launch. This was written shortly after the Early Access launch weekend in order to retain as many details as possible.

Introduction

Wow, what an exciting weekend it’s been. On Saturday morning, we launched Torchlight 3 for early access on Steam, after having been in alpha and then beta for 15 months. After a year of having live players from all over the world playing our game, we had developed a lot of technology and procedures for deploying changes, fixing issues, investigating problems, triaging bugs, and the other fine-grained minutiae of running a live service. The game servers and game clients had been running fairly smoothly, so we felt prepared to open up the floodgates.

Life likes to throw curve balls, and we had a ton thrown at us over the course of this past weekend. Our team was up through the weekend chasing down problems, and now that things have settled down and our service is in better shape, it’s worth taking a look back. So, let’s go on a journey of the release and see what happened and how we fixed it.

Torchlight 3 Launches

Our live team got together early on Saturday morning. Or, at least, early for game developers. At 10am, our build was ready to go and was being white-gloved by QA. A white-glove test is a test that QA does on a build when there are no other players on it to “check for dust” - that is, to make sure that there are no surprise issues. That process was complete, and at 10:30am - with no fanfare - the game was available for purchase on Steam.

Just after 11am, Max went on the PC Gaming Show to talk about Torchlight 3, and told people that it was live. So much for no fanfare. Thus opened the floodgates, and people started playing. For a little while, things seemed to go well. Our concurrent user (CCU) numbers went from zero up to a few thousand in a very short timeframe.

Very quickly, though, we started to get reports of players disconnecting, failing to log on, and failing to travel across zones. The team jumped into action like a cheetah on a pogo stick. We determined that there were likely two root causes here: service scaling and server reaping.

Most of our back-end services are horizontally scalable. That means that, if one of the services is under high load, then we can bring up more instances of it to handle the load. We did exactly that to handle one of the problems. The fact that the load was so high was indicative of other problems, but increasing the resources allotted to those services helped resolve those issues while we investigated the root causes.
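To make the idea of horizontal scaling concrete, here is a minimal Python sketch of how an instance count might be derived from load. It is not taken from the Torchlight 3 back end: the function name, the parameters, and the 75% utilization target are all assumptions chosen for illustration.

```python
import math


def desired_instances(total_load, capacity_per_instance,
                      target_utilization=0.75, min_instances=1, max_instances=100):
    """Return how many instances keep each one at or below the target utilization.

    All names and defaults here are hypothetical; a real service would
    measure load and capacity in whatever unit it actually tracks.
    """
    if capacity_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and target_utilization in (0, 1]")
    # Each instance should only be asked to carry a fraction of its capacity.
    usable_capacity = capacity_per_instance * target_utilization
    needed = math.ceil(total_load / usable_capacity)
    # Clamp to the configured bounds so we never scale to zero or without limit.
    return max(min_instances, min(max_instances, needed))


# Example: 3000 requests per second, each instance rated for 100 per second.
print(desired_instances(3000, 100))
```

As the post notes, adding instances like this only buys time. If the load is high because of a bug elsewhere, the count keeps climbing until the underlying cause is fixed.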

The other problem was server reaping. For some reason, the services that were monitoring our game servers were not getting informed that those servers had players on them. We tracked down the connection issue and the disconnects stopped.
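The post does not spell out how reaping works, but a common pattern is for a monitor to shut down game servers that have had no players for some time. The Python sketch below illustrates that pattern and why missing player reports are dangerous. It is an assumption about the general technique, not Echtra's code: the ServerMonitor class, its methods, and the timeout value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServerMonitor:
    """Hypothetical monitor that decides which game servers may be shut down."""
    idle_timeout: float  # seconds a server may go without known players
    _last_occupied: dict = field(default_factory=dict, init=False)

    def register(self, server_id, now):
        # A newly launched server starts its idle clock immediately.
        self._last_occupied[server_id] = now

    def report(self, server_id, player_count, now):
        # Only reports that show players reset the idle clock.
        if player_count > 0:
            self._last_occupied[server_id] = now

    def reapable(self, now):
        # Servers with no known players for idle_timeout seconds are candidates.
        return sorted(
            server_id
            for server_id, last_seen in self._last_occupied.items()
            if now - last_seen >= self.idle_timeout
        )


monitor = ServerMonitor(idle_timeout=60)
monitor.register("zone-a", now=0)
monitor.register("zone-b", now=0)

# zone-a's report arrives; zone-b also has players, but its report is lost.
monitor.report("zone-a", player_count=5, now=50)

print(monitor.reapable(now=70))
```

In this sketch the monitor cannot tell an empty server from one whose reports never arrived, so zone-b is marked for reaping even though players are on it. That matches the symptom described above, where the monitoring services were not being told that servers had players.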

[...]

Conclusion

In any distributed system, there is always going to be more than one root cause of any perceived issue. We had to dive through about four levels of other problems before we even realized that the zombie zones were an issue in the first place, and then once we tracked down the problem there were several places in our code that had to be fixed in the same way.

The lesson here for us is that our first impression of a problem’s root cause is unlikely to be the actual cause, so we need to constantly re-evaluate our fundamental assumptions.

As we move forward from this Early Access launch, we want to thank all of our fans and players for your patience as we worked through these problems. We know that it can be frustrating sometimes, but rest assured that we’re working tirelessly to fix the problems, and that we are listening to your concerns and feedback about the game.

- Guy Somberg