Your mission: 100,000 simultaneous users. 38 days. 9 systems. One launch. Ready?
Are you sure you want to do this?
- What is it like to go from zero users to all the users in one day? How do you prepare for that?
- Have you ever dealt with performance problems in systems you don’t control?
- Have you ever negotiated a large-scale test with an external partner?
- Have you built an AWS architecture with the biggest possible pieces?
- Have you ever DDoSed your own site?
- Have you ever lost sleep because your application won’t horizontally scale?
Can't back out now
This is the story of a real-life product “land rush”. It starts with designing a load test, picking tools, and isolating a slew of dependencies. That’s the easy part. The fun really starts when the load test runs. Hold on tight and we’ll ride this swarm to the finish line.
The video above was recorded at ConFoo.ca on March 8, 2017.
Transcript of talk
[00:00:00] The title of this talk is Going to 11, or maybe, How I Learned to Stop Worrying and Love the Swarm. And like all good dramas, it's a tale in five acts. My name's Steve Jackson. I'm a double agent at Test Double. That's my Twitter handle and my email if you want to get a hold of me. I will talk about anything, but I'll talk about this in particular if you want to.
[00:00:22] That'll be good. Now, I had done performance testing and load testing in the past, but generally it was because we had an announcement coming out and knew we'd have some extra load, or we had a performance problem we were trying to optimize away, or maybe we were switching architectures and we wanted to make sure the new one was at least as performant as the old one.
[00:00:43] This was the first project I was ever on where we went from zero users to all of the users in the first day. So I'm going to talk a little bit about how we got started. I'm going to talk about Locust and why it was a good tool for us and what sort of things it helped us with. I'm going to talk about some of the problems we encountered and how we dealt with them in the architecture.
[00:01:06] Hopefully, I will make a compelling argument for starting load testing earlier. And again, this was pretty dramatic to live through, so hopefully it comes across in the talk. To give you some background, .realtor is a top-level domain, like .com or .io. And the idea behind .realtor is that we would allow realtors to expand their web presence.
[00:01:28] By getting a domain name that matched their brand, essentially. Now, domain management is an esoteric and strange thing. A records, CNAMEs, SOAs. And realtors aren't really leading-edge tech adopter type people. So part of our challenge here was: how do you package that up in such a way that it's not completely incomprehensible?
[00:01:51] And in addition, there are lots of strange business rules around domain registration in general that you have to follow. So we had developers working on that for about seven months. Getting all that kind of squared away. Myself and Eric Hankinson were brought on board on the 15th of September to set up the production infrastructure and start load testing.
[00:02:10] We had a soft launch on the 20th of October and went live three days later. Now, I'm going to talk a lot about myself during this talk, but there were a number of people involved and everybody was very important. Across the top, we have Charlotte Chang, Dan Parks, Eric Hankinson, and Michael Limley.
[00:02:29] In the middle, we have Nathan Wallace and Nick Barrett. Then myself, Tim Connor, Will Kessling, and Joel Viler. I want to give special shout outs to Charlotte. She was in her third week as an apprentice, and jumped on Zendesk in the middle of the launch to help with support. Eric was my partner in crime through most of this, and helped set up a lot of the initial AWS infrastructure.
[00:02:51] And Michael Limley was our iteration manager, and he went way above and beyond clearing roadblocks, having difficult conversations, and jumping on grenades. So I'm really thankful for all the things you did. Okay, so let's get started. We've got 38 days till launch. What are we going to do here? So this is what our infrastructure looks like up front.
[00:03:11] We're using an AWS cloud, and we have an Elastic Load Balancer in front of some number of web nodes, all running Nginx, Unicorn, and Rails. We've got Memcached for session management, and Postgres is our database. And then we were using a proxy between us and some of our dependencies, because they had whitelists for their IPs and we didn't want to be exchanging IPs every time we added web nodes to the load balancer.
[00:03:37] So on the surface, .realtor is not that exciting of a site. You have to be a member of the National Association of Realtors or the Canadian equivalent to get a .realtor domain. I can't buy one, for instance. But other than that, you search for the domain you want, add it to your cart, and check out.
[00:03:54] The most interesting part here is the sheer number of dependencies this had at launch, and there are probably twice as many now. Again, you have to be a member, that's very important, so we care about membership. We have payment processing, domain registration, and setting up your domain and DNS after that.
[00:04:11] And then we had mail forwarding and hosting options, along with the ability to send like transactional emails, receipts and things like that. So again, we really didn't have any data to work from to build our load test. So what we did is we worked with our product owner and tried to define how people would use this site.
[00:04:31] This is the acquisition funnel he had in his head. We figured out of the people that entered the site, probably 75 percent would go ahead and register for an account. This wasn't a trivial step. You had to know your realtor identification number, for instance. But, if you went ahead and went through that process, almost certainly you'd search for a domain.
[00:04:50] Of those, most people would buy one. And then there's some number of people we figured would already have existing hosting: brokerages will often give you a site with all your properties on it, and you just want to point your new awesome domain at that existing site. Now, with this funnel in place, it really let us prioritize which of these dependencies we cared about, which ones we really needed to make sure we load tested well. So we care a lot about membership, payment processing, and registration.
[00:05:20] We're not as concerned about mail forwarding and sending transactional emails. Now, we knew there was no way we were going to get all of these partners on board for a load test at the same time. What we had to do was think of a way to stub them out of our architecture, so that we could do load testing with individual partners and also give ourselves the ability to load test our own system and find bottlenecks.
[00:05:45] I overthought this initially. I was worried about latency and whether my little stubs would be powerful enough. Don't worry about it. All of these things were so simple: either returning canned data or taking some part of the request and returning it. So in most cases, we ended up with little Sinatra web services running on itty-bitty EC2 instances.
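For illustration, a stub along those lines, here as a minimal sketch in Python with the standard library (the team's were Sinatra apps, and the endpoint and fields below are made up):

```python
# Minimal dependency stub: return canned data, or echo part of the request back.
# Illustrative only; the path and payload shape are invented, not the real partner API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length)) if length else {}
        # Pretend every membership lookup succeeds, echoing back the id we were sent.
        reply = json.dumps({"member_id": body.get("member_id"), "status": "active"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StubHandler).serve_forever()
```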
[00:06:05] Domain registration is a little different. It doesn't use HTTP as a protocol. It uses a custom protocol called EPP, which is XML over TCP sockets. So we had to build something a little more sophisticated there, but in general, these were pretty simple. We made sure we set them up in a separate region from our normal system.
[00:06:24] We didn't want to run into a situation where our two pieces were next to each other in the same rack and there was no real network overhead to speak of. Still pretty fast because it's Amazon, but okay. So we have an architecture. We've figured out a way to stub out the dependencies. Now it's time to start talking about load testing.
[00:06:43] When you look at load test tools, generally, you're looking at tools that either scale vertically, meaning they're as powerful as the resources you put on them. So a tool like Siege kind of fits that mold. I like Siege a lot. It's a really simple tool if you just want to try and see if you can take down your server.
[00:06:59] But it wasn't powerful enough for what we needed to do. You need to scale horizontally if you want to simulate lots and lots of users. Our early clubhouse leader was Bees with Machine Guns, both because it has a really cool name and because what it does is very cool. It essentially spins up a whole cloud of little EC2 instances and they all come and hammer your site.
[00:07:19] I think the Chicago Tribune invented this, but it lacked some of the scripting capabilities we needed to do our load tests. We looked very briefly at Gatling.io, which is a Scala-based solution, and then settled on Locust after looking at some of the documentation that came with it. So Locust is Python.
[00:07:36] It's just an application. It can run anywhere you can run Python. We were using it to test a Rails app; you could test Laravel, you could test ASP.NET or Java Spring, it doesn't matter. These are simulating your users. The reason it won us over was that when you deal with a Rails site, all the pages that have forms have what are known as cross-site request forgery (CSRF) tokens in them.
[00:07:58] That's essentially there to make it so you aren't spoofing real web traffic, which is exactly what we were trying to do. So what we could do with Locust was make a request, pull out the part of the page we needed, and then use it in all of our subsequent POSTs, which was very useful. With that in place, we could pretend we were real users, with sessions and cookies and everything else, and get through a checkout flow.
[00:08:23] Again, Locust expands horizontally, so it'll spread its load across as many machines as you give it. And then there's a feature we overlooked initially but that ended up being really useful for us: you can specify the tasks that your simulated user is going to do, and then you can weight them by percentages.
[00:08:43] Which happened to map perfectly back to our user funnel: 50 percent of our simulated users would go ahead and buy a domain, which is great. This is probably the second time I'd ever really looked at Python, but I think it's a really nice DSL. It was very easy for me to understand, at least. Each simulated user, or locust, has a type.
[00:09:04] In this case, we had visitors and registered users, and these are just Python classes that extend from HttpLocust. And what we could do is set it up so that 90 percent of our simulated users were registered users, and 10 percent were visitors. Another neat feature here gets at how Locust basically works: each simulated user is a thread, and it will wake up, do a particular user task, go back to sleep for a while, and then wake up again.
[00:09:29] And so what Locust let us do here was set a random wait between 5 and 9 seconds. That way we didn't have a thundering-herd situation where our script wakes all of them up every five seconds, slams our server, and then goes back to sleep. So each type of user has a set of behaviors it might do.
[00:09:48] In the case of the visitor here, we might just visit the page, like, I don't know how I got here, oops, let me leave. We have the ability to do what is known as a guest search: without going through account registration, could I see if the domain I wanted was still available or not? And then you see these annotations here that allow us to do the percentages.
[00:10:07] So for a visitor, which is 10 percent of our load, 3 percent of that might just visit, 7 percent would do the guest search, and the other 90 percent would go ahead and register an account and become one of our registered users.
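A sketch of that structure in the Locust API of the era (HttpLocust and TaskSet, with millisecond min_wait/max_wait); the class names and paths are illustrative, not the project's actual script:

```python
from locust import HttpLocust, TaskSet, task

class VisitorTasks(TaskSet):
    @task(3)
    def just_visit(self):
        self.client.get("/")

    @task(7)
    def guest_search(self):
        self.client.get("/")        # stand-in; the real task is shown below

    @task(90)
    def register(self):
        self.client.get("/signup")  # stand-in for the account registration flow

class RegisteredUserTasks(TaskSet):
    @task
    def search_and_buy(self):
        self.client.get("/")        # stand-in for the search/cart/checkout flow

class Visitor(HttpLocust):
    task_set = VisitorTasks
    weight = 10        # roughly 10% of simulated users
    min_wait = 5000    # sleep 5 to 9 seconds between tasks (milliseconds)
    max_wait = 9000

class RegisteredUser(HttpLocust):
    task_set = RegisteredUserTasks
    weight = 90        # roughly 90% of simulated users
    min_wait = 5000
    max_wait = 9000
```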
[00:10:24] So this is what our task looks like for guest search. And this kind of shows you the power of Locust over some of the other tools, which are just designed to hit an endpoint. In this particular task, we need to make three requests against our server. First, we went to the homepage, which is where we had our search screen, and we pulled out the CSRF token, so that we could then use it in the subsequent POSTs.
[00:10:49] One of the business rules around a .realtor domain is that it has to have your name in it. So in order for our search to return relevant results, we need to know your name. I could try to get steveisawesome.realtor; I could not get nancy.realtor, for instance. That's what the second request is about, telling the server the name, and then we go ahead and do the search.
[00:11:10] That's it. That's our task that's going to run 7 percent of 10 percent of the time, so 0.7 percent of the time.
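A guest-search task along the lines described might look like this; the paths, form fields, and token-parsing regex are assumptions, since the actual script isn't reproduced here:

```python
import re
from locust import TaskSet, task

# Rails pages embed the CSRF token in a meta tag; pull it out so POSTs are accepted.
CSRF_RE = re.compile(r'name="csrf-token" content="([^"]+)"')

class GuestTasks(TaskSet):
    @task(7)
    def guest_search(self):
        # 1. GET the homepage (the search screen) and grab the CSRF token.
        home = self.client.get("/")
        token = CSRF_RE.search(home.text).group(1)
        headers = {"X-CSRF-Token": token}
        # 2. Tell the server the visitor's name; a .realtor domain must contain it.
        self.client.post("/guest", {"name": "Nancy Smith"}, headers=headers)
        # 3. Run the availability search itself.
        self.client.post("/search", {"domain": "nancysmith"}, headers=headers)
```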
[00:11:19] Locust gives us a pretty nice web UI of what's going on in the system. Across the top here we have how many simulated users are in the system; in this case we have 2,400 locusts running across four slaves, and we're getting 88 requests per second with a 13 percent failure rate. The big table here is all of the endpoints that we're hitting, how many requests were made, how many times each was failing, and then the response times, with medians and averages and mins and maxes and things like that.
[00:11:47] If we drill into the failures, we can see that our users were getting 502s from the load balancer; basically our server wasn't responding fast enough. That's obviously not a good experience for our users, so we need to clean that up. Okay, so we've got our architecture, we've got our stubs, we've got our Locust script; now we need some place to run it.
[00:12:10] We set up another AWS cloud here with a master process and a number of slave machines. The master in Locust is responsible for taking all of the data from the slaves that are reporting back and giving you that nice web UI, so we used a two-core machine for that. The Locust slave machines are all CPU bound.
[00:12:31] The bigger the machine, the more slave processes you can run across it, and the more simulated users you can run as threads inside of those processes. So in that case, we were using eight-core machines. I think we ended up with something like 10 of these at some point, but we didn't start nearly that big.
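For reference, running Locust distributed in the version of that era looked roughly like the following; the file name and addresses are placeholders:

```
# On the two-core master box: serves the web UI and aggregates results from the slaves
locust -f locustfile.py --master --host=https://target.example.com

# On each eight-core slave box, started once per core
locust -f locustfile.py --slave --master-host=<master-ip>
```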
[00:12:50] So our methodology at this point was let's start small. 1000 simultaneous users? We'll get some data and then we'll grow our infrastructure as we need. Seems like a plan. In the meantime, while Eric and I were setting all this up, the rest of the development team was working on more performance based things like adding indexes to a database.
[00:13:12] Search always needs another index. They were adding some fake data to the database, so we didn't run into a situation where the first hundred thousand requests were really fast and it just got slower and slower as we added more rows. And they were using YSlow to optimize particular requests. YSlow is very concerned about, say, your assets.
[00:13:34] Are you using the right headers? Are you using compression? Things like that. Load testing tools generally do not download all your assets; they just hit endpoints. So you need a combination of both tools to really get a feel for the user experience and how it's working out for them. At about this time we could probably do 5,000 concurrent users without any trouble in our system.
[00:13:52] So it's time to do our first external test. We essentially unstubbed one of our dependencies, let the real dependency stand in its place, and ran our normal load test. It went badly: 95 percent failure rate. What are we going to do?
[00:14:17] So what we decided was that anytime we talked to this particular external provider, we would put it on a background task and run it as a job. So we made it asynchronous. This allowed us to hide any of those errors away from our users and do things like retries and throttling. And that helped solve that problem.
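The team did this with Rails background jobs (Delayed Job, and later Resque); purely to show the shape of the pattern, here is a minimal Python sketch of pushing a flaky external call into a queue with retries and a crude throttle:

```python
import queue
import time

job_queue = queue.Queue()

def enqueue(fn, *args, attempts=0):
    """Record the work; the web request returns to the user immediately."""
    job_queue.put((fn, args, attempts))

def worker(max_attempts=5, delay_seconds=2):
    """Drain the queue, retrying failed external calls after a short delay."""
    while True:
        fn, args, attempts = job_queue.get()
        try:
            fn(*args)
        except Exception:
            if attempts + 1 < max_attempts:
                time.sleep(delay_seconds)           # crude throttle before retrying
                enqueue(fn, *args, attempts=attempts + 1)
            # else: give up and alert ops; the user never saw the error
        finally:
            job_queue.task_done()
```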
[00:14:37] Asynchronous jobs are often like the golden hammer of performance. You can solve a lot of problems that way. But they do have drawbacks, including that you're cutting your user out of the process. So if there is something wrong with the data you got from a user, and you wouldn't know it until you talked to the third party, now you have no way of letting your user know there's a problem.
[00:14:57] So it's nice, but it may be too easy to reach for sometimes. Okay. So at this point we were throwing around numbers with our product owner: what could we expect on launch day? And the number we were throwing around at this point was something like 75,000 users over the first three days.
[00:15:21] So my brain goes into story-problem mode. All right, 75,000 users over 72 hours. Nobody's going to buy a domain at 3 a.m. Okay, so let's say 10 hours a day over those 3 days, so maybe 2 to 5 thousand an hour. Basically, I was thinking of this as if people would wait in line, get their domain like at the Apple Store, and then go home happy, right?
[00:15:43] Our product owner had done this before, and he's like, no, that's not how this works. There is one nancy.realtor in the entire world, and there are a lot of Nancys. This is a lot more like Black Friday, where there's one big-screen TV in that store and everybody wants it. And by the way, based on our marketing, we really feel like 100,000 is probably a more realistic number than 75,000.
[00:16:10] Now, I'm enough of an engineer and a worrier to think: okay, if you're expecting 100,000, I should be able to handle 200,000, because I would hate for that one extra person to bring the whole system down. This kind of shifted our thinking rather drastically. We're at a thousand, 5,000 or so now, and now we're going to a hundred thousand, two hundred thousand.
[00:16:32] We could imagine adding more nodes to the load balancer or adding more processes to handle background jobs, but we only had one database. Do we now want to get into sharding and distributed databases and all of that? What are we going to do? Luckily, our product owner says: go to Amazon, get the biggest database they offer.
[00:16:53] So we did. 32 cores, 244 gigs of RAM, hot failover between availability zones, and 200 gigabytes of provisioned IOPS storage. We did not need 200 gigabytes of space, but with provisioned IOPS, the more space you give it, the more performant it is. So give us what you got.
[00:17:17] And at that point we really couldn't run 200,000 users through our load testing framework yet. So we ran a few stress tests and just said, okay, this is going to have to work. We don't really have time to re-architect the entire application.
[00:17:33] Now, at this point, my team had deployed solutions using Ruby, Rails, and Unicorn to production several times, and it had worked pretty well for us. But the problem with Unicorn is that it dedicates a whole process to every single in-flight request, and 200,000 concurrent users is a lot of processes. So we started looking around, and at that time Puma was establishing itself as the performant Ruby web server, and it's pretty much there now.
[00:18:03] But the problem we had is we started looking through the documentation, and right on the readme it says all of the gems that you pull in should be thread safe. What does that mean? Is there a list somewhere? These gems are thread safe, and these are not? We've been pulling in gems for seven months on this project, and I'm a Ruby gem maintainer, and I have never given any thought to concurrency.
[00:18:24] I do this sort of thing in my free time; it solves a problem I have. I hadn't considered someone trying to run 200,000 copies of it at the same time. So we decided that we couldn't really make that change either, and we would just go with Unicorn and more hardware. All right, so let's get some big web nodes to go with that big database.
[00:18:44] Again, 32 cores, 244 gigs of RAM. This was the biggest that Amazon had at the time; they have even larger now if that doesn't seem big enough for you. I don't know about you, but that is a lot of resources. I had never provisioned anything nearly that big before in my life. So I was overwhelmed about where to start, because you read the docs and it's like, all right, you need this many of these per core, set this setting according to that.
[00:19:10] And Amazon throws out elastic compute units for everything, whatever that means. Luckily, places like Engine Yard that provision servers all day post their numbers online, so I could crib from that and start to build something up. So we're using Nginx in front of Unicorn, basically to give us compression and serve our static assets.
[00:19:31] You should never ask your app server for static assets. It takes forever. But everything else, we'll go ahead and reverse proxy back to Unicorn. Again, the slides will be up if you don't want to follow along too closely here, but we ended up with 32 workers, one per actual core on the EC2 instance.
[00:19:47] And then for every request we send back to Unicorn, we're opening a Unix socket. So I need as many file handles as I can possibly get on this system. 64K of those, and then, I don't care about keepalives. I need to slam that connection shut as quickly as possible so I can support the next user. I can't have it using resources.
[00:20:07] Shut that down. As we got further along, we played with these numbers quite a bit. They're application specific, but they got us some pretty nice performance wins as far as our capacity planning: things like read and write timeouts, and the client max body size. For our Unicorn configs, we started with 400 workers per node.
[00:20:30] We managed to trim that down to a hundred. Probably the most important thing here was the backlog. I think by default it's something like 64. This is essentially the number of requests that Unicorn will allow before it starts tail dropping and with all the resources we had we could easily handle a much larger line than 64 people.
[00:20:49] So we changed that. The other thing, if you haven't used Unicorn before, is that if Unicorn kills your process, it will throw away your stack trace, and you'll have no idea why the request took long enough for Unicorn to kill it. So what you can do is set up Rack, which sits inside Rails, to time out sooner and get your stack traces back.
[00:21:07] And on Ubuntu, we had to do a little bit of tuning, because I need as many file handles as I can get.
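Pulling the Nginx side of that together, a sketch might look like the following; the 32 workers, 64K file handles, and keepalive setting are taken from the talk, and the other values are illustrative placeholders rather than the project's actual config:

```nginx
worker_processes  32;                # one per core on the big EC2 instances
worker_rlimit_nofile 65536;          # every proxied request opens a Unix socket

events {
  worker_connections 8192;           # illustrative
}

http {
  gzip on;                           # compress at the edge, not in the app server
  keepalive_timeout 0;               # close connections immediately to free resources
  client_max_body_size 1m;           # illustrative; tuned per application
  proxy_read_timeout 30s;            # illustrative; tuned during capacity planning
  proxy_send_timeout 30s;

  upstream unicorn {
    server unix:/var/run/unicorn.sock fail_timeout=0;
  }

  server {
    listen 80;
    root /var/www/app/public;
    try_files $uri @app;             # serve static assets straight from Nginx
    location @app {
      proxy_pass http://unicorn;     # everything else goes back to Unicorn
    }
  }
}
```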
[00:21:14] So Locust gives us a good idea of what the performance feels like from a user standpoint. But as far as capacity planning, it's not very useful. I need to know how many nodes I should be putting in the load balancer. AWS has CloudWatch to give you some metrics, and for CPU on EC2 it's pretty good. But at the time there was nothing for memory.
[00:21:35] I think now you can download some Perl scripts. But it would have been really nice to know what was going on when my Delayed Job box kept running out of memory. On the other side of the aisle, RDS, which is where Postgres was hosted, is great. They've got all the metrics you could want: CPU, memory, swap, queue depth, stuff like that.
[00:21:53] And connections, those are pretty important. What did I learn? I feel like we didn't really drive out this lots and lots of concurrent users thing until we started talking about the load. Up to that point, we had an assumption in our mind that was not correct. And, I would have really liked to roll out something like Puma and save on the resources.
[00:22:15] We spent a lot of money on servers. But it was one of those changes where I want to roll it out and let the developers live with it for a while and tell me if there are issues. So we, again, we rolled out with what we had and it was okay, but I feel like if we'd started sooner, we might have done things differently.
[00:22:37] Okay. 21 days out, three weeks. So around this time we were going to do another external load test with one of our partners. I was pretty nervous about this one. We had been negotiating for a while and this was very important to what we were trying to accomplish. So we got them to agree to set aside time and resources to do a load test with us.
[00:22:57] Okay, using a production-like infrastructure. That's often the problem with dealing with external dependencies: they don't have a copy of production to just let other people use, but if you're using a test system, you're not going to get real numbers. So again, I was nervous about this. I'm running tests over the weekend to make sure everything's okay.
[00:23:14] I get up early Monday, start running tests, and I'm working through all that. We're going to do the load test right after lunch, so we all decided to stay in, and we're eating lunch, and we're joking, and watching YouTube videos, and having a good time. And about 20 minutes before time, all of my tests start failing.
[00:23:37] Stop the tests, start them again. Failing. Restart all my servers. Everything's failing. So I had to make the call. I had to cancel the load test that we had spent all this time negotiating and getting ready for, because there was too much noise in the system to do anything else. I felt terrible, and we looked bad, like we had no idea what we were doing.
[00:23:59] Essentially calling them up ten minutes beforehand and saying, Sorry, can't do it today.
[00:24:06] Turns out, we ran out of disk space. Now, I did not care about the logs that were coming from this system; we were going to send them all off to an external service. But Rails and Nginx still wanted to log everything somewhere, so we filled up our little SSDs pretty fast. We fixed this: I wrote some cloud-init scripts, mounted the ephemeral storage, and shoved it all out there.
[00:24:31] Oh man, what a hard lesson to learn at a very bad time. We also noticed while we were doing this, that when things started to go south, we opened a lot of connections to the database. And we started dragging it down with us. Nick Barrett and I got PgBouncer in place. And this thing's great, if you haven't used it before.
[00:24:52] Basically what it does is it sits in front of Postgres and allows a small number of connections through. And everybody else gets left out. Like a bouncer, right? After we put this in place, I don't think the CPU on my database ever went over 6%. No matter what I threw at it. It was very happy. So connections, they're important.
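A minimal pgbouncer.ini along those lines might look like this; the database name, pool sizes, and auth details are placeholders, not the project's actual settings:

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction     ; hand connections back as soon as a transaction ends
max_client_conn = 10000     ; how many app connections PgBouncer will accept
default_pool_size = 50      ; how many real Postgres connections get through
```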
[00:25:13] Okay. So we've sorted out our problems. We're going to go ahead and retest with the same partner: negotiate a new time, get ready to go. And then some strange things started happening. It looked like we couldn't push the load to our own system. We were getting lots of strange socket errors. It looked like the servers were essentially picking up the phone and slamming it shut before taking the request.
[00:25:39] And we had a hard time convincing each other that our data was good, because, again, we had just canceled a load test. Do we even have any idea what we're doing right now? One of the ideas we were throwing around: was an ISP throttling us? Was AWS rate limiting us? After all, we're essentially running a DDoS against our own system.
[00:26:00] So we decided we needed to find out, and we built the Locust cloud. Our system was running in AWS in Virginia, and our load test infrastructure was running in Oregon. So we started setting up slaves and stubs all over the place: Ireland, Singapore, California, Brazil. We put nodes on DigitalOcean in Toronto and on Linode in Atlanta.
[00:26:23] And we hammered our server, and it all worked. Nobody seemed to be throttling us, so that was good to know. And I just want to point out how hard this would have been if not for Eric Hankinson. So if you know Eric, or you'd like to get to know Eric, you should tweet at him and tell him that he's a Bash wizard.
[00:26:42] Essentially, he wrote a script that would pull all of our nodes from all of our accounts, figure out which ones were supposed to be Locust slaves, copy the configuration file to all of those slaves, and start them all up, multiple processes per machine, holding on to all the PIDs from those SSH connections. And then if I Ctrl-C'd the script, it would just shut everything down cleanly.
[00:27:03] So I would just log into the master, say go, and Ctrl-C when I was done, and that's all I had to do. It was great. It would have been so hard to coordinate that across all those different machines by hand.
[00:27:17] While we were still testing internally, I had this problem where I couldn't seem to push our system past 120,000 simultaneous users. I would add more of those giant nodes to the load balancer and might get another hundred or a thousand users, but that was it. This got pretty esoteric, but we ended up changing a lot of the TCP windowing settings in Ubuntu.
[00:27:40] Again, just come talk to me; I can point you at the articles I found. I think somaxconn was the important one here. And it makes sense: Ubuntu's defaults are not really tuned for 32-core, 244-gig machines. There were a lot of extra resources there that we weren't using. Once we got this tuning in place, we were able to get the kind of numbers we wanted.
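The exact values aren't given in the talk, but the tuning lives in this family of kernel and file-handle settings; the numbers below are illustrative:

```
# /etc/sysctl.conf -- accept more pending connections and use the available memory
net.core.somaxconn = 4096              # listen backlog limit (the important one here)
net.ipv4.tcp_max_syn_backlog = 4096    # half-open connections during bursts
net.ipv4.ip_local_port_range = 10240 65535
fs.file-max = 500000

# /etc/security/limits.conf -- more file handles per process
*  soft  nofile  65536
*  hard  nofile  65536
```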
[00:28:02] I think these are the best numbers I ever got: 150,000 users over 98 slaves at almost 1,200 requests per second. And I'm particularly proud of our sub-second search times. It was pretty awesome. The more astute of you might notice that it says 15,000 and not 150,000. When we built our user scripts, we actually had someone with a stopwatch as they went through the forms.
[00:28:26] And so we would do a little sleep there while the form was allegedly being filled in. The problem with Locust, or any tool like this, is that it takes a while to spin everything up and get everybody running. So we found that if we cut all those waits to a tenth, we could also cut the number of users to a tenth and get the same amount of load.
[00:28:42] That was pretty nice. With our tuning we found we needed about eight web nodes. So you can see this is one of our shorter test runs where, again, we took all this time to bring it up to speed. And then we ran at about 80 to 90 percent, which is where I want my web nodes to be, before we shut it down.
[00:28:59] Here's our numbers at 200,000 users. Doing pretty well. We're up to more like 4-second response times, which isn't great, but considering the load we were under, way better than just melting down. I was okay with that.
[00:29:15] So this is the architecture we actually ended up with. We basically added two boxes: one for PgBouncer in front of Postgres, and one so Delayed Job got its own resources. This actually ended up being Resque later on, for a lot of reasons.
[00:29:32] So we retested with that provider, and everything went well. Yay! Eight days before launch. The next day, one of our other load test partners comes back and says, hey, we analyzed the data from your load test, and it looks like you guys are doing A, B, C, and you really ought to be doing A, D. Now we're having a conversation seven days before we go live about whether or not we should change our code to make that more efficient.
[00:29:59] Again, maybe if we'd started load testing a little earlier, we would have had this conversation a little earlier. And then another provider the next day is like, oh, by the way, based on what you did with us, we're running our own internal load test. And I'm freaking out: what does that mean? Do I need to rerun all my tests based on changes you guys make?
[00:30:18] Six days before launch, and I'm building production right now. So other than starting earlier, the key thing I learned in this stage was that you have to be nice to people. The point of load testing is to break things. We need to figure out where performance degrades and start thinking about contingency plans.
[00:30:40] And if you've been running your system in production for years, you're really not going to take kindly to someone coming and telling you that it's not very performant and it has all these problems. We've got to figure out ways to build trust and have conversations. Be able to agree on data, things like that.
[00:30:57] Be nicer. Okay, three days before launch. It's game time. If you're using AWS and you have an Elastic Load Balancer, it's important to know that the Elastic Load Balancer scales up as traffic builds up, which works great unless all of your traffic shows up at once, which was our case. So you can ask them to pre-warm the ELB, and they'll add all the necessary resources to make it work correctly.
[00:31:23] You just fill out a form. Amazon also has limits on the number of nodes you can actually add to your account. When I tried to put 32 of those jumbo nodes in my load balancer, they balked. I had to have a phone conversation and convince them that our credit was good and it was okay to do this. You don't want to have that conversation on launch day.
[00:31:44] And then anybody that's ever dealt with AWS before knows that if you need to start one instance to do one thing right before you walk out the door, it will, without fail, get stuck on one of two status checks. And you'll sit there and go, should I wait? Should I restart it? Start all your instances the day before, pay the extra few dollars to leave them running and you'll sleep much better.
[00:32:08] We did a soft launch. Everybody should do soft launches; these things are great. They helped us validate some of our assumptions and find out that users do weird things. I never thought anyone would put that in a fax number field. That's crazy! All right, let's add some validation checks for that. And it let us find out that, with some of our SaaS providers, unlimited does not mean unlimited.
[00:32:36] There are limits; you should find out what they are. And it let us do some straight-up hacks. We had one provider that we didn't get to load test with beforehand, and they said, it looks like you guys are sending us requests out of order. Maybe it's because we're doing things asynchronously; that's weird, but all right, here's what we'll do: we'll single-thread all the requests to your server and put them in a queue.
[00:32:58] A terrible hack, but three days before launch, or at that point more like two days before launch, it was a thing we could at least do. All right, launch day. So someone tried to get my attention right before our window was going to open up. We basically had a window in place to say, you can't buy domains until 10 a.m.
[00:33:19] And they're like, Steve, there's 30,000 people on the site. I'm like, eh, whatever, I've been testing with 200,000. This is what was actually there. I did look up and take a screenshot, and this is the last one I took, so you have to take my word for it from here on out. But five minutes beforehand, we had 5,000 people banging on the gate trying to get their domains.
[00:33:40] Doing all that load testing was really good practice for the launch. I had my Google Analytics over here, my CloudWatch here, and Postgres running some queries to keep track of Delayed Job. And I really felt like I was in control of the launch and knew what sort of things I needed to worry about, which is no small thing if you've been running ops during something like that.
[00:34:00] It felt pretty good. But even with all that practice, we still had problems. The one dependency I had not worried about at all, because they certainly had the resources and infrastructure to handle us, ended up being our biggest bottleneck. It backed up our orders for three days. So that was a bummer. But all in all, it was a success.
[00:34:27] We did not get 100,000 users promptly at 10 a.m. We had somewhere between 10,000 and 20,000, and then a steady 10,000 users for the rest of the hour. We sold 75,000 domains in the first 48 hours. So overall, a success, and everyone's pretty happy with what they got. All right, wrapping up what I learned. I was a big proponent of making as much asynchronous as possible.
[00:34:57] I'm the one watching these performance numbers, and I know we can do better. And again, as I mentioned before, part of the problem here is that some of the things you do will require your user's feedback to fix. That was part of the problem with the orders getting backed up: I had to figure out what the user was supposed to have put in those fields.
[00:35:18] Job systems are also pretty hard to load test. So even if we hadn't had that sort of problem, maybe we wouldn't have done a good job of doing our load testing. There was also this synchronous check that we had in place and I argued it away because I felt like we're the only ones using this API in this particular way so we don't need to do that check.
[00:35:37] And we had some issues that may have been prevented if we had that check in place. So I still think about that a little bit. But I think our number one thing was that we didn't think a lot about failure conditions, which makes sense. When you're writing a task and you have to talk to, say, an API three times to complete that task.
[00:35:58] It's weird to think about: it worked the first time, it worked the second time, but it failed on the third call because the server was under so much load that it couldn't handle it. And so now you have to think about: were the first two calls destructive? Can I just rerun this whole thing? Am I in an inconsistent state? Where are things at?
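One common way to cope with that, sketched below with hypothetical step names, is to record which steps of a job have already completed, so the whole job can be re-run safely after a mid-flight failure:

```python
completed_steps = {}   # in practice this lives in the database, keyed by order id

def already_done(order_id, step):
    return step in completed_steps.setdefault(order_id, set())

def mark_done(order_id, step):
    completed_steps[order_id].add(step)

def process_order(order_id, register_domain, charge_card, configure_dns):
    """Each step is skipped if a previous attempt already finished it,
    so the job can simply be re-enqueued after a partial failure."""
    for step, call in [("register", register_domain),
                       ("charge", charge_card),
                       ("dns", configure_dns)]:
        if already_done(order_id, step):
            continue
        call(order_id)            # may raise if the third party is overloaded
        mark_done(order_id, step)
```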
[00:36:17] It's important to make your jobs idempotent. And I gained a lot of empathy for my users. As frustrating as it is to buy tech sometimes and hear, oh, it's going to be a month until it gets delivered, at least they tell you up front, usually. In this case, someone thought they were getting a website, and they just had to hang out for a couple of days while we sorted it out.
[00:36:38] And that sucked. I did not enjoy that. I spent a lot of time with this load test and this script. I really was trying to milk as much performance out of it as I could so we could support those 200,000 users. But in the end, that script was not reality. My users did not do exactly what I thought they would do.
[00:36:59] And you get a lot of data out of these test runs, and there are a lot of little knobs that you can potentially tweak to make things better, so it's hard to be methodical about it. Especially when it's so expensive: we set up an entirely separate architecture to do our load testing, and I can't really imagine doing it any other way.
[00:37:17] But that really makes it so I want to cut corners so I don't waste two hours of time. All in all, though, I'd say it was entirely worth it. We caught a lot of problems that we would have missed otherwise. It was good. At this point, these are the questions I'm left with, so if you have any ideas or perspectives on this, I'd love to hear them.
[00:37:43] I made the comment several times I thought we should have started earlier, but I don't know exactly when we should have started. Day one seems a little too early. We don't really have anything to load test. I'm wondering if it's maybe like continuous integration, where we've learned that if you put it off, it really hurts.
[00:38:00] So you should do it all the time, and then it only hurts a little bit. Maybe that makes it cheaper? I don't know, but if you have perspectives or thoughts on this, I'd love to hear them. That's all I have. Thank you so much for your time and attention. Please reach out to me. I'd love to hear more from you guys.