When we’re working on fixing an immediate problem, especially one that’s affecting customers, it’s difficult to stop and take a breather. But sometimes, a breather is exactly what is needed to solve the issue.
One Step Back
Last month was a bit rough for our Big Data team. We spent most of the month heads-down fixing issues with Rankings and Keyword Difficulty, and our technical debt was creeping up on us. I wanted to give in to my natural urge to hunker down, chew on the issues, and come up with a plan that would fix as much as I could. However, I had a weekly 1 on 1 meeting scheduled that seemed to be getting in the way of my plan to lie low and solve problems.
Here at Moz, each employee attends weekly or bi-weekly 1 on 1 meetings with managers or teammates to help keep our goals on track. 1 on 1 meetings are a chance for teammates to act as sounding boards for project ideas and idea generators for solutions to issues. These meetings are an important part of our culture, but on this particular day my focus was elsewhere and I didn’t feel I had time for my 1 on 1 with Matt Peters, our rock star data scientist. Realizing that we had missed our last meeting, I begrudgingly made time to fit the meeting in. After our usual good talk on algorithms, correlations, and next steps for growing his team, we started bouncing ideas off each other on how to save money on processing. We were spending $800,000 on processing and not really getting anything for it. The current plan was simply unsustainable.
Matt, in his very scientific way, broke down the problem in exact numbers. I, however, will break them down for you in a very Anthony way:
- Long-term, we knew we needed to fix the issues we were having with Amazon, but we were reacting to missing our index release date instead.
- Short-term, it seemed sensible to spin up more servers and get the index done more quickly.
- In reality, spinning up more servers at Amazon was only increasing our costs and our server failures. The current solution was not addressing the problem, and in some ways it was making the problem worse by taking time away from the team’s efforts to fix the long-term issues.
Taking a step back from the immediate problem made it clear that our current approach wasn’t working.
*Server photo by Kim Scarborough used through creative commons license.
Coming Up with a Better Plan
After the insight I gained in my 1 on 1 with Matt, it was clear we needed to change our approach. Matt and I outlined a high-level plan for lowering our costs, with the potential added bonus of getting indices out on time. We figured it might be a hard sell after telling the team, “Don’t miss the date at all cost,” for the last two months. They’d spent hundreds of hours trying to keep all of those servers up, and we weren’t sure how open to this change they would be.
However, Carin, our stellar Manager of Big Data, brought the team together and we all agreed on the plan. Carin outlined the issues and then proposed the new approach in this snippet from her email to Rand:
The New Plan:
- Run two indexes at most in AWS:
  - One cluster on 80 cc2.8xlarge machines – these are HUGE and more expensive, but should complete an index in less time, making them cheaper over the month.
  - If necessary, run a backup index on 200 smaller c1.xlarge machines (current setup).
- Continue to maintain an index size of 60 – 70 billion URLs to keep processing time reasonable.
This plan allows for engineering time to tackle the larger problems: develop a testing environment and improve the Mozscape code base. Most importantly, though, we can distribute PLDs across processing shards in a more efficient manner, which could lead to significant time savings in processing.
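The reasoning behind "more expensive, but cheaper over the month" comes down to simple arithmetic: the cost of an index run is roughly machines × hourly rate × hours to finish. Here is a minimal back-of-the-envelope sketch of that comparison. The machine counts come from the plan above; the hourly rates and run times are made-up placeholders chosen only for illustration, not Amazon's prices or our measured processing times, and the `Cluster` class and `break_even_hours` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    machines: int            # machine count, taken from the plan above
    hourly_rate: float       # USD per machine-hour (placeholder, NOT a real price)
    hours_per_index: float   # hours to finish one index (placeholder, NOT measured)

    def cost_per_index(self) -> float:
        # Total cost of one index run: machines x rate x hours.
        return self.machines * self.hourly_rate * self.hours_per_index


def break_even_hours(big: Cluster, small: Cluster) -> float:
    # Longest run time at which the big cluster still costs no more
    # than one index run on the small cluster.
    return small.cost_per_index() / (big.machines * big.hourly_rate)


# All rates and durations below are illustrative assumptions.
big = Cluster("80 x cc2.8xlarge", machines=80, hourly_rate=2.5, hours_per_index=200)
small = Cluster("200 x c1.xlarge", machines=200, hourly_rate=0.5, hours_per_index=600)

for cluster in (big, small):
    print(f"{cluster.name}: ${cluster.cost_per_index():,.0f} per index")

print(f"Big cluster stays cheaper if it finishes within "
      f"{break_even_hours(big, small):.0f} hours")
```

With these placeholder numbers, the larger machines cost five times as much per hour each, yet the cluster as a whole comes out cheaper because it finishes sooner. Plugging in real rates and observed run times is what would tell you whether the same holds for an actual workload.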
Two Steps Forward
Luckily, Rand approved the plan, and the time and energy spent to take a step back really paid off. Newer, better, bigger equipment did the job, with no server failures and no operational headaches. The October index release is the result of the change. It finished in record time and only cost $100,000, compared to the $800,000 spent last month.
We learned quite a few things from this experience, but this was our most important takeaway: the times when you feel like you don’t have time to step back and reassess are exactly the times when you should. It may not always save you $700,000, but there is a chance that it might. The time spent gaining a new perspective can bring solutions to light that you’d have never seen if you’d kept your nose to the grindstone!
We are hopeful that future indexes run as smoothly as October’s, and if they don’t, we’ll remember our own advice and take a step back before moving forward.