Enter #d11hack — a hackathon where participants were required to build the fastest web crawler possible. With some really cool merchandise & prizes up for grabs, it was only a matter of time before everyone stormed our booth and got down to solving the problem.
Right from the beginning, our booth was abuzz with curious participants and the energy was palpable in everyone who visited us. The excitement at the venue could also be sensed on Twitter, where hashtags such as #d11hack & #jsfoo were catching on with the crowd.
We started off brainstorming almost a week prior to the event about various problems that might catch people’s interest. The Web Crawler had the right balance of complexity & fun, and also adhered to our philosophy of —
Great challenges are easy to start with but tough to master
Building a web crawler is essentially a graph traversal problem. You’re given a website where pages link to other pages. Some of them may link back to pages you’ve already crawled, and your crawler could end up in a loop. Thus, you need to keep a record of all visited pages and make sure they aren’t crawled again.
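The idea can be sketched as a breadth-first traversal over a hypothetical in-memory link graph (standing in for real HTTP requests); the `visited` set is what breaks the loop:

```typescript
// Hypothetical site: each page maps to the pages it links to.
// Note the cycle: /about links back to /.
const links: Record<string, string[]> = {
  "/": ["/about", "/blog"],
  "/about": ["/", "/team"],
  "/blog": ["/about"],
  "/team": [],
};

// Breadth-first traversal; the `visited` set stops us from
// crawling the same page twice and looping forever.
function crawl(start: string): string[] {
  const visited = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const page = queue.shift()!;
    for (const next of links[page] ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return [...visited];
}
```

Without the `visited` check, the `/` → `/about` → `/` cycle would make this loop forever.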
Loading a page takes time. Crawling serially is not the most efficient approach & ideally, you should parallelize as much as possible. Since each page links to multiple other pages, it is better to request them in parallel and continue parsing as and when each response is received.
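One way to sketch this, using a simulated `fetchPage` with an artificial delay in place of a real HTTP client: every page in the current frontier is requested concurrently with `Promise.all`, so each level of the graph costs roughly one round trip instead of one per link.

```typescript
// Hypothetical site graph; fetchPage resolves to a page's outgoing
// links after a simulated network delay.
const graph: Record<string, string[]> = {
  "/": ["/a", "/b", "/c"],
  "/a": [],
  "/b": [],
  "/c": [],
};

function fetchPage(url: string): Promise<string[]> {
  return new Promise((resolve) =>
    setTimeout(() => resolve(graph[url] ?? []), 50));
}

// Crawl level by level, fetching the whole frontier in parallel.
async function crawlParallel(start: string): Promise<string[]> {
  const visited = new Set<string>([start]);
  let frontier = [start];
  while (frontier.length > 0) {
    // Request every page in the frontier concurrently.
    const results = await Promise.all(frontier.map(fetchPage));
    const next: string[] = [];
    for (const linksOnPage of results) {
      for (const url of linksOnPage) {
        if (!visited.has(url)) {
          visited.add(url);
          next.push(url);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```

Here the three child pages of `/` take ~50ms in total rather than ~150ms, which is the whole argument for parallelizing.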
To make people’s lives easier, we created an open source GitHub repository with the boilerplate code and a unit test. At Dream11, we take TDD to heart — so we added unit tests beforehand to help developers test & benchmark their code once they’re done. We also integrated the project with Travis, so that their code could be automatically executed on a standard machine setup. This helped us save a lot of time later, when we added our hidden test cases.
We also added some caveats to the problem statement. For instance, while crawling you need to simultaneously find the lexicographically smallest word on the website. Furthermore, the web server automatically starts throttling the response after some time and eventually sends a 429 status code. You can use any library, but your solution must pass all our test cases. We started with one test case, so that people could run it locally on their systems for instant feedback. But then we observed that people were submitting solutions with a run time of 10s (as compared to our internal benchmark of around 34s)! At the end of Day 1, we added some more test cases just to be fair to those who were taking longer to crawl the website. We made this change overnight and back-merged all 21 pull requests received on Day 1. As expected, a lot of builds started to fail.
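The two caveats can be sketched independently of any real crawler. Below, `request` is a hypothetical stand-in for the throttling server (it fails with a 429-style error twice, then succeeds), `fetchWithRetry` shows one common reaction — exponential backoff — and `smallestWord` tracks the lexicographically smallest word seen so far. None of these names come from the actual contest repo; they are illustrative only.

```typescript
// Hypothetical flaky server: throttles the first two calls, then succeeds.
let calls = 0;
async function request(url: string): Promise<string> {
  calls++;
  if (calls <= 2) {
    throw Object.assign(new Error("Too Many Requests"), { status: 429 });
  }
  return "<html>zebra apple mango</html>";
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Retry with exponential backoff when the server starts throttling.
async function fetchWithRetry(url: string, retries = 5): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request(url);
    } catch (err: any) {
      if (err.status !== 429 || attempt >= retries) throw err;
      await sleep(2 ** attempt * 10); // back off: 10ms, 20ms, 40ms...
    }
  }
}

// Fold each page's text into a running "smallest word seen so far".
function smallestWord(text: string, current: string | null): string | null {
  for (const word of text.match(/[a-z]+/gi) ?? []) {
    if (current === null || word < current) current = word;
  }
  return current;
}
```

The trade-off participants faced: back off too aggressively and your total run time balloons; too little and the 429s keep coming.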
Everyone had to submit their solutions by 3:30 pm on Day 2. People were anxious because their builds were failing and the deadline was fast approaching. At the same time, we announced that we’d be adding some more hidden test cases that their solutions would also have to pass. This prompted everyone to quickly fix their code, remove all hard-coded settings and update their pull requests.
Moment of Truth
We were surprised to receive some fascinating solutions, most of which ran in under 30s! This was much lower than our original estimate of 34s. In a bid to top our ‘Leaderboard’, people were trying to beat each other’s time by implementing various hacks such as:
- Throttling requests based on the size of the graph
- Setting a maximum concurrency on their crawler
- Exploiting hints returned by the server in special headers such as X-RateLimit-Limit and X-RateLimit-Remaining
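The third hack can be sketched as follows — note that the header names and the `nextBatchSize` helper are illustrative, not taken from the contest server; the idea is simply to cap the next batch of requests at whatever budget the server says remains:

```typescript
// Hypothetical response headers as a rate-limited server might return them.
const headers: Record<string, string> = {
  "x-ratelimit-limit": "100",    // max requests per window
  "x-ratelimit-remaining": "12", // requests left in this window
};

// Use the server's hints to decide how many requests to fire next:
// never schedule more than the remaining budget allows.
function nextBatchSize(
  headers: Record<string, string>,
  pending: number
): number {
  const remaining = parseInt(headers["x-ratelimit-remaining"] ?? "", 10);
  if (Number.isNaN(remaining)) return pending; // no hint: go full speed
  return Math.min(pending, Math.max(remaining, 1)); // keep at least 1 in flight
}
```

Crawlers that read these hints could stay just under the throttling threshold instead of blindly hammering the server and eating 429s.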
Although the submissions were supposed to pass all test cases, we considered only the run time of the first test case given on Day 1, which people had spent the most time optimizing for.
And after being flooded with pull requests throughout, we finally got down to declaring the winners. With run times of 29.37s & 29.67s respectively, Rahul Kadyan & Varenya emerged as clear winners on our leaderboard!
People were so intrigued by our problem statement that the buzz continued long after the deadline had passed. We kept receiving pull requests on our GitHub repo & the run times kept getting more impressive!
Overall, our presence at JSFoo helped us connect with some great people and the Hackathon brought out the competitive spirit in engineers, resulting in some really interesting solutions.