Building a Search Engine with Udacity
December 11, 2012
For SEOs, the web crawler is a powerful tool. When conducting technical audits, competitive analyses, what have you, we use web crawlers (like Xenu’s Link Sleuth or Screaming Frog’s SEO Spider) to navigate internal linking structures and collect data. These handy utilities take much of the human effort out of discerning top-level page attributes. Feed in a starting URL and—should fortune favor the HTML—you’ll receive the titles, meta descriptions, server response codes, etc. of a healthy selection (if not all) of the website’s URLs (not to mention the URLs themselves).
Having near-immediate access to these various page attributes is valuable for a number of reasons. We can efficiently:
- Pinpoint ill-advised 302 redirects
- Augment 404 tallies pulled from Google/Bing Webmaster Tools
- Find pages with multiple first-level headers
- Ascertain specified canonical URLs
The list goes on and on (until it stops), really. Every day, SEOs are coming up with new, creative ways to make use of web crawlers. The guys from Distilled have offered up insights on both Screaming Frog and Xenu’s Link Sleuth. If you’re interested in getting more out of your crawls (web, not bar), I encourage you to check out these resources.
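To make a couple of the checks above a little more concrete, here is a minimal sketch of how a tool might pull a page's title, count its first-level headers, and read its canonical URL from raw HTML. It uses only Python's standard-library `html.parser`. The names `PageAudit` and `audit` are my own, purely for illustration; this is not how Screaming Frog or Xenu's Link Sleuth is actually implemented, and real-world HTML will need more forgiving handling than this.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collects a few top-level attributes from a single HTML document.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1_count = 0
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            # <link rel="canonical" href="..."> names the preferred URL
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept
        if self._in_title:
            self.title += data


def audit(html):
    """Return the title, h1 count, and canonical URL found in `html`."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "h1_count": parser.h1_count,
        "canonical": parser.canonical,
    }


# A made-up page with two first-level headers
sample = (
    "<html><head><title> Home </title>"
    '<link rel="canonical" href="http://example.com/">'
    "</head><body><h1>Welcome</h1><h1>Again</h1></body></html>"
)
report = audit(sample)
print(report)
```

A page whose `h1_count` comes back greater than one is exactly the kind of thing a crawl report would flag for a closer look.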
Comprehending the Crawler
Learning to use a web crawler for simple tasks is fairly straightforward. However, if you’re a curious user (like me), you’ll find yourself wondering, “How does this thing work?” And, furthermore, “How does it collect all of this data so quickly?” In fact, as an SEO, you may have started asking these questions long before you executed your first crawl in Screaming Frog.
“If I wanted to create my own search engine, how would I get started?”
While I never actively sought the programmatic answer to this question, I suppose that I always asked it subconsciously. Thus, when I stumbled upon Udacity’s Introduction to Computer Science course, “Building a Search Engine,” I was floored with excitement. The SEO in me jumped at the opportunity to garner a more comprehensive understanding of the programs that power our industry. Needless to say, I enrolled.
Learning with Udacity
Udacity is one of a number of startups trying (and succeeding, I might add) to make high-level education accessible and affordable. Courses are generally six weeks long and consist of interactive lessons taught by university professors and industry professionals. The “Building a Search Engine” course is taught by David Evans, a Professor of Computer Science at the University of Virginia. While the course does demand a good chunk of time, I can promise you that it’s worth it (it’s completely free!). Here’s an introduction to the course:
Sounds interesting, right? You may be thinking, “Why do I need to learn about the fundamental, programmatic building blocks of the search engine to understand its functionality?” Well, really, you don’t. Oftentimes, though, understanding the underlying machinery (even at a simple level) of a given concept (be it software or sport) makes for a more exact comprehension of the concept itself. Ever find yourself talking to a client, saying something along the lines of “Google finds these internal pages on your website by crawling links from your home page”? Probably. But what does “crawling links” mean? What’s happening programmatically when Googlebot encounters an anchor tag? How do crawlers extract pertinent data from the pages that they crawl?
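For a rough feel of what “crawling links” can look like in code, here is a small Python sketch. It is my own illustration, not the course's code and certainly not how Googlebot works. It pulls the `href` out of every anchor tag on a page, resolves it against the page's URL, and queues any URL it hasn't seen before. To keep the example runnable without touching the network, the “web” is a made-up dictionary (`FAKE_SITE`) and the `fetch` argument is an assumed stand-in for a real HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links such as "/about" against the page URL
            self.links.append(urljoin(self.base_url, href))


def get_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`.

    `fetch(url)` is assumed to return the page's HTML, or None if the
    page cannot be retrieved.
    """
    to_crawl = deque([seed])
    seen = {seed}
    crawled = []
    while to_crawl:
        url = to_crawl.popleft()
        html = fetch(url)
        if html is None:
            continue  # e.g. a 404: nothing to parse
        crawled.append(url)
        for link in get_links(url, html):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Hypothetical three-page site standing in for real HTTP responses
FAKE_SITE = {
    "http://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/blog": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(pages)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to one another, which on most websites they do constantly.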
While finding the answers to these questions mightn’t be a task that you deem ‘necessary’, I daresay that there are plenty of benefits to be had. Chief amongst these benefits is the ability to speak—to clients, to peers, at meetups, at conferences, in blog posts—not just accurately, but confidently, as well. Plus, it’s a whole lot of fun and, if you haven’t yet had an introduction to Computer Science, Python is a great place to start.
Is this required material for SEOs? Absolutely not. Do I highly recommend it? Absolutely. If you build it, no one will come (they have Google, Bing, DuckDuckGo, etc.), but you’ll certainly understand search engines a little more clearly. In an industry where “understanding the search engines” is one of the integral tasks, I’d say that this free opportunity is an incredibly rewarding one. Give it a try.