Data Mining the Web: $100 Worth of Priceless

Summary

I have always been a fan of what I call audacity in engineering, and today’s toolsets make work at Web scale not only possible, but economically feasible, even for lone engineers toiling in tiny startups.
With the right tools, we can not only think big but also execute on a scale never before imagined outside of large corporations or universities. A few weeks ago, while working on prototype search technology for Lucky Oyster, we combined a few simple components (data from Common Crawl, Spot Instances from AWS, a few hundred lines of Ruby, and assorted open source software) to mine 3.4 billion Web pages, extract close to a terabyte of structured data, and build a searchable index of close to 400 million entities.
CommonCrawl. Correlate.