
Datamining



Data Mining the Web: $100 Worth of Priceless

Summary: I have always been a fan of what I call audacity in engineering, and today’s toolsets make work at Web scale not only possible but economically feasible, even for lone engineers toiling in tiny startups.

With the right tools, we can not only think big but also execute on a scale never before imagined outside of large corporations or universities. A few weeks ago, while working on prototype search technology for Lucky Oyster, we leveraged a few simple components (data from Common Crawl, Spot Instances from AWS, a few hundred lines of Ruby, and assorted open source software) to data mine 3.4 billion Web pages, extract close to a terabyte of structured data, and build a searchable index of close to 400 million entities.
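To make the extraction step concrete, here is a minimal Ruby sketch of pulling structured fields out of a single page. The excerpt does not say which structured data was extracted or how, so everything here is an assumption for illustration: the choice of Open Graph-style `<meta property="og:..." content="...">` tags, the `extract_entity` method name, and the regex-based parsing (a real pipeline would more likely use a proper HTML parser).

```ruby
# Hypothetical sketch, not the article's actual code.
# Assumption: the "structured data" is Open Graph-style meta tags.

# Matches a whole <meta ...> tag.
META_TAG = /<meta\s+[^>]*>/i
# Matches one name="value" or name='value' attribute inside a tag.
ATTRIBUTE = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns a Hash of og:* properties found in the given HTML string.
def extract_entity(html)
  entity = {}
  html.scan(META_TAG) do |tag|
    attrs = {}
    # Collect the tag's attributes, whatever order they appear in.
    tag.scan(ATTRIBUTE) { |name, dq, sq| attrs[name.downcase] = dq || sq }
    key = attrs["property"]
    value = attrs["content"]
    # Keep only Open Graph properties that actually carry a value.
    entity[key] = value if key && value && key.start_with?("og:")
  end
  entity
end

page = '<head><meta property="og:title" content="Example Page">' \
       '<meta name="viewport" content="width=device-width"></head>'
p extract_entity(page)
```

Under these assumptions, each worker would run a method like this over every page record it reads from the crawl data and emit one small Hash per page, which is the kind of output that can later be loaded into a search index.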