Data Science at Mark Needham. Archive for the ‘Data Science’ Category Data Science: Don’t build a crawler (if you can avoid it!)
On Tuesday I spoke at the Data Science London meetup about football data and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available. When that’s the case you often end up scraping HTML pages to extract the data that you’re interested in and then storing that in files or in a database if you want to be more fancy. Ideally we want to spend our time playing with the data rather than gathering it so we we want to keep this stage to a minimum which we can do by following these rules. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. Weka is a collection of machine learning algorithms for data mining tasks.
The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Found only on the islands of New Zealand, the Weka is a flightless bird with an inquisitive nature. Hazy. CommonCrawl.
Visualization. Map Reduce. EMR. Hadoop. NoSQL.