Big Data

Data Science at Mark Needham

Archive for the ‘Data Science’ Category. Data Science: Don’t build a crawler (if you can avoid it!)

On Tuesday I spoke at the Data Science London meetup about football data, and I started out by covering some lessons I’ve learnt about building data sets for personal use when open data isn’t available. When that’s the case you often end up scraping HTML pages to extract the data that you’re interested in and then storing it in files, or in a database if you want to be more fancy. Ideally we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules.

Weka 3 - Data Mining with Open Source Machine Learning Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Found only on the islands of New Zealand, the Weka is a flightless bird with an inquisitive nature.

Hazy. CommonCrawl.

Document Indexing

Visualization. Map Reduce. EMR. Hadoop. NoSQL.