
BayesiaLab 5.1: Analytics, Data Mining, Modeling and Simulation
Bayesia starts off 2013 with countless innovations in the newly released BayesiaLab 5.1. BayesiaLab raises the benchmark in the field of analytics and data mining; the improvements range from small practical features to entirely new visualization techniques that can transform your understanding of complex problems. Here is a small selection of the features introduced in version 5.1:
- A comprehensive Mapping Tool offering an entirely new way to visualize and analyze networks
- Occurrence Analysis for diagnosing sparse conditional probability tables
- Binary Clustering and Multiple Clustering to create latent variables with logical expressions
- Enhanced Resource Allocation Optimization and Target Optimization
- A Design of Experiments tool for generating questionnaires
- A new Radial Layout, plus the world's first Distance Mapping layout based on Mutual Information
- Box Plots for analyzing the distributions of numerical variables
boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document itself (no global or site-level information is required), and is usually quite accurate. boilerpipe is a Java library written by Christian Kohlschütter and is released under the Apache License 2.0. Its algorithms are based on (and extend) concepts from the paper "Boilerplate Detection Using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA. Commercial support is available through Kohlschütter Search Intelligence. The current release is boilerpipe 1.2.0 (2011-06-06).
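As a quick illustration of how little setup the library needs, here is a minimal sketch based on its documented usage; the URL is a placeholder:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

import java.net.URL;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; any news-article page would do.
        URL url = new URL("http://example.com/some-news-article.html");

        // ArticleExtractor is the strategy tuned for news articles.
        // getText() fetches the page and returns only the main text,
        // with navigation, templates, and other boilerplate removed.
        String text = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(text);
    }
}
```

Other strategies (e.g. DefaultExtractor) trade some precision for generality when the input is not a news article.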
geoloqi/MapAttack - GitHub
Processing Qualitative Research Data With Tinderbox
I wrote a while back that I often use a piece of software for the Mac called Tinderbox to churn through messy, unstructured focus group data and see the meaning and inherent structure in a soup of qualitative data. Last November I was fortunate to be asked to present my method at a Tinderbox Weekend by Tinderbox auteur Mark Bernstein. It's a complicated process to set up, but once it is set up correctly you can zip through qualitative research data pretty quickly and develop structure in the process. Mark (and Eastgate's Stacy Mason) have been noodging me to make a screencast of this process, and I've finally gotten around to doing just that.
How to crawl a quarter billion webpages in 40 hours
More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web, and what it means to do so. In this post I describe some details of what I did. Of course, there's nothing especially new here: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing.
Code: Originally I intended to make the crawler code available under an open source license at GitHub. There's a more general issue here, which is this: who gets to crawl the web? I'd be interested to hear other people's thoughts on that question.
Architecture: Here's the basic architecture: the master machine (my laptop) begins by downloading Alexa's list of the top million domains.
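The author's crawler code was never released, so the following is only a hedged sketch of the architecture the post describes: a master machine that partitions Alexa's domain list across the 20 worker instances. Hashing each domain to a fixed worker keeps all of a site's URLs on one machine, so politeness rules can be enforced without cross-machine coordination. All names below (DomainSharder, WORKERS, top-1m.txt) are hypothetical, not the author's:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DomainSharder {
    // 20 EC2 worker instances, matching the post's setup.
    static final int WORKERS = 20;

    public static void main(String[] args) throws IOException {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < WORKERS; i++) {
            shards.add(new ArrayList<>());
        }

        // Hypothetical input: Alexa's top-million list, assumed here to be
        // one domain per line.
        try (BufferedReader in = new BufferedReader(new FileReader("top-1m.txt"))) {
            String domain;
            while ((domain = in.readLine()) != null) {
                // Hash each domain to a fixed worker so that all URLs of a
                // given site live on one machine; robots.txt handling and
                // crawl delays can then be enforced locally.
                int shard = Math.floorMod(domain.hashCode(), WORKERS);
                shards.get(shard).add(domain);
            }
        }

        // In a real deployment each shard would be shipped to its EC2
        // instance; here we just report the shard sizes.
        for (int i = 0; i < WORKERS; i++) {
            System.out.printf("worker %d: %d domains%n", i, shards.get(i).size());
        }
    }
}
```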
Princeton or Prison: Which is More Expensive?
Each month, when over $400 is automatically deducted from my checking account, I can't help but wonder: why did I choose to go to a private university that cost in excess of $20,000 a year (even with scholarships)? And why did 17-year-old me think it was okay to imprison then-future me in 17 years of debt? So it's not particularly surprising that when I came across this infographic on Fast Co. Design comparing the cost of Ivy League higher education with the cost of incarceration, it gave me pause. Perhaps one cause of such large individual student-loan debt is that, at the federal level, more money is spent on corrections than on higher education. In addition, a handful of states, including New Jersey, also spend more on incarceration than on universities. As for the title question: it's prison. New Jersey has an inmate population of 26,757.
The Personal Wiki System
ConnectedText is used in a variety of ways and in many contexts. I am always surprised to hear how other people use it, and the way I use it will probably appear just as surprising to them as their use of the program appears to me. This essay is simply my attempt to show how and why I use it for my research; I do not want to suggest that my way is the only, or even the best, way of using it. I am a 60-year-old academic teacher, and I have been using ConnectedText exclusively since August 2005 to keep my research notes and other bits of information. Before that I used Will Duquette's Notebook (from May 2003 to August 2007), [1] Wikit (from the end of 2002 until May 2003), [2] and, between 1985 and the end of 2002, InfoHandler, Ecco, InfoSelect, Packrat, Agenda, and Scraps for DOS, as well as MS Word (in its many incarnations). [3] I also experimented with many other so-called "PIMs," databases, and other programs that promised to be useful for keeping research notes, but never really committed to any of them.
Google Maps Mania
Introduction to R for Data Mining
This on-demand webinar shows how to become immediately productive in R; it covers the point-and-click data mining GUI rattle, command-line data mining, and big-data mining with RevoScaleR. Webinar presented Feb 14, 2013, by Joseph Rickert, Technical Marketing Manager, Revolution Analytics. In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. The goals are to:
- Provide an orientation to R's data mining resources
- Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data
Data scientists and analysts using other statistical software, as well as students who are new to data mining, should come away with a plan for getting started with R. Here is the webinar replay and presentation.
johnkeefe.net - Journalism technology + information design
Desktop Public Edition
Compare the desktop editions of the Lavastorm Analytics Engine. Please ensure that your PC meets the following minimum requirements and that you have administrative privileges on your local machine:
- RAM: 2 GB
- HDD: over 1 GB
- CPU: dual-core 2 GHz x86 or x64 processor (Intel/AMD)
- O/S: Microsoft Windows® XP SP3, Vista, or 7 (32- or 64-bit)
Lavastorm Analytics Library Packs – Enhancements for Your Lavastorm Software
The Lavastorm Analytics Library contains business controls (we call them nodes) with pre-built functions for Analytics, Data Acquisition, Correlation, Aggregation, Transformation, Reporting, Publishing, Logistics, Profiling and Patterns, Metadata and Structure, and Interfaces and Adapters. View all nodes and download the pack now.