Data


First 5,000 Tags Released to the Linked Data Cloud. For more than 150 years, The New York Times has meticulously indexed its archives.

Through this process, we have developed an enormous collection of subject headings, ranging from “Absinthe”[1] to “Zoos”[2]. Unfortunately, our list of subject headings is an island. For example, even though we can show you every article written about “Colbert, Stephen [3],” our databases can’t tell you that he was born on May 13, 1964, or that he lost the 2008 Grammy for best spoken word album to Al Gore. To do this we would need to map our subject headings onto other Web databases such as Freebase and DBPedia. So that’s exactly what we did. Over the last several months we have manually mapped more than 5,000 person name subject headings onto Freebase and DBPedia.

And because we want to make sure that this data gets used as widely and freely as possible, we are very pleased to announce that all data records released at will be published under a Creative Commons 3.0 Attribution License. [1] [2] [3]

Numbers Everyone Should Know. When you’re designing a performance-sensitive computer system, it is important to have an intuition for the relative costs of different operations. How much does a network I/O cost, compared to a disk I/O, a load from DRAM, or an L2 cache hit?
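A minimal sketch of this kind of comparison, in Python. The latency values are order-of-magnitude figures of the sort commonly quoted for hardware of that era; they are assumptions for illustration, not measurements, and the names `LATENCY_NS`, `relative_cost`, and `total_time_ms` are hypothetical.

```python
# Approximate latencies in nanoseconds. These are assumed,
# order-of-magnitude values; substitute measurements from your own system.
LATENCY_NS = {
    "l2_cache_hit": 7,
    "dram_reference": 100,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
}


def relative_cost(op, baseline="l2_cache_hit"):
    """Return how many baseline operations take as long as one `op`."""
    return LATENCY_NS[op] / LATENCY_NS[baseline]


def total_time_ms(op_counts):
    """Estimate total serial time in milliseconds for {operation: count}."""
    total_ns = sum(LATENCY_NS[op] * count for op, count in op_counts.items())
    return total_ns / 1_000_000


if __name__ == "__main__":
    for op in LATENCY_NS:
        print(f"{op:>24}: {relative_cost(op):>12,.0f}x an L2 cache hit")

    # Two hypothetical designs for serving one request:
    # A does 30 random disk seeks; B does one network round trip
    # plus 2,000 DRAM references.
    print("design A (ms):", total_time_ms({"disk_seek": 30}))
    print("design B (ms):", total_time_ms(
        {"datacenter_round_trip": 1, "dram_reference": 2000}))
```

Even with rough inputs, an estimate like this is usually enough to tell which of two designs is worth building first.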

How much computation does it make sense to trade for a reduction in I/O? What is the relative cost of random vs. sequential I/O? For a given workload, what is the bottleneck resource? When designing a system, you rarely have enough time to completely build two alternative designs to compare their performance. Back-of-the-envelope analysis. Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005.

Datasets Archive. If you have an interesting dataset, or collection of data from a book, please consider submitting the data.

To submit a dataset, please see the submissions guidelines, via Some of the entries are shar archives. If you don't know how to deal with a shar archive, send the message for instructions. The datasets archive currently contains: NIST Statistical Reference Datasets (StRD): a pointer to a NIST site that contains reference datasets for the objective evaluation of the computational accuracy of statistical software. Agresti.

Clean your data. September 08, 2009, 8:14 PM. Think of it this way: when a procedural drama enacts a climactic scene in which someone announces, "the computer found a match! We've got our 'perp' ...", I have trouble doing anything but giggling. While there must be a healthier response, I haven't figured it out yet. No individual programmer can solve this. What you can do is to know the facts. The data which your programs manage and process are inaccurate. Take advantage of the techniques at hand: use Assert()s and similar.

Computational information design.

175+ Data and Information Visualization Examples and Resources. Since taking a class that discussed Edward Tufte’s work, I’ve been fascinated by turning information into visual data.


His site contains so many examples that you could easily spend hours on it.

Rise of the Data Scientist. As we've all read by now, Google's chief economist Hal Varian commented in January that the next sexy job in the next 10 years would be statisticians.

Obviously, I whole-heartedly agree. Heck, I'd go a step further and say they're sexy now - mentally and physically. However, if you went on to read the rest of Varian's interview, you'd know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts.

PhotoRec. Latest stable version: 7.0 (April 18, 2015). PhotoRec, Digital Picture and File Recovery. PhotoRec is file data recovery software designed to recover lost files, including video, documents, and archives, from hard disks and CD-ROMs, and lost pictures (thus the Photo Recovery name) from digital camera memory.

PhotoRec ignores the file system and goes after the underlying data, so it will still work even if your media's file system has been severely damaged or reformatted.

Purely Applicative XML Cursor. The cursor model is a relatively new approach to XML processing.

In this model, a cursor acts like a lens that focuses on one node. You can freely move the cursor back and forth in an XML document, and edit the node it indicates. This model can be easily implemented in an imperative language like C or Java, by using a pointer to a subtree in the XML tree as the cursor. We propose a purely functional data structure named “Slit” to realize a cursor on a tree efficiently in an applicative manner.
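To make the idea of a purely functional cursor concrete, here is a minimal sketch in Python using the well-known zipper technique. It is not the “Slit” structure from the abstract, whose details are not given here; the names `Node`, `Crumb`, and `Cursor` are hypothetical, and the tree is a simplified stand-in for an XML document (tags only, no attributes or text).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """An immutable tree node: a tag plus a tuple of child nodes."""
    tag: str
    children: Tuple["Node", ...] = ()


@dataclass(frozen=True)
class Crumb:
    """What is needed to rebuild a parent around the focused child."""
    tag: str
    left: Tuple[Node, ...]   # siblings before the focus
    right: Tuple[Node, ...]  # siblings after the focus


@dataclass(frozen=True)
class Cursor:
    """A focused node plus the path of crumbs back to the root."""
    focus: Node
    path: Tuple[Crumb, ...] = ()

    def down(self, i=0):
        """Move the focus to the i-th child."""
        kids = self.focus.children
        crumb = Crumb(self.focus.tag, kids[:i], kids[i + 1:])
        return Cursor(kids[i], self.path + (crumb,))

    def up(self):
        """Move to the parent, rebuilding it around the current focus."""
        crumb = self.path[-1]
        parent = Node(crumb.tag, crumb.left + (self.focus,) + crumb.right)
        return Cursor(parent, self.path[:-1])

    def right(self):
        """Move the focus to the next sibling."""
        crumb = self.path[-1]
        moved = Crumb(crumb.tag, crumb.left + (self.focus,), crumb.right[1:])
        return Cursor(crumb.right[0], self.path[:-1] + (moved,))

    def replace(self, node):
        """Return a new cursor with the focused node replaced."""
        return Cursor(node, self.path)

    def root(self):
        """Walk back up and return the (possibly edited) whole tree."""
        cursor = self
        while cursor.path:
            cursor = cursor.up()
        return cursor.focus


doc = Node("doc", (Node("a"), Node("b")))
edited = Cursor(doc).down(0).right().replace(Node("c")).root()
```

Every operation returns a new cursor and leaves the original tree untouched, which is what makes the approach applicative. The tuple concatenations here copy the path on each move; an efficient implementation would use a persistent linked list instead.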

Syncing vs. saving, and the case for a home storage cloud. This article was inspired by a post from Steve Foskett on Dell's The Future of Storage site.

In his post, Foskett tries to make the case for the "Home SAN." While I'm not convinced that the answer to all my home storage problems is a "SAN," as Foskett proposes, I do agree that something has to be done. Check out my proposed solution, and sound off in The Server Room.