
Dataset


Open Directory RDF Dump.

Twitter Census: Publishing the First of Many Datasets | blog.inf. As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We agree: some of the sexiest uses of data require processing not just all that is now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find. Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006.

The initial datasets are a part of our Twitter Census collection. The first dataset, a Token Count, counts the number of tokens (hashtags, smileys, and URLs) that have been tweeted. The data is available for free by month and for pay by hour.

Factual.

Public Datasets « Elastic Web Mining | Bixolabs. This is a page where we list public datasets that we’ve used or come across. Comments, corrections, and additional data sources are welcome! We use datasets for consulting projects, and when we need some juicy data for labs that are part of our big data training courses. There’s also some slightly out-of-date information from an ACM event that you can find here.

We’ve also started a separate list of commercial datasets. The information below is organized by the type of data – e.g. Some of this information comes from other lists we’ve found, including:

Data Files
Wikipedia – complete data dump for the site, in MediaWiki data files.

APIs
Note that for many of these, there are restrictions on the number of requests per day and on usage of the data.
Delicious – social network site for link sharing.

Databases
Freebase – open database of people, places and things.
FLOSSMole – has a database of open source projects.
ImageNet – an image database organized according to the WordNet hierarchy.

Web Pages

Machine Learning Repository: Netflix Prize Data Set.

WebBase Project.

Delicious data.

Dataset.