Architecting the future of big data

Data Platform Not only open-source, but built in the open. HDP demonstrates our commitment to growing Hadoop and its sub-projects with the community and completely in the open. HDP is assembled entirely from projects built through the Apache Software Foundation. How is this different from open-source, and why is it so important?

C10k problem The C10k problem is the problem of optimising network sockets to handle a large number of clients at the same time.[1] The name C10k is a numeronym for concurrently handling ten thousand connections.[2] Note that concurrent connections are not the same as requests per second, though they are similar: handling many requests per second requires high throughput (processing them quickly), while a high number of concurrent connections requires efficient scheduling of connections. In other words, handling many requests per second is concerned with the speed of handling requests, whereas a system capable of handling a high number of concurrent connections does not necessarily have to be a fast system, only one where each request will deterministically return a response within a (not necessarily fixed) finite amount of time. The problem of socket server optimisation has been studied because a number of factors must be considered to allow a web server to support many clients.
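The distinction between concurrent connections and request speed can be illustrated with a small sketch. It is a simulation only, not a real socket server: each "connection" is an asyncio task that stays open while idle, and the names `handle_client`, `serve_many`, and `run_demo` are hypothetical. The point is that a single-threaded event loop can keep many slow connections open at once, since the work lies in scheduling them rather than in processing any one of them quickly.

```python
import asyncio


async def handle_client(client_id: int, hold_seconds: float) -> int:
    # Simulated connection: it stays "open" while waiting, without
    # blocking the event loop or occupying a thread of its own.
    await asyncio.sleep(hold_seconds)
    return client_id


async def serve_many(n_clients: int, hold_seconds: float) -> int:
    # Start every simulated connection before any of them finishes,
    # so all of them are open concurrently.
    tasks = [
        asyncio.create_task(handle_client(i, hold_seconds))
        for i in range(n_clients)
    ]
    results = await asyncio.gather(*tasks)
    # Each connection returns a response within a finite time.
    return len(results)


def run_demo(n_clients: int = 10_000, hold_seconds: float = 0.05) -> int:
    # Returns the number of simulated connections that completed.
    return asyncio.run(serve_many(n_clients, hold_seconds))
```

Calling `run_demo()` holds ten thousand simulated connections open at the same time on one thread. A real server would also have to deal with file-descriptor limits and kernel socket buffers, which this sketch does not model.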

Weka 3 - Data Mining Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

About Kaggle and Crowdsourcing Data Modeling Kaggle is the world's largest community of data scientists. They compete with each other to solve complex data science problems, and the top competitors are invited to work on the most interesting and sensitive business problems from some of the world’s biggest companies through Masters competitions. Kaggle provides cutting-edge data science results to companies of all sizes. We have a proven track record of solving real-world problems across a diverse array of industries including life sciences, financial services, energy, information technology, and retail.

Sandbox Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes! Learn Hadoop: Sandbox comes with a dozen hands-on tutorials that will guide you through the basics of Hadoop; tutorials built on the experience gained from training thousands of people in our Hortonworks University Training classes. Build a Proof of Concept: The Sandbox includes the Hortonworks Data Platform in an easy-to-use form.

Latest As I mentioned in my previous post, our collaboration with the Sabeti Lab is aimed at creating new visual exploration tools to help researchers, doctors, and clinicians discover patterns and associations in large health and epidemiological datasets. These tools will be the first step in a hypothesis-generation process, combining intuition from expert users with visualization techniques and automated algorithms, allowing users to quickly test hypotheses that are “suggested” by the data itself. Researchers and doctors have a deep familiarity with their data and often can tell immediately whether a new pattern is potentially interesting or simply the result of noise. Visualization techniques will help articulate their knowledge to a wider audience.

This is What a Tweet Looks Like Think a tweet is just 140 characters of text? Think again. Developers building tools on top of the Twitter platform know that tweets contain far more information than just whatever brief, passing thought you felt the urge to share with your friends via the microblogging network. A tweet is filled with metadata: information about when it was sent, by whom, using what Twitter application and so on. Now, thanks to Raffi Krikorian, a developer on Twitter's API/Platform team, you can see what a tweet looks like, in all its data-rich detail. A weekend post on Krikorian's blog includes an embedded document that shows what a mapped-out tweet looks like.

The DataSift Platform Social data is noisy. Whether you’re trying to analyze trends within an industry, or mentions of your products or brands, you need a platform that can filter out the noise and allow you to focus on the data that’s most relevant to you. This is especially important when you are paying for the social data you receive. At the heart of the DataSift platform is a high-performance filtering engine with which you can find the exact content and conversations that are relevant to your business.

16 Top Big Data Analytics Platforms Teradata delivers unified big data architecture. Analytical DBMS: Teradata, Teradata Aster. In-memory DBMS: Although not an in-memory DBMS, Teradata Intelligent Memory monitors queries and automatically moves the most-requested data to the fastest storage tiers available, with options including RAM, flash, SSD, and various speeds of conventional spinning discs. Stream-analysis option: None. Hadoop distribution: Resells and supports the Hortonworks Data Platform. Hardware/software systems: Teradata and Teradata Aster are integrated software/hardware systems.

Infographic Of The Day: Bloomberg And Frog Turn Raw Data Into Branding Bloomberg is a sprawling, multibillion-dollar enterprise, which creates a distinct problem if you’re trying to explain what the company actually does. They do lots of things, ranging from law research to sports research for team managers to, of course, stock-market data crunching. "Many people have a single association with Bloomberg, as a wire service or a market-data provider," says Jen Walsh, Bloomberg’s head of digital marketing. "We wanted our website to shine a light on other aspects of the business."

REST API Rate Limiting in v1.1 Per User or Per Application Rate limiting in version 1.1 of the API is primarily considered on a per-user basis or, more accurately, per access token in your control. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window per leveraged access token.
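The per-token accounting described above can be sketched as a small client-side counter. This is an illustration only, not Twitter's implementation: the class name `PerTokenWindowLimiter`, the fixed-window scheme, and the 900-second window length are assumptions (the snippet gives the 15-request limit but not the window length).

```python
from typing import Dict, Tuple


class PerTokenWindowLimiter:
    """Hypothetical fixed-window counter keyed by access token."""

    def __init__(self, limit: int = 15, window_seconds: int = 900) -> None:
        # limit comes from the example in the text; window_seconds is
        # an assumed value chosen for illustration.
        self.limit = limit
        self.window_seconds = window_seconds
        # token -> (window index, requests used in that window)
        self._usage: Dict[str, Tuple[int, int]] = {}

    def allow(self, token: str, now: float) -> bool:
        # Identify which window the timestamp falls into.
        window = int(now // self.window_seconds)
        seen_window, used = self._usage.get(token, (window, 0))
        if seen_window != window:
            # A new window has started, so this token's count resets.
            used = 0
        if used >= self.limit:
            return False
        self._usage[token] = (window, used + 1)
        return True
```

Because each token has its own counter, exhausting the limit for one access token leaves the allowance of every other token untouched, which is the behaviour the paragraph describes. Passing `now` explicitly (for example from `time.time()`) keeps the sketch easy to test.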

The only vendor that uses 100% open source Apache Hadoop without its own (non-open) modifications. Hortonworks is the first vendor to use Apache HCatalog functionality for metadata services. In addition, its Stinger initiative substantially optimizes the Hive project. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks developed and committed enhancements into the core trunk that make Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server and Windows Azure. by sergeykucherov Jul 15
