Big Data

MongoDB Performance Tuning: Everything You Need to Know

MongoDB is one of the most popular document databases.

It’s the M in the MEAN stack (MongoDB, Express, Angular, and Node.js). Unlike relational databases such as MySQL or PostgreSQL, MongoDB uses JSON-like documents for storing data. MongoDB is free, open-source, and incredibly performant. However, just as with any other database, certain issues can cost MongoDB its edge and drag it down. In this article, we’ll look at a few key metrics and what they mean for MongoDB performance: performance of locking in transactions, memory usage, connection handling, and issues with replica sets. Of course, MongoDB performance is a huge topic encompassing many areas of system activity. Now, let’s get into it.

Analyze locking performance

How does MongoDB handle locking? For example, if a client attempts to read a document that another client is updating, conflicts can occur. When a lock occurs, no other operation can read or modify the data until the operation that initiated the lock is finished.
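One way to watch for the lock contention described above is to look at how many operations are queued waiting for a lock. The sketch below assumes a document shaped like the `globalLock.currentQueue` section of MongoDB's `serverStatus` output; the sample values, the function name `lock_queue_summary`, and the `max_queued` threshold are hypothetical and chosen only for illustration.

```python
def lock_queue_summary(server_status, max_queued=10):
    """Summarize operations waiting on locks.

    `server_status` is assumed to look like the result of the
    `serverStatus` command. `max_queued` is an arbitrary, hypothetical
    threshold, not a MongoDB recommendation.
    """
    queue = server_status["globalLock"]["currentQueue"]
    return {
        "queued_readers": queue["readers"],
        "queued_writers": queue["writers"],
        "queued_total": queue["total"],
        # Flag the server as contended when the queue exceeds the threshold.
        "contended": queue["total"] > max_queued,
    }


# Hypothetical sample data standing in for a real serverStatus document.
sample_status = {
    "globalLock": {"currentQueue": {"total": 14, "readers": 9, "writers": 5}}
}

summary = lock_queue_summary(sample_status)
print(summary)
```

With a live deployment, the same function could be fed the document returned by a driver call such as pymongo's `client.admin.command("serverStatus")`; a queue that stays high over time is a hint that operations are waiting on locks.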

Observations About Streaming Data Analytics for Science

I recently had the pleasure of attending two excellent workshops on the topic of streaming data analytics and science.

A goal of the workshops was to understand the state of the art of “big data” streaming applications in scientific research and, if possible, identify common themes and challenges. Called Stream2015 and Stream2016, these meetings were organized by Geoffrey Fox, Lavanya Ramakrishnan and Shantenu Jha. The talks were given by an excellent collection of scientists from universities and the national labs, along with professional software engineers who are building cloud-scale streaming data tools for the Internet industry. First, it is important to understand what we mean by streaming data analytics and why it has become so important. Most scientific data analysis involves “data at rest”: data that was generated by a physical experiment or simulation and saved in files in some storage system.

This article has two parts.

Scheduling in Hadoop

Hadoop is a general-purpose system that enables high-performance processing of data over a set of distributed nodes.

But within this definition is the fact that Hadoop is a multi-tasking system that can process multiple data sets for multiple jobs for multiple users at the same time. This multi-processing capability gives Hadoop the opportunity to map jobs to resources in a way that optimizes their use. Up until 2008, Hadoop supported a single scheduler that was intermixed with the JobTracker logic.

Although this implementation was perfect for the traditional batch jobs of Hadoop (such as log mining and Web indexing), it was inflexible and could not be tailored. Further, Hadoop operated in a batch mode, where jobs were submitted to a queue, and the Hadoop infrastructure simply executed them in the order of receipt. Luckily, a bug report (HADOOP-3412) was submitted for an implementation of a scheduler that was independent of the JobTracker.
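The batch behavior described above, where jobs enter a queue and run in the order of receipt, can be sketched in a few lines of Python. This is only an illustration of first-in, first-out ordering, not Hadoop's actual scheduler code; the class `FifoScheduler` and the job names are hypothetical.

```python
from collections import deque


class FifoScheduler:
    """Toy first-in, first-out job queue (hypothetical, not Hadoop's API)."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job_name):
        # New jobs always go to the back of the queue.
        self._queue.append(job_name)

    def next_job(self):
        # The oldest submitted job runs next; None means the queue is empty.
        return self._queue.popleft() if self._queue else None


scheduler = FifoScheduler()
for name in ("log-mining", "web-indexing", "ad-hoc-query"):
    scheduler.submit(name)

execution_order = []
job = scheduler.next_job()
while job is not None:
    execution_order.append(job)
    job = scheduler.next_job()

print(execution_order)
```

Because the ordering rule lives in one small class, it could be swapped for another policy without touching the code that submits jobs, which is the kind of separation a scheduler independent of the JobTracker makes possible.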

Top 10 data mining algorithms in plain English

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining. What are we waiting for? Let’s get started!

Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I’ve incorporated into the post. Update 28-May-2015: Thanks to Dan Steinberg (yes, the CART expert!)

What does it do? Wait, what’s a classifier? What’s an example of this? Now: Given these attributes, we want to predict whether the patient will get cancer. And here’s the deal: Using a set of patient attributes and the patient’s corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.
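To make the tree-building idea a little more concrete: C4.5 chooses which attribute to split on using the gain ratio, that is, the information gain of a split divided by the entropy of the split itself. The sketch below computes that score for a tiny, made-up patient table. The rows, attribute names, and function names are hypothetical, and the code covers only the attribute-scoring step for categorical attributes, not the full algorithm (no pruning, no continuous attributes, no missing values).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Gain ratio of splitting `rows` on a categorical `attribute`."""
    labels = [row[target] for row in rows]
    # Group the class labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: each row is one patient.
patients = [
    {"smoker": "yes", "family_history": "yes", "cancer": "yes"},
    {"smoker": "yes", "family_history": "no", "cancer": "yes"},
    {"smoker": "no", "family_history": "yes", "cancer": "no"},
    {"smoker": "no", "family_history": "no", "cancer": "no"},
]

scores = {a: gain_ratio(patients, a, "cancer") for a in ("smoker", "family_history")}
best_attribute = max(scores, key=scores.get)
print(scores, best_attribute)
```

In this toy table the `smoker` attribute separates the two classes perfectly, so it gets the highest score and would become the first split in the tree.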

Cool, so what’s a decision tree? The bottom line is: Is this supervised or unsupervised?

Statistique décisionnelle, Data Mining, Scoring et CRM

Open Source Scalable SQL Database Cluster

Introduction to Pig Latin - Programming Pig