Hypertable: An Open Source, High Performance, Scalable Database

Apache Drill Overview Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as a cloud service called Google BigQuery. High Level Concept There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers). In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. The Apache Drill team uses Chronon for testing. The Apache Drill team also uses the YourKit Profiling tools in development.

Simple immutable objects - Andrew Arnott We’re all familiar with immutable collections now. But immutability is only as immutable as it is deep. And an immutable collection of mutable objects may not provide the depth you’re looking for. Suppose you would define the mutable version like this: public class Fruit { public string Color { get; set; }} An immutable version might be defined like this: public class ImmutableFruit { private readonly string color; public ImmutableFruit(string color) { this.color = color; } public string Color { get { return this.color; } } } Now that’s fine for very simple objects. We can make a couple more enhancements though. Another enhancement we can make is to make all constructors private. What does the resulting immutable object look like? Of course adding more properties will be common, and the code will increase a bit more with each one. So what can we do to simplify immutable programming? Consider this simple definition of a mutable type: class Fruit { string color; }
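The read-only field plus constructor pattern in the snippet above can be sketched in Python as well. This is only an illustration, not the article's C# code: it assumes a frozen dataclass is an acceptable stand-in for `readonly` fields, and the helper `try_mutate` and the use of `dataclasses.replace` to derive a modified copy are additions that do not appear in the original.

```python
from dataclasses import dataclass, FrozenInstanceError, replace


@dataclass(frozen=True)
class ImmutableFruit:
    """Python analogue of the C# ImmutableFruit: set once, never reassigned."""
    color: str


def try_mutate(fruit: ImmutableFruit) -> bool:
    """Hypothetical helper: report whether assigning to a field succeeds."""
    try:
        fruit.color = "blue"  # frozen dataclasses reject attribute assignment
    except FrozenInstanceError:
        return False
    return True


apple = ImmutableFruit(color="red")

# "Changing" a property means building a new object; the original is untouched.
green_apple = replace(apple, color="green")
```

As in the C# version, each extra property adds a field to the class, but here the constructor and read-only accessors are generated rather than written by hand.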

SimpleDB Amazon SimpleDB is a non-relational data store that combines flexibility with high availability and offloads database administration work from the customer. Developers simply store and retrieve their data items through web service requests, and Amazon SimpleDB does the rest. Freed from the strict requirements of relational databases, Amazon SimpleDB is optimized to provide high availability and flexibility with little or no administrative burden. Behind the scenes, Amazon SimpleDB automatically creates and manages multiple geographically distributed replicas of your data to enable high availability and data durability. The service charges you only for the resources actually consumed in storing your data and serving your requests. You can change your data model at any time, and the data is automatically indexed for you.

Apache Drill Speed is Key Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast. Coordination, query planning, optimization, scheduling, and execution are all distributed throughout nodes in a system to maximize parallelization. Liberate Nested Data Perform interactive analysis on all of your data, including nested and schema-less. Flexibility Strongly defined tiers and APIs for straightforward integration with a wide array of technologies. Disclaimer Apache Drill is an effort undergoing incubation at The Apache Software Foundation sponsored by the Apache Incubator PMC.

Software The MPI.NET source code is available as Open Source software under the Boost Software License. However, most users of MPI.NET will want to download either the runtime (for installation on cluster compute nodes) or the SDK (for developing programs using MPI.NET) binaries. On Windows, MPI.NET requires Microsoft's MPI implementation (MS-MPI), which can be installed in one of two ways: HPC Pack 2008 SDK or Microsoft Compute Cluster Pack SDK: includes MS-MPI and the various headers that one needs if writing MPI programs in C or C++ without MPI.NET. Recommended for most users, because it installs on Windows XP and Windows Vista. Version 1.0.0 This is the first major release of MPI.NET, including improved documentation, better support for MPI 1.1 and 2.0, and various bug fixes. Version 0.9.0 This release of MPI.NET provides critical bug fixes for transmission of serialized data and fixes the returned status from receive operations, along with some documentation improvements. Version 0.8.0

Hadoop – The Power of the Elephant - eBay Tech Blog In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop. Created in 2006 by Doug Cutting, who named it after his son’s stuffed yellow elephant, and based on Google’s 2004 MapReduce paper, Hadoop is an open source framework for fault tolerant, scalable, distributed computing on commodity hardware. MapReduce is a flexible programming model for processing large data sets: Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs. Map (key1, value1) -> list (key2, value2); Reduce (key2, list (value2)) -> list (key3, value3) Ecosystem Athena, our first large cluster, was put into use earlier this year. Infrastructure
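The Map and Reduce signatures quoted in the Hadoop snippet above can be made concrete with a small single-process word count. This is a sketch of the programming model only and does not use Hadoop: the names `map_words`, `reduce_counts` and `run_mapreduce`, and the sample documents, are all hypothetical, and the in-memory grouping stands in for the shuffle that a real cluster would perform across machines.

```python
from collections import defaultdict


def map_words(key, value):
    """Map (key1, value1) -> list (key2, value2): emit (word, 1) per word."""
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce (key2, list (value2)) -> list (key3, value3): sum the counts."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run mapper, group intermediate values by key, then run reducer."""
    groups = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in mapper(key1, value1):
            groups[key2].append(value2)  # stand-in for the shuffle phase
    output = []
    for key2 in sorted(groups):  # sorted only to make the result deterministic
        output.extend(reducer(key2, groups[key2]))
    return output


# Input records are (document id, text) pairs.
documents = [("doc1", "the elephant"), ("doc2", "the yellow elephant")]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Here `word_counts` holds one `(word, count)` pair per distinct word. Because the mapper and reducer only see one record or one key at a time, a framework is free to run many of them in parallel on different nodes.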

AWS | Amazon Redshift – Cloud Data Warehouse Solution It’s never been easier to get file data into Amazon Redshift, using AWS Lambda. You simply push files into a variety of locations on Amazon S3 and have them automatically loaded into your Amazon Redshift clusters. Read more in A Zero-Administration Amazon Redshift Database Loader (April 2015). Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift’s data warehouse architecture allows you to automate most of the common administrative tasks associated with provisioning, configuring and monitoring a cloud data warehouse. Security is built-in. Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from a hundred gigabytes to a petabyte or more. You pay only for the resources you provision. Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster.

Trinity General purpose graph computation faces the major challenge of random data access. Meanwhile, the RAM capacity of a single machine limits the scale of single-machine solutions for general purpose graph processing. Trinity is a general purpose distributed graph system over a memory cloud. A memory cloud is a globally addressable, in-memory key-value store over a cluster of machines. Through this distributed in-memory storage, Trinity provides fast random data access over a large data set. Features of Trinity: Trinity can run in both embedded (in-process) and distributed mode. Project Contacts Bin Shao Jeff Chen Wei-Ying Ma
