
Welcome to Apache Avro!

Benchmarking - thrift-protobuf-compare - Project Hosting on Google Code. (The wiki and discussions have since moved elsewhere.) Started with a few blog posts and with the help of many contributors, this project now benchmarks much more than just protobuf and thrift. Thanks to all who looked at the code, contributed, made suggestions, and pointed out bugs. Three major contributions are from cowtowncoder, who fixed the stax code; Chris Pettitt, who added the json code; and David Bernard, for the xstream and java externalizable code. Benchmarks can be very misleading. Setup: the following measurements were performed with revision r128 on Windows 7 64-bit, using Sun's 32-bit JVM (1.6.0_15 JRE) on an Intel Core i7 920 CPU. Omitted from the first three charts: json/google-gson and scala. Charts: Total Time (creating an object, serializing, and deserializing), Serialization Time (serializing with a new object each time, object creation included), Deserialization Time, and Serialization Size.
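The measurement loop behind numbers like these is easy to reproduce. Below is a minimal sketch of the total-time measurement only, not the project's actual harness: it uses built-in Java serialization as a stand-in for the serializers the project compares, and the payload class and iteration counts are illustrative assumptions.

    import java.io.*;

    public class SerializationBench {
        // Illustrative payload; the real benchmark uses a richer object graph.
        static class Payload implements Serializable {
            String name = "image.png"; int width = 1024; int height = 768;
        }

        static byte[] serialize(Object o) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(o); }
            return bos.toByteArray();
        }

        static Object deserialize(byte[] b) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
                return ois.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            int iterations = 100_000; // assumption; tune until numbers stabilize
            // Warm up the JIT before timing, as any JVM micro-benchmark should.
            for (int i = 0; i < 10_000; i++) deserialize(serialize(new Payload()));

            long t0 = System.nanoTime();
            byte[] last = null;
            for (int i = 0; i < iterations; i++) {
                last = serialize(new Payload()); // object creation + serialization
                deserialize(last);               // deserialization
            }
            long t1 = System.nanoTime();
            System.out.printf("total %.1f ns/op, size %d bytes%n",
                    (t1 - t0) / (double) iterations, last.length);
        }
    }

One loop yields both per-operation cost and serialized size; the caveat "benchmarks can be very misleading" is largely about JIT warm-up and GC noise, which is why repeated, warmed runs matter.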

Database Access with Hadoop. Editor's note (added Nov. 9, 2013): Valuable data in an organization is often stored in relational database systems. To access that data, you could use external APIs as detailed in the blog post below, or you could use Apache Sqoop, an open source tool (packaged inside CDH) that allows users to import data from a relational database into Apache Hadoop for further processing. Sqoop can also export those results back to the database for consumption by other clients. Apache Hadoop's strength is that it enables ad-hoc analysis of unstructured or semi-structured data. This blog post explains how DBInputFormat works and provides an example of using DBInputFormat to import data into HDFS. DBInputFormat and JDBC: first we'll cover how DBInputFormat interacts with databases. Reading tables with DBInputFormat: DBInputFormat is an InputFormat class that allows you to read data from a database over JDBC. Configuring the job: to use DBInputFormat, you'll need to configure your job before retrieving the data.
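As a taste of the configuration the post walks through, here is a hedged sketch of wiring up DBInputFormat with Hadoop's mapreduce API; the JDBC driver, connection string, table, and column names are placeholder assumptions, not values from the post.

    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class DBImport {
        // A record class mirroring one table row; table and columns are assumptions.
        static class EmployeeRecord implements Writable, DBWritable {
            long id; String name;
            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong(1); name = rs.getString(2);
            }
            public void write(PreparedStatement st) throws SQLException {
                st.setLong(1, id); st.setString(2, name);
            }
            public void readFields(java.io.DataInput in) throws IOException {
                id = in.readLong(); name = in.readUTF();
            }
            public void write(java.io.DataOutput out) throws IOException {
                out.writeLong(id); out.writeUTF(name);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Driver class, URL, and credentials are placeholders.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://localhost/mydb", "user", "password");
            Job job = Job.getInstance(conf, "db-import");
            job.setInputFormatClass(DBInputFormat.class);
            // Read id and name from the employees table, splitting by order on id.
            DBInputFormat.setInput(job, EmployeeRecord.class,
                    "employees", null /* conditions */, "id" /* orderBy */, "id", "name");
            // ...set mapper, output format, and output path as in any other job...
        }
    }

Each map task then receives EmployeeRecord values deserialized from a slice of the query's result set, which is how rows end up in HDFS.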

Jason’s .plan - If you like Rabbits and Warrens, check out RabbitMQ in Action in the sidebar. The goal was simple enough: decouple a particular type of analysis out-of-band from mainstream e-mail processing. We started down the MySQL road: put the things to be digested into a table, consume them in another daemon, bada bing bada boom. But pretty soon, complex ugliness crept into the design phase. You want to have multiple daemons servicing the queue? You get the idea: what was supposed to be simple (decoupling something) was spinning its own Gordian knot. A short search later, and we entered the world of message queueing. Open up your queue: cutting to the chase, over the last four years there has been no shortage of open-source message-queueing servers. Apache ActiveMQ gets the most press, but it appears to have trouble not losing messages. ZeroMQ and RabbitMQ both support an open messaging protocol called AMQP. That leaves us with the carrot muncher. Also, RabbitMQ supports persistence. That’s it.
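To make the decoupling concrete, here is a minimal sketch using the RabbitMQ Java client; the queue name, host, and payload are assumptions. Declaring the queue durable and publishing with a persistent delivery mode is what the persistence remark above buys you.

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.MessageProperties;

    public class EnqueueForAnalysis {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumption: broker on localhost
            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();
            // Durable queue + persistent messages survive a broker restart,
            // which is the persistence point made above.
            channel.queueDeclare("analysis", true /* durable */, false, false, null);
            channel.basicPublish("", "analysis",
                    MessageProperties.PERSISTENT_TEXT_PLAIN,
                    "message-id-1234".getBytes("UTF-8"));
            channel.close();
            conn.close();
        }
    }

Any number of daemons can then consume from the "analysis" queue with channel.basicConsume(...), with no hand-rolled MySQL polling and no coordination logic of your own.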

Introducing Cascalog: a Clojure-based query language for Hadoop. I'm very excited to be releasing Cascalog as open-source today. Cascalog is a Clojure-based query language for Hadoop inspired by Datalog. Highlights: Simple - functions, filters, and aggregators all use the same syntax, and joins are implicit and natural. Expressive - logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort. Interactive - run queries from the Clojure REPL. Scalable - Cascalog queries run as a series of MapReduce jobs. Query anything - query HDFS data, database data, and/or local data by making use of Cascading's "Tap" abstraction. Careful handling of null values - null values can make life difficult. OK, let's jump into Cascalog and see what it's all about! Basic queries: first, let's start the REPL and load the playground:

    lein repl
    user=> (use 'cascalog.playground) (bootstrap)

This will import everything we need to run the examples.

    user=> (?<- (stdout) [?person] (age ?person 25))

This query can be read as "Find all ?person for whom ?person has age 25." OK, let's try something more involved:

    user=> (?<- (stdout) [?person] (age ?person ?age) (< ?age 30))

Introducing Cloudera Desktop » Cloudera Hadoop & Big Data Blog. Today at Hadoop World NYC, we’re announcing the availability of Cloudera Desktop, a unified and extensible graphical user interface for Hadoop. The product is free to download and can be used with either internal clusters or clusters running on public clouds. At Cloudera, we’re focused on making Hadoop easy to install, configure, manage, and use for all organizations. While many utilities exist for developers who work with Hadoop, Cloudera Desktop targets beginning developers and non-developers who’d like to get value from the data stored in their organization’s Hadoop cluster. By working within a web browser, users avoid the tedious client installation and upgrade cycle, and system administrators avoid custom firewall configurations. We’ve worked closely with the MooTools community to create a desktop environment inside of a web browser that should be familiar for most users to navigate. Initial applications for Cloudera Desktop include:

nathanmarz/storm

mixi Engineers’ Blog - This is Nanao from mixi. It has already been about a week, but on 2/22-2/23 we held Photo Hack Day Japan, a hackathon run jointly with Aviary (pronounced "ay-vee-air-ee") from the US. Once again, my thanks to all the participants and to the sponsors below. On the day there were 23 project presentations in total; the judging selected overall first through third place, two special prizes, and an API prize from each sponsor. Since I also took part as a judge, let me introduce the winning entries. 1st place: Back to the Future (300,000 yen prize). Members: Theeraphol Wattanavekin, Rapee Suveeranont, Yoonjo Shin, Thiti Luang. APIs used: Amazon / gettyimages / Leap Motion. URL: Description: Back to the Future is a cool web app that lets you travel through time. Impressions: the demo slot was limited to four minutes, but the demo was superb and the original concept shone. 2nd place: Before The Filter (200,000 yen prize). Members: Benjamin Watanabe, Antony Tran. API used: Aviary. URL: Description: plenty of wonderful image-editing tools are available, but many users don't know what a great photo actually needs. Impressions: I had assumed photo hacks would be technical things like fancy filters, face detection, or image compositing, but this team's focus on photo-shooting technique was unique. 3rd place: VOCA Getty (100,000 yen prize). Members: Atsushi Onoda, Hiroshi Kanamura, Shinichi Segawa, Yasushi Takemoto. APIs used: gettyimages / imagga. Description: VOCA Getty is a photo-based vocabulary flash-card app. Workflow: Special prize: Na・Gu・Ri・A・I. Overview:

MapReduce Patterns, Algorithms, and Use Cases « Highly Scalable Blog. In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting; this framework is depicted in the article's figure, "MapReduce Framework." Counting and Summing. Problem statement: there are a number of documents, where each document is a set of terms. Solution: let's start with something really simple. The obvious disadvantage of this approach is the high number of dummy counters emitted by the Mapper. In order to accumulate counters not just for one document but for all documents processed by one Mapper node, it is possible to leverage Combiners. Applications: log analysis, data querying. Collating. Problem statement: there is a set of items and some function of one item. The solution is straightforward.
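As a reference point for the Counting and Summing pattern described above, here is a hedged sketch in the standard Hadoop API; the class names are illustrative. Registering the same reducer as the Combiner is exactly the trick the article mentions for cutting down the dummy counters.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Counting and Summing: the Mapper emits a dummy counter of 1 per term,
    // and the Reducer (also usable as a Combiner) sums the counters.
    public class TermCount {
        public static class TermMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text term = new Text();
            @Override
            protected void map(Object key, Text doc, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tok = new StringTokenizer(doc.toString());
                while (tok.hasMoreTokens()) {
                    term.set(tok.nextToken());
                    ctx.write(term, ONE); // one dummy counter per occurrence
                }
            }
        }

        // Set as both combiner and reducer: running it on each Mapper node
        // pre-aggregates counts and cuts the volume of emitted counters.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text term, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(term, new IntWritable(sum));
            }
        }
    }

In the driver, job.setCombinerClass(SumReducer.class) alongside job.setReducerClass(SumReducer.class) gives the Combiner variant the article recommends.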

HBase Leads Discuss Hadoop, BigTable and Distributed Databases. Google's recent introduction of Google App Engine, and its inclusion of access to BigTable, has created renewed interest in alternative database technologies. A few weeks back InfoQ interviewed Doug Judd, a founder of the Hypertable project, which is inspired by Google's BigTable database. This week InfoQ has the pleasure of presenting an interview with HBase leads Jim Kellerman, Michael Stack, and Bryan Duxbury. HBase is an open-source, distributed, column-oriented store, modeled after the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. The HBase project is for those whose yearly Oracle license fees approach the GNP of a small country, or whose MySQL install is starting to buckle because tables have a few BLOB columns and the row count is heading north of a couple of million rows. Among the leads' answers: "We only see upside (smile)."
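To make "column-oriented store" concrete, here is a small sketch against the HTable-era HBase Java client (the API current around the time of this interview); the table name and column family are assumptions.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHello {
        public static void main(String[] args) throws Exception {
            // Table "pages" with column family "content" is an assumption.
            HTable table = new HTable(HBaseConfiguration.create(), "pages");
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            // Cells are addressed by (row, family, qualifier) rather than by
            // fixed relational columns -- the BigTable model described above.
            put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            Result row = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));
            table.close();
        }
    }

Because qualifiers are created on write, wide and sparse rows cost nothing to model, which is the property that lets HBase absorb the BLOB-heavy, many-million-row tables the leads describe.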

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js. Series introduction: Apache Pig is a dataflow-oriented scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce. But Pig is more than that. Working code for this post, as well as setup instructions for the tools we use, is available online, and you can download the Enron emails we use in the example in Avro format. Part two of this series, on HBase and JRuby, is available as well. Introduction: in this post we'll use Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. Pig and Avro: Pig's Avro support is solid as of Pig 0.10.0. MongoDB's Java Driver: to connect to MongoDB, we'll need the MongoDB Java Driver. Mongo-Hadoop: the mongo-hadoop project provides integration between MongoDB and Hadoop.
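Before wiring Pig to MongoDB through mongo-hadoop, it helps to see the MongoDB Java Driver on its own. Below is a minimal sketch with the legacy driver API; the database, collection, and document fields are assumptions modeled loosely on the Enron example, not the post's code.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class MongoSink {
        public static void main(String[] args) throws Exception {
            // Database and collection names are assumptions for the Enron example.
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("enron");
            DBCollection emails = db.getCollection("emails");

            // In the pipeline above, mongo-hadoop writes records like this one
            // out of Pig; here we insert a single document by hand to illustrate.
            DBObject doc = new BasicDBObject("message_id", "<1234@example.com>")
                    .append("from", "kenneth.lay@enron.com")
                    .append("subject", "quarterly numbers");
            emails.insert(doc);

            // A Node.js front end would then serve lookups like this one.
            System.out.println(emails.findOne(new BasicDBObject("message_id",
                    "<1234@example.com>")));
            mongo.close();
        }
    }

The appeal of the pipeline is that the same schemaless documents flow straight from Avro records through Pig into a collection Node.js can query, with no relational schema in between.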

Open Source - LightCloud - Distributed and persistent key-value database. Features: built on Tokyo Tyrant, one of the fastest key-value databases [benchmark]. But that's not all: Redis is also supported as an alternative to Tokyo Tyrant. Check the benchmarks and more details in "LightCloud adds support for Redis." Stability: it's production-ready, and Plurk.com is using it to store millions of keys on only two servers that run 3 lookup nodes and 6 storage nodes (these servers also run MySQL). How does LightCloud differ from memcached and MySQL? memcached is used for caching, meaning that items saved to memcached are deleted after some time. MySQL and other relational databases are not efficient at storing key-value pairs; a key-value database like LightCloud is. The bottom line is that LightCloud is not a replacement for memcached or MySQL: it's a complement, for situations where your data does not fit well into the relational model. How does LightCloud differ from Redis and memcachedb? Benchmark program

Using Gearman For Distributed Alerts - BackType Technology. At BackType we manage over 30 virtual machines (EC2). We've leveraged the latest technology in cloud computing, storage, and data processing to index over one billion online reactions (comment-like data) and organize those conversations to help users find the latest news and opinions. When you run dozens of machines, you're inevitably going to want some kind of monitoring in place. There are plenty of existing tools available, such as monit, god, daemontools, etc., for lower-level systems management. However, as we rapidly deploy new technology and features, we've needed more customizable monitoring. Gearman is a system to farm out work to other machines: dispatching function calls to machines that are better suited to do the work, doing work in parallel, load-balancing lots of function calls, or calling functions between languages. We use Gearman to farm out millions of jobs across multiple machines every single day. This is how we did it:
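The post's code survives here only as its first line (import sys, cjson) and a comment about replacing server IPs. Below is a hedged reconstruction of a worker of that shape, assuming the python-gearman client; the task name and alert payload format are hypothetical, not BackType's actual code.

    import sys, cjson
    import gearman

    # Replace server IP(s) with your Gearman job server(s).
    GEARMAN_SERVERS = ['127.0.0.1:4730']

    def send_alert(worker, job):
        # Jobs arrive as JSON payloads describing what to alert on.
        alert = cjson.decode(job.data)
        print >> sys.stderr, 'ALERT from %s: %s' % (alert['host'], alert['message'])
        return job.data  # hand the payload back as the job result

    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task('alert', send_alert)
    worker.work()  # block forever, pulling alert jobs off the queue

Any monitoring daemon can then submit an 'alert' job to the same servers, and Gearman takes care of routing it to whichever worker machine is free.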

Developing High Performance Asynchronous IO Applications. Published on ONLamp.com by Stas Bekman, 10/12/2006. Creating financial friction for spammers: why do spammers send billions of email messages advertising ridiculous products that most of us would never in our lives consider buying? What makes spamming profitable is huge volume. Ken Simpson and Will Whittaker, formerly developers at ActiveState, founded MailChannels to solve the spam problem. By observing spammer behavior, the MailChannels team realized that spammers are impatient. Nowadays, the majority of spam is sent from botnets: vast, distributed networks of compromised Windows PCs. While botnets are vast in size and availability, the number of machines and the sending capacity of any particular botnet is limited. By slowing down email from suspicious sources (often botnets), the MailChannels team figured they could probably make the spammers give up and move on. The First Generation.
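The real system used asynchronous IO precisely so that thousands of deliberately slowed connections stay cheap to hold open. As a toy illustration of the slowing idea only, here is a thread-per-connection sketch; the port, banner, and reputation check are assumptions, and a production version would use non-blocking IO instead of threads.

    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Greet suspicious SMTP clients one byte at a time with a delay, so
    // impatient bots give up while legitimate servers (which wait) are
    // merely slowed.
    public class Tarpit {
        static boolean looksLikeBotnet(Socket s) {
            return true; // assumption: stand-in for a real reputation check
        }

        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(2525)) {
                while (true) {
                    Socket client = server.accept();
                    new Thread(() -> {
                        try (OutputStream out = client.getOutputStream()) {
                            byte[] banner = "220 mx.example.com ESMTP\r\n".getBytes("US-ASCII");
                            long delayMs = looksLikeBotnet(client) ? 1000 : 0;
                            for (byte b : banner) {
                                out.write(b);
                                out.flush();
                                Thread.sleep(delayMs); // the "financial friction"
                            }
                        } catch (Exception ignored) {
                        }
                    }).start();
                }
            }
        }
    }

A one-thread-per-connection design like this collapses at scale, which is exactly the motivation for the asynchronous IO techniques the article goes on to develop.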

Superorganism - Wikipedia, the free encyclopedia. [Image captions: a termite mound made by the cathedral termite; a coral colony.] A superorganism is an organism consisting of many organisms. The term was originally coined by James Hutton (1726-1797), the "Father of Geology," in 1789. It is now usually taken to mean a social unit of eusocial animals, where division of labour is highly specialised and where individuals are not able to survive by themselves for extended periods of time. The Gaia hypothesis of James Lovelock[2], and the work of James Hutton, Vladimir Vernadsky, and Guy Murchie, have suggested that the biosphere itself can be considered a superorganism. Superorganisms are important in cybernetics, particularly biocybernetics. Superorganic in social theory: similarly, economist Carl Menger expanded upon the evolutionary nature of much social growth, but without ever abandoning methodological individualism. The term "superorganic" was adopted by the anthropologist Alfred L. Kroeber.
