Hadoop


FileMap: File-Based Map-Reduce

FileMap is a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files. It provides full map-reduce functionality without requiring that you switch your processing to any particular language or runtime environment, install any special software, or have root access on your storage and processing nodes. http://mfisk.github.com/filemap/
http://code.google.com/p/octopy/ Inspired by Google's MapReduce and Starfish for Ruby, octo.py is a fast-n-easy MapReduce implementation for Python. Octo.py doesn't aim to meet all your distributed computing needs, but its simple approach is amenable to a large proportion of parallelizable tasks. If your code has a for-loop, there's a good chance that you can make it distributed with just a few small changes.

octopy - Project Hosting on Google Code
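
To illustrate the "for-loop to map-reduce" idea from the octo.py description, here is a minimal sketch in plain Python. It does not use octo.py itself: the names `mapfn`, `reducefn`, `run_mapreduce`, and `source` are hypothetical, and a local thread pool stands in for remote workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def mapfn(key, value):
    """Map step: emit (word, 1) for every word in one line of text."""
    for word in value.split():
        yield word.lower(), 1


def reducefn(key, values):
    """Reduce step: sum the counts collected for one word."""
    return sum(values)


def run_mapreduce(source, mapfn, reducefn, workers=4):
    """Map every (key, value) item in source, group by key, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task materializes the pairs emitted for one input item.
        mapped = pool.map(lambda item: list(mapfn(*item)), source.items())
        groups = defaultdict(list)
        for pairs in mapped:
            for k, v in pairs:
                groups[k].append(v)
        # One reduce task per distinct key.
        reduced = pool.map(
            lambda kv: (kv[0], reducefn(kv[0], kv[1])), groups.items()
        )
        return dict(reduced)


# The original "for-loop" would iterate over these lines one by one.
source = {0: "the quick brown fox", 1: "the lazy dog", 2: "The fox"}
counts = run_mapreduce(source, mapfn, reducefn)
```

The body of the original loop becomes `mapfn`, and whatever the loop accumulated becomes `reducefn`; the driver is the only part that needs to know where the work actually runs.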

http://www.galagosearch.org/guide.html Warning: Some of this text is out of date and refers to an older version of Galago.

Galago Guidebook

http://code.google.com/p/qizmt/ MySpace Qizmt is a mapreduce framework for executing and developing distributed computation applications on large clusters of Windows servers. The MySpace Qizmt project develops open-source software for reliable, scalable, super-easy distributed computation. MySpace Qizmt core features include:

qizmt - Project Hosting on Google Code

Mapreduce Bash Script

One night at the pub we discussed whether one could replace Hadoop (a massive and comprehensive implementation of Mapreduce) with a single bash script, an awk command, sort, and a sprinkling of netcat. http://blog.last.fm/2009/04/06/mapreduce-bash-script
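
The shape of such a pipeline (map, then sort, then reduce) can be sketched in a single process. This is not the script from the blog post: it is a Python stand-in, assuming tab-separated key/value lines, with `sorted()` playing the role of `sort` and no netcat-style distribution. The function names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Map: emit one 'word<TAB>1' line per word, like an awk one-liner would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Reduce: sum counts over runs of equal keys; input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


def pipeline(lines):
    """Equivalent in spirit to: mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


result = pipeline(["a b a", "b c"])
```

Sorting is what makes the reduce step simple: once equal keys are adjacent, the reducer only has to watch for the key changing.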

cloudmapreduce - Project Hosting on Google Code

Cloud MapReduce was initially developed at Accenture Technology Labs. http://code.google.com/p/cloudmapreduce/
https://issues.apache.org/jira/browse/MAPREDUCE-64

[MAPREDUCE-64] Map-side sort is hampered by io.sort.record.percent

One simple way might be to simply add TRACE level log messages at every collect() call with the current values of every index plus the spill number [...] That could be an interesting visualization.
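
The logging idea in that comment can be sketched with a toy buffer. This is an illustration only, not Hadoop's code: `ToyCollector`, its fields, and its spill rule are hypothetical, and `TRACE` is a custom level because Python's `logging` module has no built-in one.

```python
import logging

TRACE = 5  # custom level below DEBUG (10)
logging.addLevelName(TRACE, "TRACE")
log = logging.getLogger("toy.collector")


class ToyCollector:
    """Hypothetical stand-in for a map-side sort buffer."""

    def __init__(self, capacity):
        self.capacity = capacity   # bytes the buffer may hold before spilling
        self.records = []
        self.bytes_used = 0
        self.spill_number = 0

    def collect(self, key, value):
        size = len(key) + len(value)
        if self.bytes_used + size > self.capacity:
            self.spill()
        self.records.append((key, value))
        self.bytes_used += size
        # One TRACE line per collect() call: current indices plus spill number.
        log.log(TRACE, "collect spill=%d records=%d bytes_used=%d",
                self.spill_number, len(self.records), self.bytes_used)

    def spill(self):
        self.records.sort()        # a real implementation would write to disk
        self.records.clear()
        self.bytes_used = 0
        self.spill_number += 1


logging.basicConfig(level=TRACE)   # make TRACE messages visible
collector = ToyCollector(capacity=10)
for k, v in [("aa", "bbb"), ("cc", "ddd"), ("ee", "fff")]:
    collector.collect(k, v)
```

Each log line carries enough state to plot buffer usage against the spill number afterwards, which is the kind of visualization the comment hints at.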