Using the Cassandra Bulk Loader.

Transparent data encryption. Transparent data encryption (TDE) protects data at rest. TDE requires a secure local file system to be effective.

Apache Cassandra & Apache Spark for time series data. Lightning-fast analytics with Spark and Cassandra. Multi-data-center Apache Spark and Apache Cassandra benchmark.

Data processing platform architectures with SMACK: Spark, Mesos, Akka, Cassandra and Kafka. This post is a follow-up to the talk given at the Big Data AW meetup in Stockholm, focused on different use cases and design approaches for building scalable data processing platforms with the SMACK (Spark, Mesos, Akka, Cassandra, Kafka) stack.
While the stack is quite concise and consists of only a few components, it can support many different system designs, covering not only pure batch or stream processing but more complex Lambda and Kappa architectures as well. So let's start with a really short overview to get on the same page, then continue with designs and examples drawn from production project experience.

spark-cassandra-connector/14_data_frames.md at master · datastax/spark-cassandra-connector.

Installing the Cassandra / Spark OSS Stack. As mentioned in my portacluster system imaging post, I am performing this install on 1 admin node (node0) and 6 worker nodes (node[1-6]) running 64-bit Arch Linux.
Most of what I describe in this post should work on other Linux variants with minor adjustments. Overview: when assembling an analytics stack, there are usually myriad choices to make.

Installing the DataStax Distribution of Apache Cassandra 3.x on RHEL-based systems.

Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack. When we first made MarkedUp Analytics available on an invite-only basis back in September, we had no idea how quickly the service would be adopted.
By the time we fully opened MarkedUp to the public in December, our business was going gangbusters. But we ran into a massive problem by the end of November: it was clear that RavenDB, the database we had chosen while prototyping our service, wasn’t going to be able to keep growing with us. So we had to find an alternative database and data analysis system, quickly! The nature of analytic data: the first place we started was by thinking about our data, now that we were moving out of the “validation” phase and into the “scaling” phase of our business.
Crash - Cassandra Commit and Recovery on a Single Node.

Cassandra Hits One Million Writes Per Second on Google Compute Engine. Google is known for creating scalable, high-performance systems.
In a recent blog post, we demonstrated how Google Cloud Platform can rapidly provision and scale networking load to handle one million requests per second. A fast front end without a fast backend has limited use, so we decided to demonstrate a backend serving infrastructure that could handle the same load. We looked at popular open-source building blocks for cloud applications and chose Cassandra, a NoSQL database designed for scale and simplicity. Using 330 Google Compute Engine virtual machines, 300 1TB Persistent Disk volumes, Debian Linux, and DataStax Cassandra 2.2, we were able to construct a setup that can:

Cassandra data modeling - Practical considerations @ Netflix.
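As a quick sanity check on the Compute Engine benchmark above, here is a back-of-the-envelope division. The totals (one million writes per second, 330 VMs, 300 disks) come from the excerpt; the per-node rates are my own arithmetic, not figures from the post.

```python
# Rough per-node throughput implied by the benchmark excerpt above.
total_writes_per_sec = 1_000_000
vms = 330
disks = 300

writes_per_vm = total_writes_per_sec / vms      # average load per VM
writes_per_disk = total_writes_per_sec / disks  # average load per Persistent Disk

print(round(writes_per_vm), "writes/s per VM")
print(round(writes_per_disk), "writes/s per disk")
```

That works out to roughly three thousand writes per second per VM, a useful reminder that the headline number comes from horizontal scale, not from any single extraordinary node.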
Does CQL support dynamic columns / wide rows? The transition to CQL has been tough for some people who were used to the existing Thrift-based data model.
A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but to make it easier and more accessible. A note on terminology: part of the confusion comes from the Thrift API using the same terms as CQL/SQL to mean different things. To avoid ambiguity, I will only use these terms in the CQL sense here. So when someone asks “does CQL support dynamic columns?”, the answer is yes.

Download NoSQL Apache Cassandra. What is DataStax Community Edition Apache Cassandra?
DataStax Community Edition is a free packaged distribution of Apache Cassandra made available by DataStax. There’s no faster, easier way to get started with Apache Cassandra than to download, install, and use DataStax Community Edition. Brand new to Apache Cassandra and need a tutorial? Get started with the 10-minute walkthrough for developers and administrators. DataStax Community Edition consists of several components:
• The “Most Stable and Recommended Release”, “Latest Development Release”, or “Archive Release” of Apache Cassandra
• DataStax OpsCenter monitoring tool (included in the Windows .MSI installer packages; click here to download & set up OpsCenter for other operating systems)
• Sample application and demo database
• Smart installers for Linux, Windows, and Macintosh
• Easy uninstall of DataStax Community Edition
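To make the earlier point about CQL wide rows concrete, here is a minimal sketch of how a table with a clustering column reproduces the Thrift "dynamic columns" layout. The schema and all names are invented for illustration; the Python below only models the partition layout conceptually, it does not talk to a cluster.

```python
# Illustrative CQL (not executed here):
#
#   CREATE TABLE sensor_readings (
#       sensor_id  text,
#       event_time timestamp,
#       value      double,
#       PRIMARY KEY (sensor_id, event_time)  -- (partition key, clustering key)
#   );
#
# Conceptually, each partition is an ordered map of clustering key -> cells,
# which is exactly the layout "dynamic columns" gave you under Thrift.

partitions = {}  # sensor_id -> {event_time: value}

def insert(sensor_id, event_time, value):
    """Add one clustered row (one 'dynamic column') to a partition."""
    partitions.setdefault(sensor_id, {})[event_time] = value

insert("sensor-42", "2013-01-01T00:00:00", 21.5)
insert("sensor-42", "2013-01-01T00:01:00", 21.7)

# One partition, two clustered rows -- a (very small) wide row:
print(len(partitions["sensor-42"]))
```

Every new timestamp adds a cell to the same partition, so a single `sensor_id` can accumulate an arbitrarily wide row without any schema change.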
Using Cassandra for real-time analytics: part 2. In part 1 of this series, we talked about the speed-consistency-volume trade-offs that come with the implementation choices you make in data analytics, and why Cassandra is a great choice for real-time analytics. In this post, we’re going to dive a little deeper into the basics of the Cassandra data model and illustrate it with the help of MarkedUp’s own model, followed by a short discussion of our read and write strategies. Once again, let's start from our LA Cassandra User Group meetup’s presentation deck on SlideShare.

The Official Rackspace Blog. EDITOR’S NOTE: This article describes an obsolete version of Apache Cassandra.
For tutorials covering modern Cassandra, please visit ….

Cassandra has received a lot of attention of late, and more people are now evaluating it for their organizations. As these folks work to get up to speed, the shortcomings in our documentation become all the more apparent. Easily the worst of these is explaining the data model to those with an existing background in relational databases. The problem is that Cassandra’s data model is different enough from that of a traditional database to readily cause confusion, and the misconceptions are matched in number by the different ways well-intentioned people try to correct them. Some folks will describe the model as a map of maps, or, in the case of super columns, a map of maps of maps. The problem is that it’s difficult to explain something new without using analogies, but confusing when the comparisons don’t hold up.

Advanced Time Series with Cassandra. Cassandra is an excellent fit for time series data, and it’s widely used for storing many types of data that follow the time series pattern: performance metrics, fleet tracking, sensor data, logs, financial data (pricing and ratings histories), user activity, and so on.
A great introduction to this topic is Kelley Reynolds’ Basic Time Series with Cassandra. If you haven’t read that yet, I highly recommend starting with it. This post builds on that material, covering a few more details, corner cases, and advanced techniques. Indexes vs. materialized views: when working with time series data, one of two strategies is typically employed: either the column values contain row keys pointing to a separate column family that holds the actual event data, or the complete set of data for each event is stored in the timeline itself.
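The two strategies just described can be sketched roughly as follows. All event shapes, keys, and timestamps are invented for illustration; the point is only the read-path difference between an index timeline and a denormalized one.

```python
import json

# Strategy 1: the timeline is only an index -- its column values hold row
# keys pointing into a separate column family with the actual event data.
event_data = {
    "evt-1": {"type": "click", "x": 3},
    "evt-2": {"type": "view"},
}
timeline_index = {1357000000: "evt-1", 1357000060: "evt-2"}

# Reading an event through the index costs a second lookup:
first_ts = min(timeline_index)
first_event = event_data[timeline_index[first_ts]]

# Strategy 2: the complete event is serialized (here as JSON) directly in
# the timeline itself, trading duplicated storage for a single read.
timeline_full = {ts: json.dumps(event_data[k])
                 for ts, k in timeline_index.items()}

print(json.loads(timeline_full[first_ts]))
```

The index keeps event data in one place (easy to update, cheap to reference from many timelines), while the denormalized timeline answers a range query in a single slice at the cost of duplicating each event into every timeline that mentions it.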
The top column family contains only a timeline index; the bottom, the actual data for the events. All event data is serialized as JSON in the column values.

Petits Retours Sur Cassandra - Jetoile. Following numerous presentations of Cassandra (given, among others, by Michaël Figuière) and an opportunity to look more closely at what really lies behind this implementation of a column-oriented NoSQL solution, in this article I will try to describe what I liked and what I did not like about Apache Cassandra. I should point out, however, that I have no real experience with the product, and I will therefore rely only on its official documentation, version 1.0, dated March 2, 2012. Furthermore, I will dwell only on the points that seemed interesting and noteworthy to me. For regular readers of this blog, I won't change my habits: I will simply give a free translation of the passages that interested me ;–). Note that many points are repeated in different places in this document, but that is also the case in the official documentation.

Cloudsoft/brooklyn-acunu. C* Summit 2013: Real World, Real Time Data Modeling. Real time analytics with Cassandra - Tom Wilkie. Hadoop - Performing bulk load in Cassandra with MapReduce.

BulkLoad to Cassandra with Hadoop. To bulk-load data into Cassandra using Hadoop, Cassandra introduces a new OutputFormat: BulkOutputFormat. Cassandra implements it in such a way that each map or reduce task (depending on the implementation) generates SSTables from the data provided and then streams them to Cassandra with sstableloader. Don’t worry, you need not know all these implementation details to use BulkOutputFormat; all you need to know is some job configuration and the basic Thrift calls to create columns and mutations.
Initial setup to write a Hadoop job with BulkOutputFormat; Cassandra-related configuration; using BulkOutputFormat. To start development, you need all the jars from Cassandra 1.1.x on the classpath, along with all the Hadoop-related jars. To execute the job, all the Cassandra jars must also be on Hadoop's classpath. We then have to set the following properties on the Hadoop Configuration object, e.g.

The Apache Cassandra Project.