
Insight Data Engineering Ecosystem: An Interactive Map.

Enterprise service bus. All customer services communicate in the same way with the ESB: the ESB translates each message to the correct message type and sends it to the correct producer service.

Enterprise service bus

An enterprise service bus (ESB) is a software architecture model used for designing and implementing communication between mutually interacting software applications in a service-oriented architecture (SOA). As a software architectural model for distributed computing, it is a specialty variant of the more general client-server model and promotes agility and flexibility in communication between applications.
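The translate-and-route role described for the ESB can be sketched in a few lines. This is only a toy, in-process illustration under stated assumptions: `ServiceBus`, `register`, `send`, the two services, and their message formats are hypothetical names invented for this example and do not come from any real ESB product.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical types for this sketch: a translator converts the common
# message format into whatever a given producer service expects.
Translator = Callable[[dict], object]
Handler = Callable[[object], object]


class ServiceBus:
    """Toy bus: callers use one entry point; the bus translates and routes."""

    def __init__(self) -> None:
        self._routes: Dict[str, Tuple[Translator, Handler]] = {}

    def register(self, topic: str, translate: Translator, handler: Handler) -> None:
        # Each producer service declares how messages must be translated for it.
        self._routes[topic] = (translate, handler)

    def send(self, topic: str, message: dict) -> object:
        # Callers always pass the same dict format, whatever the target is.
        if topic not in self._routes:
            raise KeyError(f"no producer service registered for {topic!r}")
        translate, handler = self._routes[topic]
        return handler(translate(message))


def billing_service(payload: str) -> str:
    # Assumed to accept a JSON string.
    order = json.loads(payload)
    return f"billed {order['customer']} {order['amount']:.2f}"


def shipping_service(payload: tuple) -> str:
    # Assumed to accept a (customer, address) tuple.
    customer, address = payload
    return f"shipping to {customer} at {address}"


bus = ServiceBus()
bus.register("billing", json.dumps, billing_service)
bus.register("shipping", lambda m: (m["customer"], m["address"]), shipping_service)
```

A caller would then write `bus.send("billing", order)` or `bus.send("shipping", order)` with the same `order` dict, without knowing which format each service needs. A real ESB does this across a network and adds concerns such as queuing, security, and monitoring that this sketch leaves out.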

Its primary use is in enterprise application integration (EAI) of heterogeneous and complex landscapes. An ESB transports the design concept of modern operating systems to networks of disparate and independent computers. The prime duties of an ESB are:

Extract, transform, load. In computing, extract, transform, load (ETL) refers to a process in database usage, and especially in data warehousing, that: extracts data from homogeneous or heterogeneous data sources; transforms the data into the proper format or structure for querying and analysis; and loads it into the final target (a database, more specifically an operational data store, data mart, or data warehouse). Usually all three phases execute in parallel. Because data extraction takes time, a transformation process works on the data already received and prepares it for loading while more data is still being pulled. As soon as some data is ready to be loaded into the target, loading starts without waiting for the earlier phases to complete.
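The overlapping ETL phases described above can be sketched with one thread per phase, connected by queues. This is a minimal sketch under stated assumptions, not how any particular ETL tool works: the function names, the row format (`name`, `amount`), and the in-memory `target` list standing in for a warehouse are all hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of a stream


def extract(rows, out_q):
    # Extract: pull rows from the source and hand each one over immediately.
    for row in rows:
        out_q.put(row)
    out_q.put(_DONE)


def transform(in_q, out_q):
    # Transform: clean each row as soon as it arrives, while extraction continues.
    while True:
        row = in_q.get()
        if row is _DONE:
            out_q.put(_DONE)
            return
        out_q.put({"name": row["name"].strip().title(),
                   "amount": float(row["amount"])})


def load(in_q, target):
    # Load: write each transformed row without waiting for earlier phases to finish.
    while True:
        row = in_q.get()
        if row is _DONE:
            return
        target.append(row)


def run_pipeline(rows):
    # Small bounded queues keep the three phases roughly in step.
    extracted = queue.Queue(maxsize=2)
    transformed = queue.Queue(maxsize=2)
    target = []  # stand-in for the data warehouse
    stages = [
        threading.Thread(target=extract, args=(rows, extracted)),
        threading.Thread(target=transform, args=(extracted, transformed)),
        threading.Thread(target=load, args=(transformed, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target
```

Calling `run_pipeline` on a list of raw rows returns the cleaned rows in their original order, because each phase is a single thread reading from a first-in, first-out queue. The sketch shows the structure of the overlap only; in CPython, threads mainly help when the phases wait on I/O, as extraction and loading usually do.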

Extract, transform, load

ETL systems commonly integrate data from multiple applications (systems), typically developed and supported by different vendors or hosted on separate computer hardware.

Data lake. A data lake is a large storage repository and processing engine. Data lakes provide "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs".[1] The term was coined by James Dixon, Pentaho chief technology officer.[2] Dixon used the term initially to contrast with "data mart", which is a smaller repository of interesting attributes extracted from the raw data.

Data lake

He wrote: "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."[3] Dixon argued that data marts have several inherent problems, and that data lakes are the optimal solution.

Operational Intelligence, Log Management, Application Management, Enterprise Security and Compliance.

Data warehouse.

Data mart.

What is a Data Management Platform, or DMP? This is the latest in a series of articles that explain, in plain English, new technology tools and platforms that are changing the face of digital media.

What is a Data Management Platform, or DMP?

Our first entry covered DSPs. To suggest new entries, please email me at the address below.

Spark for Data Padawans Episode 1: a look at distributed data storage. If you've been anywhere near data in the past year or so, you must have heard about the war going on between Spark and Hadoop for total control over the management of large amounts of data.

Spark for Data Padawans Episode 1: a look at distributed data storage

We have a big announcement coming at Dataiku about Spark, so ever since I started working, that word has been popping up every day and I kept wondering what it could mean. I’ve already written about my limited technical background before arriving at Dataiku. Luckily, in the past month, I’ve had the opportunity to speak to all of our brilliant data scientists and developers, as well as a couple of data experts.

Spark for Data Padawans Episode 2: Spark vs Hadoop? The cat is out of the bag: Data Science Studio now integrates with Spark!

Spark for Data Padawans Episode 2: Spark vs Hadoop?

It's the perfect moment (I know, crazy good timing, right!) for me to continue my presentation of Spark for super beginners with episode 2: the birth of Spark and how it compares to Hadoop. As a reminder, this is episode 2 of my investigation into what the heck Spark is. These are the other episodes, including the upcoming episodes 2 and 3:

Spark for Data Padawans Episode 3: Spark vs MapReduce. After learning about Hadoop and distributed data storage, and what exactly Spark is, in the previous episodes, it's time to dig a little deeper to understand why, even if Spark is great, it isn't necessarily a miracle solution to all your data processing issues.

Spark for Data Padawans Episode 3: Spark vs MapReduce

It's time for Spark for super beginners episode 3! As always, I try to keep these articles as easy to understand as possible, but if you really are a super data padawan you probably need to have a quick look at episode 1 and episode 2 to understand what I'm talking about. You can always go back to a previous episode later.

After reading episode 1 and episode 2, Spark seems pretty great.

You’re probably thinking that it can only replace MapReduce and any other system out there, since it can process large volumes of data super fast and in a resilient manner, and it is particularly practical for machine learning on very large datasets. So why wouldn’t everyone set up Spark?!

Spark is fast. Do you need stream processing? Open sourced.