Data Governance

The 2 Types of Data Strategies Every Company Needs

More than ever, the ability to manage torrents of data is critical to a company’s success.

But even with the emergence of data-management functions and chief data officers (CDOs), most companies remain badly behind the curve. Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all. More than 70% of employees have access to data they should not, and 80% of analysts’ time is spent simply discovering and preparing data.

Data breaches are common, rogue data sets propagate in silos, and companies’ data technology often isn’t up to the demands put on it. Having a CDO and a data-management function is a start, but neither can be fully effective in the absence of a coherent strategy for organizing, governing, analyzing, and deploying an organization’s information assets.

Defense Versus Offense. Single Source, Multiple Versions.

Now That You’ve Gone Swimming in the Data Lake – Then What?

Data Lakes are a new paradigm in data storage and retrieval and are clearly here to stay.

As a concept they are an inexpensive way to rapidly store and retrieve very large quantities of data that we think we want to save but aren’t yet sure what we want to do with.

DataLake

Data Lake is a term that's appeared in this decade to describe an important component of the data analytics pipeline in the world of Big Data.

The idea is to have a single store for all of the raw data that anyone in an organization might need to analyze. Commonly people use Hadoop to work on the data in the lake, but the concept is broader than just Hadoop.

Data: consistency vs. availability - Data Management & Decision Support

There is a fundamental choice to be made when data is to be 'processed': a choice between consistency vs. availability, or a choice between work upstream vs. work downstream, or a choice between a sustainable (long term) view vs. an opportunistic (short term) view on data. Cryptic, I know.

Let me explain myself a bit. Let me take you on a short journey. Consistency vs. availability: I opt for a position skewed towards consistency. If I choose consistency, I need to validate the data before it is processed, right? "But, but, but...we need the data." What are the options? "But, but, but...we can't ask them that, they never change it, we gotta deal with the data as we receive it." Ah, so you want to process the data, despite the fact that it does not adhere to the logical model? You want to slide to the right on the spectrum of consistency vs. availability?

4 Quadrant Model for Data Deployment - Blog: Ronald Damhof

I have written numerous times about efficient and effective ways of deploying data in organizations.

How to architect, design, execute, govern and manage data in an organization is hard and requires discipline, stamina and courage by all those involved. The challenge increases exponentially when scale and/or complexity increases. On average a lot of people are involved and something of a common understanding is vital and unfortunately often lacking. As a consultant it is a huge challenge getting management and execution both on the same page, respecting the differences in responsibilities and the level of expertise regarding the field we are operating in. An abstraction functions as a means of communication across the organization. For data deployment I came up with the so-called '4 Quadrant model' (4QM). It starts with a basic assumption that data deployment starts with raw materials and ends up in some sort of product.

There is no single version of truth. There are many truths out there.

Data Wrangling, Information Juggling and Contextual Meaning, Part 2

The pragmatic definitions presented in part 1 of information as the subjectively interpreted record of personal experiences physically stored on (mostly) digital media, and data/metadata as information that has been modeled and “deconstructed” for ease of computer processing, offer an explanation as to why data wrangling or preparation can be so time-consuming for data scientists.

If the external material is in the form of data (for example, from devices on the Internet of Things), the metadata may be minimal or non-existent; the data scientist must then complete or deduce the context from any available metadata or the data values themselves. In the case of external, loosely structured information such as text or images, the data scientist must interpret the context from within the content itself and prior experience.

The importance of context for data wrangling (and, indeed, analysis) cannot be over-estimated. As seen above, context may exist both within formal metadata and elsewhere.

Data Wrangling, Information Juggling and Contextual Meaning, Part 1

“Data wrangling is a huge, and surprisingly so, part of the job,” said Monica Rogati, vice president for data science at Jawbone, in a mid-2014 New York Times article by @SteveLohr that I came across recently.

“At times, it feels like everything we do.” With all due respect to Ms. Rogati, the only surprising thing about this is her surprise.

Context becomes key - Now...Business unIntelligence

Wikipedia: Definition Data Governance

MIKE2.0

EIM Institute

Data Governance Institute (US)