
Database


Business intelligence industry trends. February 21, 2012

This is one of a series of posts on business intelligence and related analytic technology subjects, keying off the 2011/2012 version of the Gartner Magic Quadrant for Business Intelligence Platforms. The four posts in the series cover:

Besides company-specific comments, the 2011/2012 Gartner Magic Quadrant for Business Intelligence (BI) Platforms offered observations on overall BI trends in a “Market Overview” section.

I have mixed feelings about Gartner’s list. In particular:

- Not inconsistently with my comments on departmental analytics, Gartner sees actual BI business users as favoring ease of getting the job done, while IT departments are more concerned about full feature sets, integration, corporate standards, and license costs.
- However, Gartner says as a separate point that all kinds of users want relief from some of the complexity of BI, and really of analytics in general.

Here’s the forest that I suspect Gartner is missing for the trees: Let me be even more specific.

Sumo Logic and UIs for text-oriented data. February 6, 2012

I talked with the Sumo Logic folks for an hour Thursday. Highlights included:

- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text-indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/troubleshooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events.
- There’s some kind of streaming engine in the mix, to update counters and drive alerts (a toy sketch of that idea follows after this list).
- Sumo Logic has around 30 “customers,” free (mainly) or paying (around 5) as the case may be.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day.

What interests me about Sumo Logic is that automated classification story. The payoff is that machine learning directly informs the Sumo Logic user interface.
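Sumo Logic hasn’t published the internals of that streaming engine, so treat the following as a minimal sketch of the counters-plus-alerts idea only; the event classes and the ALERT_THRESHOLD value are invented for illustration.

```python
from collections import defaultdict

# Minimal sketch of a streaming counter that drives alerts.
# ALERT_THRESHOLD and the event classes are assumptions for the
# example; Sumo Logic's actual engine is not public.
ALERT_THRESHOLD = 100  # events per window, chosen arbitrarily

class StreamingCounter:
    def __init__(self, threshold=ALERT_THRESHOLD):
        self.counts = defaultdict(int)
        self.threshold = threshold

    def ingest(self, event_class):
        """Update the running count for an event class; alert on spikes."""
        self.counts[event_class] += 1
        if self.counts[event_class] == self.threshold:
            self.alert(event_class)

    def alert(self, event_class):
        print(f"ALERT: {event_class} crossed {self.threshold} events")

counter = StreamingCounter(threshold=2)
for event in ["login_failure", "disk_full", "login_failure"]:
    counter.ingest(event)  # second login_failure fires the alert
```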

MarkLogic’s Hadoop connector. November 3, 2011

It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector. Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:

- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established, simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely by addressing any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request.

Otherwise, the whole thing is just what you would think:

- Hadoop can read from and write to MarkLogic, in parallel at both ends (a toy illustration follows after this list).
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
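The real connector is Java MapReduce code; purely to illustrate “parallel at both ends,” here is a toy Python sketch. SHARDS, fetch_docs, transform, and store_doc are hypothetical stand-ins, not MarkLogic or Hadoop APIs.

```python
from multiprocessing import Pool

# Conceptual sketch only: parallel readers pull documents from several
# cluster nodes, a per-document "map" step transforms them, and results
# are written back in parallel. All names here are invented.
SHARDS = ["node1", "node2", "node3"]  # hypothetical MarkLogic hosts

def fetch_docs(shard):
    """Pretend to read a batch of documents from one cluster node."""
    return [f"<doc from='{shard}' n='{i}'/>" for i in range(3)]

def transform(doc):
    """The per-document work Hadoop's map step would do."""
    return doc.upper()

def store_doc(doc):
    """Pretend to write a result document back to the cluster."""
    print("stored:", doc)

if __name__ == "__main__":
    with Pool(len(SHARDS)) as pool:
        # Read in parallel across shards, one reader per node...
        batches = pool.map(fetch_docs, SHARDS)
        docs = [d for batch in batches for d in batch]
        # ...transform in parallel, then write the results back.
        for out in pool.map(transform, docs):
            store_doc(out)
```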

Teradata Unity and the idea of active-active data warehouse replication. October 3, 2011

Teradata is having its annual conference, Teradata Partners, at the same time as Oracle OpenWorld this week. That made it an easy decision for Teradata to preannounce its big news, Teradata Columnar and the rest of Teradata 14. But of course it held some stuff back, notably Teradata Unity, which is the name chosen for replication technology based on Teradata’s Xkoto acquisition.

The core mission of Teradata Unity is asynchronous, near-real-time replication across Teradata systems. The point of “asynchronous” is performance. The point of “near-real-time” is that Teradata Unity can be used for high availability and disaster recovery, and further can be used to allow real work on HA and DR database copies.

Teradata Unity works request-at-a-time, which limits performance somewhat. Unity has a lock manager that makes sure updates are applied in the same order on all copies, in cases where locks are needed at all (a toy sketch of that ordering idea follows below). As Teradata tells it, Teradata Unity has two key aspects:
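Teradata hasn’t detailed Unity’s lock manager, so the sketch below illustrates only the “same order on all copies” idea, using an invented global sequencer; the class names and update strings are assumptions for the example.

```python
import itertools

# Toy sketch: a sequencer stamps every update, and each copy applies
# updates strictly in stamp order, even if they arrive out of order.
class Sequencer:
    def __init__(self):
        self._seq = itertools.count()

    def stamp(self, update):
        return (next(self._seq), update)

class Replica:
    def __init__(self, name):
        self.name = name
        self.next_seq = 0
        self.pending = {}  # updates that arrived out of order

    def receive(self, stamped):
        seq, update = stamped
        self.pending[seq] = update
        # Apply every update whose turn has come, in global order.
        while self.next_seq in self.pending:
            print(f"{self.name} applies #{self.next_seq}:",
                  self.pending.pop(self.next_seq))
            self.next_seq += 1

seq = Sequencer()
a, b = Replica("copy_A"), Replica("copy_B")
u1 = seq.stamp("UPDATE t SET x=1")
u2 = seq.stamp("UPDATE t SET x=2")
a.receive(u1); a.receive(u2)
b.receive(u2); b.receive(u1)  # out-of-order delivery, same final order
```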

The Vertica story (with soundbites!). June 20, 2011

I’ve blogged separately that: And of course you know:

- Vertica (the product) is columnar, MPP, and fast.*
- Vertica (the company) was recently acquired by HP.**

*Similar things seem true of ParAccel, but most of the other serious columnar analytic DBMS aren’t actually MPP (Massively Parallel Processing) yet.
**Vertica says it has a “staggering” pipeline now that it’s been with HP for a few months.

As for product maturity:

- Vertica 4.0 cleaned up a lot of stuff.
- Vertica 5.0 goes further in a variety of areas, notably clustering administration and database tuning/design.

But here’s something I hadn’t fully realized — Vertica claims concurrent usage as a competitive strength:

- Vertica says that it has some customers with 1000s of users, in BI/dashboarding kinds of applications.
- Vertica asserts it can support 1000 users on a single appliance rack.
- Vertica tries to drive POCs (Proofs Of Concept) towards testing concurrency.

Dirty data, stored dirt cheap. June 4, 2011

A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it. Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets.

Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code. So the question arises — why would you want to spend serious money to look after your low-value data?

Hardware for Hadoop. June 4, 2011

After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:

- Hadoop nodes today tend to run on fairly standard boxes.
- Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
- The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.

A key input comes from Cloudera, who to my joy delegated the questions to Omer Trajman, who wrote: Most Hadoop deployments today use systems with dual socket and quad or hex cores (8 or 12 cores total, 16 or 24 hyper-threaded).

Storage has increased as well, with 6-8 spindles being common and some deployments going to 12 spindles. These are SATA disks with between 1 TB and 2 TB of capacity. Bullet points from that year-ago link include: So basically we’re talking in the range of 2-3 GB of RAM per core — and 1 spindle per core, up from perhaps half a spindle per core a year ago (the quick arithmetic is sketched below).
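Working those figures through as a back-of-envelope check — note that the RAM totals below are inferred from the 2-3 GB/core ratio, not stated in the post:

```python
# Back-of-envelope check on the hardware ratios quoted above.
cores = {"low": 8, "high": 12}        # dual socket, quad or hex core
spindles = {"common": 6, "high": 12}  # per Cloudera's figures

# Spindles per core: 6 spindles / 12 cores = 0.5, 12 / 12 = 1.0,
# matching "1 spindle per core, up from perhaps half."
print(spindles["common"] / cores["high"])  # 0.5
print(spindles["high"] / cores["high"])    # 1.0

# RAM implied by 2-3 GB per core (inferred, not given in the post):
for label, n in cores.items():
    print(f"{n} cores -> {2 * n}-{3 * n} GB RAM ({label}-end box)")
```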

Why you would want an appliance — and when you wouldn’t. June 2, 2011

Data warehouse appliances are booming. But Hadoop appliances are a non-starter. Data warehouse and other data management appliances are on the upswing:

- Oracle is pushing Exadata.
- Teradata* is going strong, and also recently bought Aster Data.
- IBM bought Netezza.
- Greenplum and Vertica were bought by EMC and HP respectively.

*As far as I’m concerned, all Teradata hardware-included systems are appliances.

In essence, there are two kinds of reasons to prefer appliances over software-only offerings:

- Technology.
- Internal politics.

“Technology” can include performance, price/performance, ease of installation, ease of administration, and so on. But it turns out technology isn’t the only reason to like appliances. Of course, similar considerations arise with SaaS (Software as a Service). But I’ll tell you one area where appliances seem to make little sense — running web-oriented, often scale-out, often open-source data management software.

Object-oriented database management systems (OODBMS). May 21, 2011

There seems to be a fair amount of confusion about object-oriented database management systems (OODBMS). Let’s start with a working definition:

An object-oriented database management system (OODBMS, but sometimes just called an “object database”) is a DBMS that stores data in a logical model that is closely aligned with an application program’s object model. Of course, an OODBMS will have a physical data model optimized for the kinds of logical data models it expects.

If you’re guessing from that definition that there can be difficulties drawing boundaries between the application, the application programming language, the data manipulation language, and/or the DBMS — you’re right. (A toy illustration of the definition follows below.) Examples of what I would call OODBMS include:

- InterSystems Caché, the most successful OODBMS product by far, with good OLTP (OnLine Transaction Processing) capabilities and a strong presence in the health care market.
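This is not any vendor’s API — just a toy Python sketch of the working definition above, in which the application’s objects (Patient and Visit, both invented here) are persisted directly as an object graph rather than mapped to tables.

```python
import pickle

# Toy illustration: the "database" stores application objects as-is,
# so there is no relational mapping layer. Conceptual sketch only.
class Patient:                      # the application's object model...
    def __init__(self, name, visits):
        self.name = name
        self.visits = visits        # nested objects, not foreign keys

class Visit:
    def __init__(self, date, note):
        self.date = date
        self.note = note

# ...is also the database's logical model: persist the object graph.
db = {}
db["patient:1"] = pickle.dumps(Patient("Ada", [Visit("2011-05-21", "ok")]))

restored = pickle.loads(db["patient:1"])
print(restored.visits[0].note)      # navigate object references directly
```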

Transparent sharding. February 24, 2011

When databases are too big to manage via a single server, responsibility for them is spread among multiple servers. There are numerous names for this strategy, or versions of it — all of them at least somewhat problematic. The most common terms include: (shared-nothing) MPP (Massively Parallel Processing), often used to describe analytic DBMS.

I plan to start using the term transparent sharding to denote a data management strategy in which data is assigned to multiple servers (or CPUs, cores, etc.), yet looks to programmers and applications as if it were managed by just one. (A minimal sketch of the idea follows below.)

- DbShards and ScaleBase feature transparent sharding (this is the case which inspired me to introduce the term).
- Anything which has ever reasonably been called a “shared-nothing” MPP DBMS features transparent sharding.
- Memcached features transparent sharding.

*One reason not to switch terms: “MPP” is marvelously concise.

What do you think of this terminology?
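As a minimal sketch of that definition: a hash-based router in Python that spreads keys across several stores while exposing a single get/put interface. The shard count and key scheme are arbitrary choices for the example, and in-memory SQLite stands in for real networked servers.

```python
import hashlib
import sqlite3  # stand-in for shard servers; real systems use networked DBMS

# Minimal sketch of transparent sharding: the router picks a server by
# hashing the key, but callers just say get/put against "one database".
class ShardedStore:
    def __init__(self, n_shards=3):
        self.shards = [sqlite3.connect(":memory:") for _ in range(n_shards)]
        for conn in self.shards:
            conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

    def _shard_for(self, key):
        """Deterministically map a key to one shard."""
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        conn = self._shard_for(key)
        conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get(self, key):
        row = self._shard_for(key).execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

store = ShardedStore()          # looks like one database to the caller
store.put("user:42", "Alice")
print(store.get("user:42"))     # "Alice", whichever shard holds it
```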

Revolution Analytics update. April 8, 2011

I wasn’t too impressed when I spoke with Revolution Analytics at the time of its relaunch last year, but a conversation Thursday evening was much clearer. And I even learned some cool stuff about general predictive modeling trends (see the bottom of this post). Revolution Analytics business and business model highlights include:

- Revolution Analytics is an open-core vendor built around the R language. That is, Revolution Analytics offers proprietary code and support, with subscription pricing, that help in the use of open source software.
- Unlike most open-core vendors I can think of, Revolution Analytics takes little responsibility for the actual open source part.
- Revolution Analytics’ top market sector by far appears to be financial services, both in trading/investment banks/hedge funds and in credit cards/risk analysis.

When I asked Revolution Analytics why one would use R rather than, say, SAS, Revolution cited three reasons that seemed to be driving customer interest:

PostgreSQL 8.4: Creating a Database.

The first test to see whether you can access the database server is to try to create a database.

A running PostgreSQL server can manage many databases. Typically, a separate database is used for each project or for each user. Possibly, your site administrator has already created a database for your use. He should have told you what the name of your database is. To create a new database, in this example named mydb, you use the following command:

$ createdb mydb

If this produces no response then this step was successful and you can skip over the remainder of this section. If you see a message similar to:

createdb: command not found

then PostgreSQL was not installed properly. Either it was not installed at all or your shell’s search path was not set to include it. Try calling the command with an absolute path instead:

$ /usr/local/pgsql/bin/createdb mydb

The path at your site might be different. Another response could be this:

createdb: could not connect to database postgres: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

This means the server was not started, or it was not started where createdb expected it. You can also create a database named after your current user name by running createdb with no argument:

$ createdb
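If you would rather test the connection from code, here is a hedged Python equivalent; it assumes the third-party psycopg2 driver is installed and that the server accepts local connections under your OS user name. The same failure modes as with createdb surface as exceptions.

```python
# Optional sanity check from Python; assumes psycopg2 is installed and
# the server listens on the default local socket/port.
import psycopg2

try:
    conn = psycopg2.connect(dbname="mydb")
    print("connected to", conn.get_dsn_parameters()["dbname"])
    conn.close()
except psycopg2.OperationalError as e:
    # Same failure modes as createdb above: server down or wrong socket.
    print("connection failed:", e)
```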

eXtremeDB

History: Later editions targeted the high-performance non-embedded software market, including capital markets applications (algorithmic trading, order matching engines) and real-time caching for Web-based applications, including social networks and e-commerce. Features added to support this focus include SQL with ODBC and JDBC interfaces, 64-bit support, and multiversion concurrency control (MVCC) transaction management.[4]

Product features: eXtremeDB supports a common feature set across its product family,[5] built around an in-process architecture — eXtremeDB runs in-process with an application, rather than as a database server that is separate from client processes. Other documented feature areas include application programming interfaces, database indexes, concurrency mechanisms, supported data types, and security; optional features include distributed database management, hybrid storage, transaction logging, SQL with ODBC/JDBC, and kernel-mode deployment.
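eXtremeDB’s own APIs are C/C++-centric and are not shown here; purely as an analogy for the in-process point, Python’s bundled sqlite3 module likewise runs the database engine inside the application process — no separate server, no socket.

```python
# Analogy only: sqlite3 (not eXtremeDB) illustrating an in-process
# database engine embedded in the application process.
import sqlite3

conn = sqlite3.connect(":memory:")  # no server process, no socket
conn.execute("CREATE TABLE quotes (sym TEXT, px REAL)")
conn.execute("INSERT INTO quotes VALUES ('HPQ', 35.2)")
print(conn.execute("SELECT px FROM quotes WHERE sym='HPQ'").fetchone())
```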

Architectural options for analytic database management systems. January 18, 2011

Mike Stonebraker recently kicked off some discussion about desirable architectural features of a columnar analytic DBMS. Let’s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general. But first, a few housekeeping notes:

- This is a very long post. Even so, to keep it somewhat manageable, I’ve cut corners on completeness.

- Most notably, two important areas are entirely deferred to future posts — advanced-analytics-specific architecture, and in-memory processing (including CEP).
- The subjects here are not strictly parallel.

OK.

- Relational/SQL support. Depending on your use case, you might have additional make-or-break requirements.
- Additional query functionality, of course with good performance.

Other possibly important features — but ones that would usually go on “nice to have” rather than “must have” lists — include: So what kinds of architectural choices (or major features) should one look to in order to support such features?

Analytic platforms defined. February 24, 2011

A few weeks ago, I described the elements of an “analytic computing system” or “analytic platform,” while reserving judgment as to which of the two terms would or should win out. I am now capitulating to the term analytic platform, under the influence of, among others, Sharmila Mulligan (and Aster Data in general), Vertica, and a variety of fellow analysts (Merv Adrian, Neil Raden, Seth Grimes, Jim Kobielus, and Colin White).

While Google evidence would suggest it’s way too early to make this call, I think it’s time to say “analytic platform” will win. What’s more, I now think the phrase “analytic platform” should win. While I think the term “platform” is overused to the point of silliness, at least the phrase “analytic platform” is short. To take this in the direction of an actual definition, I’ll say that the three essential elements of an analytic platform are:

- Strong support for analytic database query.

So what do you think?