I first heard the term five years ago, while Vertica was a tiny startup based in Boston. It seemed that a few risk-tolerant businesses in California were trying out this thing called Hadoop as a place to park data that they’d previously been throwing away. Many businesses have been throwing away all but a tiny portion of their data simply because they can’t find a cost effective place to store it. To these companies, Hadoop was a godsend.
And yet in some key ways, Hadoop is also extremely limited. Technology teams continue to wrestle with extracting value from a Hadoop investment. Their primary complaint? That there is no easy way to explore and ask questions of data stored in Hadoop. Technology teams understand SQL, but Hadoop provides only the most basic SQL support. I’ve even heard stories of entire teams resigning en masse, frustrated that their company has put them in a no-win situation – data everywhere and not a drop to drink.
Variations on the above story have undoubtedly played out at many companies across the globe. The common theme is that, love it or hate it, SQL is one of the core languages for exploration and inquiry of semi-structured and structured data. And most SQL on Hadoop offerings are simply not up to the task. As a result, we now have a gold rush of sorts, with multiple vendors rushing to build SQL on Hadoop solutions. To date, there are at least seven different commercial SQL for Hadoop offerings, and many organizations are learning about the very big differences between these offerings!
SQL is one of the core languages for exploration and inquiry of semi-structured and structured data.
In our many conversations with C-level technology executives, we’ve heard a common set of concerns about most SQL on Hadoop options. Some are significant. So, without further ado, here are the top five reasons SQL on Hadoop keeps CIO’s awake at night:
5. Is it secure? Really?
The initial appeal of the data lake is that it can be a consolidated store – businesses can place all their data in one place. But that creates huge risk because now…all the data is in one place. Therefore, our team has been working diligently a SQL on Hadoop offering that not only consists of core enterprise security features, but it also requires the ability to secure data in flight with such things as SSL encryption, integration with enterprise security systems such as Kerberos, and a column-level access model. If your SQL on Hadoop solution doesn’t offer these features, your data is at risk.
4. Does it support all the SQL you need?
Technically, SQL on Hadoop has been around for years now in the form of an open source project called Hive. Hive has its own version of SQL called HQL. Hive users frequently complain that HQL only supports a subset of SQL. There are many things you just can’t do. This requires all manner of data flow contortions as analysts must continually resort to other tools or languages for things that are very expressible in SQL…if only the Hadoop environment supported it.
This problem remains today, as many of the SQL on Hadoop variants do not support the full range of ANSI SQL. For example, our benchmark team regularly performs tests with the Vertica SQL on Hadoop product to ensure that it meets our standards for quality, stability and performance. One of the test suites we use is the TPC-H benchmark. For those not in the know, TPC-H is an industry standard benchmark with pre-defined SQL, schemas, and data. While our engine runs the full suite of tests, other SQL on Hadoop flavors that we’ve tested are not capable of running the entire workload. In fact, some of them only run 60% of the queries!
3. …And if it runs the SQL, does it run well?
It’s one thing to implement a SQL engine that can parse a bit of SQL and create an execution plan to go and get the data. It’s a very different thing to optimize the engine such that it does these things quickly and efficiently. I’ve been working with database products for almost thirty years now, and have seen over and over that the biggest challenge faced by any SQL engine is not creating the engine, but in dealing with the tens of thousands of edge cases that will arise in the real world.
For example, being aware of sort order in stored data on disk can dramatically improve query performance. Moreover, optimizing the storage of the data to leverage the sort sequence with something like run-length encoding can further improve performance. But not if the SQL engine doesn’t know how to deal with this. One example of an immature implementation is an engine that cannot use just-in-time decompression of highly compressed data. If the system has to pay the CPU penalty of decompressing highly compressed data every time it is queried, why bother compressing it in the first place, except maybe to save disk space? Also, if a user needs to keep extremely high-performance aggregations in sync with the transaction data, unless the engine has been written to manage the data this way, and be aware of the data characteristics at run-time, this simply won’t be possible.
These are just two examples. But it can make the difference between a query taking one second, or two days. Or worse, crashing when you try to run it because uncompressed data overflows the memory and crashes the database.
2. Does it just dump files to a file-system, or actively manage and optimize storage?
Projects built for Hadoop almost invariably pick up some of the “baggage” of using the core Hadoop functionality. For example, some of the SQL on Hadoop offerings just dump individual files into the filesystem as data is ingested. After loading a year of data, you’re likely to find yourself with hundreds of thousands of individual files. This is a performance catastrophe. Moreover, to optimize these files a person has to manually do something –write a script, run a process, call an executable, etc. This just adds to the real cost of the solution in terms of administrative complexity and design complexity to work around performance issues. What a business needs is a system which simplifies this by managing and optimizing files automatically.
1. Does it just dump files to a file-system, or actively manage and optimize storage?
There are certain fundamentals about databases that have made them so common for tracking key business data today. One of these things is called ACID compliance. It’s an acronym that doesn’t bear explaining here, so suffice it to say that one of the things an ACID-compliant database guarantees is that if two people ask the exact same question of the exact same data at the exact same time, they will get the same answer.
Seems kind of obvious, doesn’t it? And a common issue with SQL on Hadoop distributions is that they may lack ACID compliance. This isn’t so good for data science to create predictive models for growing the business, and certainly not suitable for producing financials! Caveat Emptor.
Many of our customers consider these five areas to be a benchmark for measuring SQL on Hadoop maturity. SQL on Hadoop offerings that fail to deliver these things will drive up the cost and time it takes to solve problems as analysts must use a mix of tools, work around performance and stability limitations, etc. And in the context of massive data thefts taking place today, how many CIOs feel comfortable with three petabytes of unsecured data pertaining to every single aspect of their business being accessible to anyone with a text editor and a bit of Java programming know-how?
The good news is that we at HP have been thinking of these concerns for years now. And working on solving them. Vertica SQL on Hadoop addresses each of these concerns in a comprehensive way, so organizations can finally unlock the full value of their data lake. We’re happy to tell you more about this, and we’d love for you to try it out!