When Hadoop tools disagree with each other

9 Dec '15

Big Data

Graph of 8-year spike

The above graph from our data warehouse system was recently tweeted out, and got some amused responses about 2007 wanting its data back. But there’s actually a good story here, and a chance to look at Hadoop’s dirty secrets.

Sawtooth graph

Picture if you will a graph showing the time between now and the last record to be inserted into a table (similar to the one above). For a regular batch operation, you’d expect a nice saw tooth graph - slowly ramping up for 20 minutes, followed by a sharp drop to around 5 minutes. The minimum value is the time it takes for the job to run, the height of the graph is how long between runs. The graph above shows different batch operations, with different timings.

Jagged sawtooth graph

But if we zoom in on that flat portion of the first graph, we don’t see a nice clean sawtooth graph, it’s actually rather jagged. To explain that lets look at the process it’s graphing.

Data is sqooped into HDFS from a standard RDBMS; it’s then processed in Hive, and finally we perform an insert overwrite into the production Hive table. But this graph is generated using Impala, the interface that our BI tool uses. In order for Impala to know about changed data in a Hive table, you need to perform a refresh to update the catalog server’s records. Since the Impala refresh command is out of sync with the imports, we get a jagged sawtooth graph.

But a few times a day you get something worse than that jagged sawtooth graph. A few times a day the refresh command will run during the Hive import. This causes bad things to happen.

Whilst the insert overwrite command in Hive is atomic as far as Hive clients are concerned, the file movement into the production area on HDFS can take a few minutes. So if the Impala refresh command runs during the Hive insert, it will only pick up files that have actually been written - which might be just the very first file, with data from the first few years of the company’s history.

And this is the dirty secret. Whilst you can read HDFS data using Hive, Spark, Impala, Pig, etc. Unless you’re using the same tool as the writer, you’re bound to hit this sort of problem pretty frequently.

So how do we resolve it?

Well, the first and obvious option is to use a single tool. Migrate all your pipelines to the same tool that your users read data through - Spark or Impala would be a good choice. Except Spark is hard for end-users and BI tools to use, and Impala means lots of SQL which is hard to unit test. Plus people in finance like SQL, whereas data scientists like R and Spark.

The other option is to copy all of your data into a second storage location, using whatever tool the users use. We’re doing this in limited cases, and it comes with other benefits too. For instance, queries over the past 12 months work well on Parquet files with large partitions whereas the regular moving in and out of small amounts of data works well with smaller partitions and RCfiles.

Cloudera have come up with another option recently, a workflow involving views to flip between tables. It largely depends on using Impala for both read and write, but could be extended to use a combination of tools if necessary. It’s a considerable amount of complication to what should be a simple process though.

Finally, if all we cared about was generating this graph then we could just emit the value when the pipeline runs. There are cases where this makes sense, but users generally care about having the full data rather than a single graph.

In truth, none of these options are perfect. But a combination of them is where we are planning on heading over the coming months.

Of course there is the option of doing nothing - only give your users access to data which isn’t changing, e.g. yesterday’s data. But isn’t the whole point of Hadoop that you can give more real-time access to all this wonderful data?

Want to read more about Big Data? Check out these related articles...

13 Jan '17

Scaling Time Series Databases

We collect a lot of metrics about our production systems using Graphite Times Series Databases. In order to improve performance of Graphite and reduce the load on our SAN we purpose-built and tuned some very vast dedicated hardware for our Graphite Databases.

Authors:

John Denholm and Gary Mulder

Time:

16 minute read

25 Nov '16

Our Top 10 Big Data News Sources

Keeping on top of an area of technology that is as rapidly moving as the big data ecosystem is hard. Our data tribe share some of their resources for keeping up to date.

Author:

Alice Kaerast

Time:

5 minute read

5 Apr '16

Big Data Spain or how I used my Tech Ninja Fund

We sent Software Engineer Iker Gomez to Big Data Spain conferences in Madrid to learn more about big data technologies and real-time processing.

Author:

Iker Gomez

Time:

9 minute read

3 Feb '16

Kafka Cluster Sizing

We’re starting to use Kafka for a number of projects. We can start off on virtual machines on our shared VMWare cluster, but we expect the disk IO to soon reach levels that will make it unsuitable for running on our shared storage. This post looks at some techniques for sizing up a physical Kafka cluster.

Author:

Alice Kaerast

Time:

3 minute read

Meet the author

Alice Kaerast

Senior Automation Engineer · Infrastructure Tribe

Worked at Sky Betting & Gaming 2012–2019

Alice played a key role in building our Hadoop-based data platform as a DevOps Engineer and then Architect, and then applied this engineering experience in the infrastructure tribe on Apache Kafka and other shared platforms.

Tweets by @SBGTechTeam