Big data conferences tend to be very product-driven: spaces for companies to sell their latest and greatest new big data platforms that are going to solve all of your problems. Even the most expensive conferences are paid for by sponsors more than ticket sales, and this leads to talks turning into sales pitches. There’s definitely a market for product-led conferences as a way to discover the commercial space and see the art of the possible, but it’s not what most data engineers want to see.
Big Data Technology Warsaw Summit was therefore quite eye-opening, with its rule of no product marketing. As a result, the talks and booths were mostly from companies who are currently hiring, and who therefore want to show off their current technology stack and their roadmaps. Alice recently went to the fourth, 2018 edition of this event, having won an expenses-paid trip from one of the organising companies - GetInData. What follows are some thoughts on some of the talks they attended.
ING have developed three key models plus enriched datasets and patterns that they can reuse with their business customers. This provides immense value and a quick start for their clients. They have used Google Sprints with companies to build and release something in a week as a way of prototyping an idea and proving value. One of the models they have built is a neural network that detects whether a name belongs to an individual or a business, in order to ensure they don’t adversely affect the privacy of individuals; this model has 99.5% accuracy, better than humans. Their platform is Airflow, Superset, Druid, Ceph and Spark, which is very similar to that of many of the companies present at Big Data Technology Warsaw Summit this year.
Uber have recently been doing their data science development work on big single-node machines rather than clusters. By adding very fast disks plus GPUs they can outperform distributed computing for not very much money, so long as the working datasets aren’t huge. For this they use TensorFlow on GPU and Spark on CPU. Since Spark was built before anyone thought GPU computing was going to be a thing, it’s somewhat lacking in the ability to use GPU hardware, but this is coming. Another option they’ve considered is the modern analytics databases which run on GPU hardware (e.g. MapD). However, these only support a subset of GPU, which requires rewriting queries, so Uber don’t consider them worthwhile.
KCell are a mobile phone operator in Kazakhstan. They had been using a commercial black-box solution to process and react to all of their streaming data. It was expensive, slow to change, and only coped with a very small throughput. So they worked with GetInData to build something in-house.
The new solution uses Apache NiFi for ingest and outgest, Flink for data processing, and Kafka for the message store. The raw data also gets dumped down to HDFS and then exported to Druid for querying. The whole solution is secured with Kerberos and Ranger, with ELK, InfluxDB and Grafana for monitoring. This new system can process 100 times more messages, plus it can scale up even further as required.
Flink is a data processing framework that has been around for a while now. It has both batch and streaming functionality, and has seen a number of improvements in recent years.
Flink is often compared to Spark, but Spark Streaming is actually micro-batching, which adds a latency of 100ms. It’s also sometimes compared to Kafka Streams, but Kafka Streams has a much lower throughput and also only supports Kafka as a data source. Flink is also better at handling much larger state than either Spark or Kafka Streams.
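To see where the extra latency in micro-batching comes from, here is a toy model. It is only a sketch: it assumes events wait until their batch window closes and ignores processing time entirely, and the function name and the 100ms interval are illustrative choices, not Spark's actual scheduler or configuration.

```python
import math


def microbatch_delay_ms(arrival_ms, batch_interval_ms):
    """Time an event waits for its micro-batch window to close (toy model)."""
    # The event is only handed to the engine at the next batch boundary.
    batch_close = math.ceil(arrival_ms / batch_interval_ms) * batch_interval_ms
    return batch_close - arrival_ms


# Hypothetical arrival times in milliseconds, with an assumed 100ms batch interval.
arrivals = [5, 40, 99, 100, 101, 250]
delays = [microbatch_delay_ms(t, 100) for t in arrivals]
worst_case = max(delays)
```

In this model an event that arrives just after a boundary waits almost a full interval, whereas a per-event streaming engine would start processing it straight away.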
The latest version of Flink adds end-to-end exactly-once delivery, building on Kafka’s new exactly-once semantics.
Thinking in Data Flows
This talk introduced the idea of Flow Based Programming. Both the lambda and kappa architectures inevitably end up dealing with the complexities of the technology being used, whereas Flow Based Programming brings us back to the flow of data, which is a more natural way of working. Building everything out of stateless DAGs composed of queues and microservices makes it clearer what is happening and easier to work with. Both Apache Streams and Apache NiFi go a long way towards this ideal, and to some extent some of the commercial tools like StreamSets cover some of the same things.
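As a rough illustration of the idea, the sketch below wires a few stateless functions together with queues, using only the Python standard library. Everything here (`stage`, `run_pipeline`, and the three example steps) is a hypothetical name made up for this example; it is not how NiFi or Apache Streams are implemented, and it only covers a linear flow rather than a general DAG.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply a stateless function to each message and forward the result."""
    while True:
        msg = inbox.get()
        if msg is SENTINEL:
            outbox.put(SENTINEL)  # pass the shutdown signal downstream
            return
        result = fn(msg)
        if result is not None:  # returning None means "drop this message"
            outbox.put(result)


def run_pipeline(messages, functions):
    """Connect the functions with queues and push the messages through them."""
    queues = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(functions)
    ]
    for t in threads:
        t.start()
    for m in messages:
        queues[0].put(m)
    queues[0].put(SENTINEL)

    out = []
    while True:
        msg = queues[-1].get()
        if msg is SENTINEL:
            break
        out.append(msg)
    for t in threads:
        t.join()
    return out


# Three tiny stateless steps: parse, filter, transform.
def parse(line):
    return int(line)


def keep_even(n):
    return n if n % 2 == 0 else None


def double(n):
    return n * 2


result = run_pipeline(["1", "2", "3", "4"], [parse, keep_even, double])
```

Because each step holds no state and only talks to its neighbours through a queue, any step can be replaced or scaled without touching the others, which is the property the talk was arguing for.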
Migrating large big data projects to the cloud is a hot topic at the moment, with some companies deciding the cloud is cheaper and others deciding that remaining on-premise is cheaper. The upcoming GDPR regulations scare companies considering the cloud, but the speaker pointed out that we already use the cloud daily in the form of email and Google Docs / Office 365, and even the telephone network can be considered a form of cloud.
Adopting the cloud for big data doesn’t have to be an all-or-nothing approach. In fact, it can be hard to model the costs of moving to the cloud so a small-scale trial is a great way to get a feeling for whether it is going to work for your organisation.
- Prepare connections to cloud stores - practically free, and opens up so much possibility
- Don’t reinvent the wheel, don’t build your own tooling to move data to the cloud
- Copy critical datasets to the cloud
- Start using BigQuery and GCS to show value (or the equivalent offerings from your cloud provider)
- Audit costs per query to predict future expenses of wider usage
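The last step above can be as simple as a spreadsheet, but a small script works too. The sketch below assumes your provider bills queries by bytes scanned; the price, the query names and the run counts are all placeholders invented for illustration, so substitute the figures from your own billing export.

```python
from dataclasses import dataclass

TIB = 2**40  # bytes in one tebibyte


@dataclass
class QueryRun:
    name: str
    bytes_scanned: int   # taken from the trial's query logs
    runs_per_month: int  # expected frequency under wider usage


def cost_per_run(run, price_per_tib):
    """Cost of a single execution, assuming billing by bytes scanned."""
    return run.bytes_scanned / TIB * price_per_tib


def monthly_projection(runs, price_per_tib):
    """Projected monthly cost per query if usage widens as expected."""
    return {r.name: cost_per_run(r, price_per_tib) * r.runs_per_month for r in runs}


PLACEHOLDER_PRICE = 5.0  # assumed price per TiB scanned, not a real quote

trial = [
    QueryRun("daily_report", TIB // 2, 30),
    QueryRun("adhoc_join", 2 * TIB, 4),
]
projection = monthly_projection(trial, PLACEHOLDER_PRICE)
total = sum(projection.values())
```

Sorting the projection by cost quickly shows which few queries would dominate the bill, and those are the ones worth optimising before rolling out more widely.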
You can and should start preparing for migration now. By packaging your deployments smartly you can make them easier to run not only in the cloud, but on different on-premise environments too. The use of Docker containers and publishing JARs to the likes of Artifactory can make this considerably easier, as can tools like Spotify’s Spydra, which abstracts the difference between on-premise Hadoop and Google Cloud.
Perhaps most important is not to treat the cloud as just more servers. Running long-lived clusters in the cloud is a big mistake: they are expensive, and they miss the point of only paying when you actually need to process data. Short-lived compute power also means you never need to patch it, and this is especially true if your individual processes are short-lived enough to use serverless.
Spark on K8s and dA Platform 2 (Flink on Mesos) are both coming soon, and both could be huge if you are ready to take advantage of them.
Druid is the columnar data store with realtime data loading and sub-second queries that the likes of Impala were mis-sold as. It allows analytics queries to be run across large amounts of data without the startup time that SQL on Hadoop typically has. Many companies at Big Data Technology Warsaw Summit 2018 were using Druid as an OLAP cube, exporting data to it nightly for summarised datamarts.
One particular example was an ad-serving company who ingest their data into a Hadoop data warehouse via Kafka and Spark. Streaming data into HDFS means many small files, and HDFS in the cloud is expensive, so each night they compress and export the data to OpenStack Swift as well as to a Druid cluster for online processing. Again, Superset was the tool of choice for visualising the data in Druid, with Airflow orchestrating all of this.
Big Data Maturity Levels
|Level|Description|
|---|---|
|0|Latent - no use of data|
|1|Analysis - humans make decisions with data|
|2|Learning - automated decisions disconnected from direct action|
|3|Acting - real time decisioning|
Once you get to level three your requirements change massively. You now require zero downtime and very low latency, something that is difficult to achieve in most distributed computing platforms.
Vespa is a Yahoo project that was funded to provide ranking and recommendation models and search. It can be interacted with via HTTP, or Java for bulk loading. It has been powering search on many Yahoo properties for a number of years, but has only recently been open sourced.
Airflow As A Service
Allegro were running a number of different schedulers around the company, including Oozie and Airflow. Oozie isn’t user-friendly, requires writing XML, requires compiling all of your software with extra Oozie libraries to make it work, and lacks any decent UI to tell what’s happening. The company was already using Airflow in places, so they opted to standardise on this and provide it as a service internally.
They provide lots of small instances of Airflow to keep things separate, and you can fill in a form to get a new instance with a free load balancer, monitoring, logging, metrics collection, and PagerDuty integration. The service is dockerised, and can run locally as well as on their Mesos platform. They found the Docker layer below Airflow was also useful for quickly developing any new application which requires access to data around the company. It’s relatively easy for analysts who want to schedule SQL queries to use Airflow and not worry about adhering to the company’s microservices contract, because all of the requirements are met for free.
Big Data Technology Warsaw Summit covered a good number of topics, concentrating on technology rather than products. As at Berlin Buzzwords, the technology was primarily open source, but more directly about big data and data warehousing. The organisers have clearly put a lot of thought into running this event and making it enjoyable, with both quiet and loud chillout spaces for between sessions and buffet food available at all times. Although the main summit was shorter than in previous years, this is still an event that should be at the top of everybody’s list for learning about the sort of technologies that other companies are using in the big data space.