“Life does not happen in batches …” introducing World of Streaming

There was a good turnout in the evening at this event, with speakers from MapR, Confluent and DataTorrent talking about streaming technology. Considering this was an evening after working hours, it was good to see a large, enthusiastic audience with almost equal representation of men and women. The audience was introduced to some key streaming technologies, such as MapR Streams, Apache Flink, Apache Kafka, Apache Kafka Connect and Apache Apex, and we learned how these streaming technologies enable real time analytics.

‘Streaming Technologies’ audience, October 2016

The event started with an explanation of why people increasingly work with streaming data and how a stream-first architecture affords both flexibility and the chance for real time insights. “Streaming data” refers to data from continuous events that is collected as a message stream and may also be processed as a stream.

Understanding the benefits streaming technologies provide is valuable for solving everyday challenges we sometimes take for granted. A comparison between old-school map navigation systems and today's phone-based navigation with real-time traffic feedback was an interesting reminder of how this technology lets us take advantage of the time value of data. At the heart of a modern, stream-based architecture is a message transport technology with the right capabilities. Two examples of this type of message transport are Apache Kafka and MapR Streams.

Event speakers Pramod Immaneni, Ellen Friedman and Gwen Shapira

The talk on MapR Streams highlighted similarities to Kafka, such as accepting data from multiple data producers and making it available to multiple consumers either immediately or later. This decoupling of producers and consumers is important: both MapR Streams and Apache Kafka support a microservices style of work. As with Kafka, MapR Streams allows data to be assigned to topics, which are partitioned for load balancing. The talk also highlighted differences between the two systems: first, in MapR Streams a topic partition is distributed across a MapR cluster rather than being limited to a single node, as in Kafka; second, MapR Streams manages topics at the stream level, so topics can be collected together and policies such as access control, time-to-live and replication applied to the whole stream; and third, streams can be replicated across geographically distributed data centers while maintaining offsets.

Another transportation example showed off this unique capability of MapR Streams for accurate geo-distributed replication: a ship traveling to different ports could collect sensor data from cargo containers on an onboard cluster and replicate it to onshore clusters in the current port and at the next destination. This helps ensure data is in the correct place at the correct time.

The topic of processing streaming data was introduced with a brief overview of Apache Flink. This real time stream processing engine has excellent windowing capabilities, can process by event time, and also works for batch processing.
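To make "windowing by event time" concrete, here is a minimal sketch in plain Python. It is not Flink's API and was not shown at the event; the window size, the event tuples and the function name are all made up for illustration. It only shows the idea that each event is assigned to a window by the timestamp it carries, not by when it arrives.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size, chosen for illustration


def tumbling_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per (window_start, key) using each event's own timestamp."""
    counts = defaultdict(int)
    for event_time, key in events:
        # Round the event timestamp down to the start of its window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Events are (event_time_in_seconds, sensor_id), listed in arrival order.
# The event stamped 59 arrives after the one stamped 61, yet it is still
# counted in the window that starts at 0.
events = [(5, "a"), (61, "a"), (59, "a"), (62, "b"), (130, "a")]
result = tumbling_window_counts(events)
```

A real engine such as Flink also has to decide when a window is complete enough to emit, which it does with watermarks; this sketch sidesteps that by processing a finite list.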

The talk on Apache Kafka provided a valuable understanding of data integration needs and uses. Citing an ever-growing ecosystem of tools and applications, the speaker highlighted interesting examples from her experience with customers to exemplify the flexibility Kafka brings by integrating with real-time, in-memory, containers, clouds and legacy systems. The talk also highlighted the key things that matter to customers, like reliability, timeliness, pull vs. push, scalability, data formats, security, and error handling, among others. The speaker explained in some detail how Kafka works, particularly the role of partitions for scalable consumption and how ordered consumption is carried out per partition. She also introduced Kafka Connect for large scale streaming import/export with Kafka and showed the large collection of connectors that have been developed.
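The role of partitions can be illustrated with a small simulation. This is a sketch in plain Python, not Kafka client code and not taken from the talk: the partition count, the message keys and the helper names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash a real client uses. It shows that messages with the same key land in the same partition, so a consumer reading that partition sees them in the order they were produced, while different partitions can be read in parallel.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a deterministic hash (stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions=NUM_PARTITIONS):
    """Append each (key, value) message to the log of its key's partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def values_for(key, partitions):
    """Read one key's values back from its partition, in log order."""
    log = partitions[partition_for(key, len(partitions))]
    return [value for k, value in log if k == key]


# Interleaved readings from three hypothetical trucks.
messages = [
    ("truck-1", 1), ("truck-2", 1), ("truck-1", 2),
    ("truck-3", 1), ("truck-2", 2), ("truck-1", 3),
]
partitions = produce(messages)
```

Ordering holds within a partition only; nothing here says how readings from different trucks are ordered relative to each other once they sit in different partitions.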

Blog contributor Srabasti Banerjee is a Software Developer with over ten years of experience in development, testing and implementation of application software using Big Data and Oracle technologies.

The talk on Apache Apex highlighted how Apex can serve as a powerful and versatile platform for big data processing. Common uses of Apex include big data ingestion, streaming analytics, ETL, fast batch processing, real-time alerts and actions, and threat detection. The talk covered the Apex platform architecture; features such as scalability and fault tolerance; how to develop an application on Apex using an API that provides high throughput with low latency; how to use pre-built connectors to external systems such as Kafka; and how to partition high-volume streams. There was also a brief overview of the productivity and operational tools that DataTorrent provides on top of Apache Apex in the DataTorrent RTS Enterprise Edition software.

Following the presentations, the audience raised thoughtful questions and the speakers provided detailed responses for an interesting discussion.

Thanks to the speakers (Ellen Friedman, Gwen Shapira and Pramod Immaneni), the organizers and the sponsors for the excellent content, food and giveaways!

Below are links to the presentations:
– Pramod Immaneni: Intro to Apache Apex

– Ellen Friedman: Streaming Goes Mainstream: Transport, Processing, and Architecture

– Gwen Shapira: Streaming Data Integration with Apache Kafka

– There is also a new short book on stream processing with Flink: Introduction to Apache Flink. A free PDF is available courtesy of MapR.