Big data technologies have grown rapidly over the past few years and have penetrated every domain and industry in software development; working with them has become a core skill for software engineers. Robust and effective big data pipelines are needed to support the growing volume of data and applications in the big data world. These pipelines have become business critical and help increase revenue and reduce costs.
Quality big data pipelines do not happen by magic. Building and maintaining them requires high quality designs that are scalable, reliable and cost effective.
How do you build an end-to-end big data pipeline that leverages big data technologies and practices effectively to solve business problems? How do you integrate them in a scalable and reliable manner? How do you deploy, secure and operate them? How do you look at the overall forest and not just the individual trees? This course focuses on this skill gap.
What are the topics covered in this course?
We start off by discussing the building blocks of big data pipelines, their functions and challenges.
We introduce a structured design process for building big data pipelines.
We then discuss individual building blocks, focusing on the design patterns available, their advantages, shortcomings, use cases and available technologies.
We recommend several best practices across the course.
Finally, we implement two use cases to illustrate how to apply the learnings from the course to real world problems: one batch use case and one realtime use case.
Introduction & Expectations
Discuss the need for quality design in big data pipelines. Explore the key activities involved in building such a design
Get familiar with the topics covered, the out-of-scope topics and the prerequisites for the course.
Discuss how serverless technologies from cloud providers relate to the contents of this course.
Building Blocks for Big Data Pipelines
Describe the overall pipeline network and the building blocks in the network
Discuss the features and challenges for the data acquisition block in a big data pipeline
Discuss the features and challenges for the data transport block in a big data pipeline
Discuss the features and challenges for the data processing block in a big data pipeline
Discuss the features and challenges for the data storage block in a big data pipeline
Discuss the features and challenges for the data serving block in a big data pipeline
Discuss the features and challenges for the pipeline infrastructure in a big data pipeline
Discuss the features and challenges for the operations block in a big data pipeline
System Design Process
Study the overall System Design Process to be followed for Big Data Pipeline Design
Explore the functional requirements provided for the use case and look for key indicators that require special attention for big data processing.
Analyze the input data to the big data pipeline to understand various characteristics like format, protocol and availability schedules
Analyze the non-functional requirements for the big data pipelines, especially those that relate to big data like scalability and fault tolerance
Create a pipeline flowchart that captures the steps and workflow needed to convert inputs to outputs
Add Big Data specific patterns and techniques to the flowchart and create a skeleton design
Analyze scaling of the skeleton architecture to ensure horizontal scalability and detect bottlenecks.
Choose the right technologies for the building blocks used in the solution
Design infrastructure, security and serviceability for the big data pipeline
Create a test strategy for testing the big data pipeline that covers regression, scaling and automation
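To make the test strategy concrete, here is a minimal sketch of a regression test for a single transformation step. It assumes pytest as the test runner, and the enrich_event function and its fields are purely illustrative, not part of the course's use cases.

```python
# A minimal, hypothetical regression test for one transformation step.
# enrich_event() and its field names are illustrative assumptions.
import pytest


def enrich_event(event: dict) -> dict:
    """Toy transformation: normalise the country code and add a derived field."""
    return {
        **event,
        "country": event["country"].upper(),
        "is_mobile": event.get("device", "").lower() in {"ios", "android"},
    }


def test_enrich_event_normalises_country():
    out = enrich_event({"user_id": "u1", "country": "de", "device": "iOS"})
    assert out["country"] == "DE"
    assert out["is_mobile"] is True


def test_enrich_event_handles_missing_device():
    out = enrich_event({"user_id": "u2", "country": "us"})
    assert out["is_mobile"] is False
```

Scaling and automation tests build on the same idea but run against the deployed pipeline with production-like data volumes.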
Scalable Pipelines - Design Principles
Compare the characteristics of Batch Pipelines and Realtime Pipelines and analyze suitability for use cases
Distributed architectures help ensure horizontal scalability for handling big data traffic. Discuss the key features and levers of distributed architectures
The principles of microservices architectures still apply when designing big data pipelines. Explore the key principles and how they carry over to big data pipelines.
Discuss key best practices when designing batch big data pipelines
Discuss key design practices when designing realtime big data pipelines
Explore the options for benchmarking performance for a big data pipeline
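As a starting point for benchmarking, the sketch below measures single-process throughput (records per second) for one pipeline stage. The stage and sample records are illustrative assumptions; a real benchmark would exercise the deployed pipeline end-to-end with production-like data.

```python
# A minimal throughput benchmark harness, assuming the stage under test can
# be called as a plain Python function over a batch of records.
import time


def benchmark(stage, records, repeats=3):
    """Run `stage` over `records` several times and report records/second."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        for record in records:
            stage(record)
        elapsed = time.perf_counter() - start
        rates.append(len(records) / elapsed)
    return {
        "min_rps": min(rates),
        "max_rps": max(rates),
        "avg_rps": sum(rates) / len(rates),
    }


if __name__ == "__main__":
    sample = [{"user_id": i, "amount": i * 1.5} for i in range(100_000)]
    print(benchmark(lambda r: {**r, "amount_cents": int(r["amount"] * 100)}, sample))
```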
Data Acquisition Design
Analyze the File Transfer Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Extraction Client Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Ingestion API Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies. (A minimal sketch of this pattern follows this module.)
Analyze the Pub Sub Pattern for Acquisition. Discuss its advantages, shortcomings, use cases and available technologies.
Explore Design Best Practices for Big Data Acquisition
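To illustrate the Ingestion API pattern referenced above, together with the common best practices of validating and de-duplicating events at the edge, here is a minimal sketch using Flask. The endpoint, payload fields and in-memory de-duplication set are illustrative assumptions, not a production design.

```python
# A minimal Ingestion API sketch: accept events over HTTP, validate them,
# and drop duplicates by event_id before handing off to the transport layer.
from flask import Flask, jsonify, request

app = Flask(__name__)
seen_ids = set()  # stand-in for a durable de-duplication store


@app.route("/ingest", methods=["POST"])
def ingest():
    event = request.get_json(force=True)
    if not event or "event_id" not in event:
        return jsonify({"error": "event_id is required"}), 400
    if event["event_id"] in seen_ids:
        return jsonify({"status": "duplicate"}), 200
    seen_ids.add(event["event_id"])
    # In a real pipeline the event would be handed to the transport layer here.
    return jsonify({"status": "accepted"}), 202


if __name__ == "__main__":
    app.run(port=8080)
```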
Data Transport Design
Analyze the Extract Load Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Request Response Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Event Streaming Pattern for Data Transport. Discuss its advantages, shortcomings, use cases and available technologies. (A minimal sketch of this pattern follows this module.)
Explore some Best Practices for Big Data Transport Design
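To ground the Event Streaming pattern referenced above, the sketch below shows a producer and a consumer-group consumer. Kafka and the kafka-python client are assumptions made for illustration (the module compares several transport technologies), and the topic, key and group names are hypothetical.

```python
# A minimal Event Streaming sketch, assuming Kafka and the kafka-python client.
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Key by user id so events for the same user land on the same partition,
    # preserving per-user ordering while still scaling across partitions.
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", key="user-42", value={"page": "/checkout", "ts": 1700000000})
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="enrichment-service",  # consumer groups enable horizontal scale-out
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
```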
Data Processing & Transformation Design
Explore several Data Processing Patterns that can be used for Big Data Processing Design.
Study how Big Data Processing Engines work behind the scenes to process data in a horizontally scalable manner
Discuss best practices for designing batch processing jobs for big data processing
Discuss best practices for designing stream processing jobs for big data processing
Study the differences between batch and realtime when it comes to processing jobs. Explore how the design changes based on this distinction
Discuss the importance and techniques for reading inputs and writing outputs in a scalable manner inside a processing job
Compare popular processing engine technologies available in the market today.
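As one concrete example of a batch processing job, here is a minimal sketch assuming PySpark as the chosen engine (this module compares several engines). The input path, columns and aggregation are illustrative assumptions.

```python
# A minimal batch job sketch, assuming PySpark; paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-purchase-aggregates").getOrCreate()

# Read a partitioned input in a splittable format so the engine can
# parallelise the scan across executors.
events = spark.read.parquet("s3://example-bucket/events/date=2024-01-01/")

daily_totals = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id")
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("purchases"))
)

# Write the output partitioned by a query-friendly key for the serving layer.
daily_totals.write.mode("overwrite").parquet(
    "s3://example-bucket/aggregates/daily/2024-01-01/"
)

spark.stop()
```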
Storage Design
Analyze the Distributed File System Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Relational Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Document Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Columnar Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Graph Database Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Distributed Cache Pattern for Data Storage. Discuss its advantages, shortcomings, use cases and available technologies. (A minimal sketch of this pattern follows this module.)
Discuss Data Storage Best Practices when building big data pipelines
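To illustrate the Distributed Cache pattern referenced above in its common cache-aside (read-through) usage, here is a minimal sketch. A plain Python dict with an expiry stands in for a real distributed cache such as Redis or Memcached, and the profile lookup is an illustrative placeholder.

```python
# A minimal cache-aside sketch: check the cache first, fall back to the
# backing store on a miss, and populate the cache for subsequent reads.
import time

CACHE_TTL_SECONDS = 300
_cache = {}  # user_id -> (cached_at, profile); stand-in for a distributed cache


def load_profile_from_store(user_id):
    """Placeholder for the slow path: a query against the backing data store."""
    return {"user_id": user_id, "segment": "gold"}


def get_profile(user_id):
    now = time.time()
    hit = _cache.get(user_id)
    if hit is not None and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # cache hit: serve without touching the store
    profile = load_profile_from_store(user_id)
    _cache[user_id] = (now, profile)       # populate on miss so later reads are cheap
    return profile


print(get_profile("user-42"))  # miss: loads from the store
print(get_profile("user-42"))  # hit: served from the cache
```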
Serving Design
Analyze the Query Interface Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies. (A minimal sketch of this pattern follows this module.)
Analyze the Serving API Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Push Client Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Analyze the Publish Subscribe Pattern for Data Serving. Discuss its advantages, shortcomings, use cases and available technologies.
Discuss Best Practices for Data Serving when building big data pipelines
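To illustrate the Query Interface pattern referenced above, the sketch below loads precomputed aggregates into a SQL store and serves a consumer query against them. sqlite3 stands in for an analytical store, and the table and columns are illustrative assumptions.

```python
# A minimal Query Interface sketch: precomputed aggregates are written to a
# SQL store and consumers query them directly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_user_totals (user_id TEXT, day TEXT, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO daily_user_totals VALUES (?, ?, ?)",
    [("user-1", "2024-01-01", 120.0), ("user-2", "2024-01-01", 35.5)],
)

# A consumer-facing query: top spenders for a given day.
rows = conn.execute(
    "SELECT user_id, total_amount FROM daily_user_totals "
    "WHERE day = ? ORDER BY total_amount DESC LIMIT 10",
    ("2024-01-01",),
).fetchall()
print(rows)
```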
Infrastructure and Deployments
Discuss the infrastructure technologies available for deploying and operating big data technologies
Apply microservices deployment patterns to build and deploy the building blocks in a big data pipeline
Discuss the deployment options for deploying processing jobs in a big data pipeline. Compare their benefits and use cases
Discuss the deployment options for deploying databases and queues in a big data pipeline. Compare their benefits and use cases
Review the use cases where geographically distributed pipelines are needed. Discuss some best practices for such deployments
Security
Review the principles of building security by design into big data pipelines
Explore the options and best practices for securing external interfaces in a big data pipeline
Explore the options and best practices for securing data storage in a big data pipeline
Review the privacy considerations while dealing with data inside a pipeline and the best practices to protect private data
Discuss the implications on data security and privacy when building multi-tenant applications. Review some best practices for securing multi-tenant applications
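As one concrete illustration of the privacy and multi-tenancy practices discussed in this module, here is a minimal sketch that pseudonymises user identifiers with a per-tenant HMAC key, so raw identifiers never enter the pipeline and tenants cannot be correlated with each other. The hard-coded key map is a deliberate simplification; a real design would use a secrets manager or KMS.

```python
# A minimal pseudonymisation sketch: per-tenant HMAC of user identifiers.
import hashlib
import hmac

TENANT_KEYS = {
    "tenant-a": b"tenant-a-secret-key",
    "tenant-b": b"tenant-b-secret-key",
}


def pseudonymise(tenant_id, user_id):
    key = TENANT_KEYS[tenant_id]
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# The same user id yields different pseudonyms per tenant, while the mapping
# stays stable within a tenant so joins and aggregations still work downstream.
print(pseudonymise("tenant-a", "alice@example.com"))
print(pseudonymise("tenant-b", "alice@example.com"))
```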
Serviceability
Review the elements of building end-to-end serviceability in the big data pipeline
Explore the components and workflow in a typical monitoring pipeline
Discuss several types of metrics that need to be collected and monitored when operating a big data pipeline
Discuss the types of problems encountered when operating a big data pipeline. Review the best practices for dealing with those issues
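As one concrete example of an operational check, the sketch below detects a growing backlog between what producers have written and what the pipeline has processed, and emits a metric plus an alert. The offset-fetching functions are illustrative placeholders; in practice these numbers come from the queue or monitoring APIs.

```python
# A minimal backlog-monitoring sketch; the fetch_* functions are placeholders.
import time

BACKLOG_ALERT_THRESHOLD = 100_000


def fetch_latest_offset():
    """Placeholder: newest offset/sequence number written by producers."""
    return 1_250_000


def fetch_committed_offset():
    """Placeholder: last offset the pipeline has successfully processed."""
    return 1_100_000


def check_backlog():
    lag = fetch_latest_offset() - fetch_committed_offset()
    metric = {"metric": "pipeline_backlog", "value": lag, "ts": int(time.time())}
    print(metric)  # in practice, emit to the monitoring pipeline
    if lag > BACKLOG_ALERT_THRESHOLD:
        print(f"ALERT: backlog of {lag} records exceeds threshold")


check_backlog()
```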
Use Case I : Customer Journey Analytics (CJA)
Define the business problem to solve for the Customer Journey Analytics use case
Study the requirements for the CJA use case to understand its inputs, outputs and processing requirements
Analyze the Input data for the use case to understand its format, protocol and availability
Study the non-functional requirements for the CJA use case to understand elements like scalability, security and resiliency
Draw a Pipeline Flowchart for the CJA use case, to design the data flow and processing aspects
Create a Skeleton Design for the CJA use case using the flowchart. Add design elements for Big Data patterns and integrations
Analyze the CJA skeleton architecture to ensure that the pipeline is horizontally scalable end-to-end. Look for potential bottlenecks
Select technologies for the building blocks used in the CJA pipeline. Use the selection criteria table to compare alternatives and choose the right technology
Design the deployment patterns, security measures and serviceability elements for the CJA pipeline
Use Case II : Suspicious Login Alerting (SLA)
Define the problem for the Suspicious Login Alerting Use Case
Study the Functional requirements for the SLA use case, including its inputs, outputs and processing requirements
Analyze the Input Data for the SLA use case to understand its format, source, protocol and limitations
Explore the non-functional requirements for the SLA use case, to understand the scalability, security and other needs
Use the requirements to draw a pipeline flowchart for the SLA use case, capturing the workflow and data processing steps
Enhance the pipeline flowchart for the SLA use case by adding patterns and scalable techniques to create a skeleton design