Just a place to put some notes on the “AWS Certified Solutions Architect - Associate
(New!)” course from https://linuxacademy.com
AWS Analytic Services
Image: Architecture: Analytic Services
Kinesis
Image: Architecture: Kinesis
Kinesis Essentials:
- Kinesis is a real-time data processing service that continuously captures
(and stores) large amounts of data that can power real-time streaming dashboards
- Using the AWS provided SDKs, you can
create real-time dashboards, integrate dynamic pricing strategies, and export
data from Kinesis to other AWS services
- Including:
-- EMR (analytics)
-- S3 (storage)
-- Redshift (big data)
-- Lambda (event driven actions)
Kinesis Components
- Stream
- Producers (data creators)
- Consumers (data consumers)
- Shards (processing power)
Kinesis Benefits:
- Real-time processing:
-- Continuously collect data and build applications that analyze it as it is
generated
- Parallel processing:
-- Multiple Kinesis applications can be
processing the same incoming data stream concurrently
- Durable:
-- Kinesis synchronously replicates the streaming data across three data
centers within a single AWS region and preserves the data for 24 hours by
default
- Scales:
-- Can stream from as little as a few
megabytes to several terabytes per hour
When to use Kinesis:
- Gaming:
-- Collect gaming data such as player
actions and feed the data into the gaming platform, for example a reactive
environment based off real-time actions of the player.
- Real-time analytics:
-- Collect IoT (sensor) data from many sources at high frequency and process it
using Kinesis to gain insights as data arrives in your environment
- Application alerts:
-- Build a Kinesis application that monitors incoming application logs in
real-time and triggers events based on the data
- Log / Event Data collection:
-- Log data from any number of devices and use a Kinesis application to
continuously process the incoming data, power real-time dashboards, and store
the data in S3 when completed
- Mobile data capture:
-- Mobile applications can push data to Kinesis from any number of devices,
which makes the data available as soon as it is produced
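As a rough illustration of the "application alerts" use case above, here is a minimal sketch in plain Python. It does not call Kinesis at all: the list of log lines stands in for records arriving on a stream, and the function name `alert_on_errors`, the `ERROR` keyword, and the sample log lines are all made up for the example.

```python
from typing import Iterable, Iterator


def alert_on_errors(log_lines: Iterable[str], keyword: str = "ERROR") -> Iterator[str]:
    """Yield an alert message for every incoming log line that contains `keyword`."""
    for line in log_lines:
        if keyword in line:
            yield f"ALERT: {line.strip()}"


# Hypothetical sample records standing in for a live stream of application logs
incoming = [
    "2024-01-01T00:00:00 INFO user login ok",
    "2024-01-01T00:00:01 ERROR payment service timeout",
    "2024-01-01T00:00:02 INFO page rendered",
]

alerts = list(alert_on_errors(incoming))
for alert in alerts:
    print(alert)
```

In a real Kinesis application the loop would read from the stream instead of a list, and the "alert" would probably be a notification or a Lambda invocation rather than a printed string.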
Kinesis Producers:
- Producers are devices that collect
data for Kinesis processing
- You build producers to continuously
input data into a Kinesis stream
- Producers can include (but are not limited to):
-- IoT Sensors
-- Mobile devices (cell phones)
- You can have literally thousands of
different producers and scale based on your need
-- The more data you want to process,
the more “shards” you add to your Kinesis stream
-- Each “shard” can process 2MB of read
data per second, and 1MB of write data per second
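The per-shard limits above make it possible to estimate how many shards a stream needs. The sketch below is only back-of-the-envelope arithmetic using the 1MB/s write and 2MB/s read figures from these notes; the function name `shards_needed` and the example throughput numbers are made up, and it ignores any other limits a real stream may have.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # MB/s of write capacity per shard (from the notes above)
READ_MB_PER_SHARD = 2.0   # MB/s of read capacity per shard (from the notes above)


def shards_needed(write_mb_per_sec: float, read_mb_per_sec: float) -> int:
    """Estimate the shard count from whichever limit (write or read) is tighter."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, by_write, by_read)  # a stream always has at least one shard


# Hypothetical workload: producers write 5 MB/s, consumers read 8 MB/s in total
print(shards_needed(5, 8))
```

Here the write side needs 5 shards and the read side only 4, so the write limit decides the answer.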
Kinesis Consumers:
- Consumers consume the stream’s data
- This is done concurrently (multiple
consumers can consume the same data at the same time)
- Consumers include (but are not limited
to):
-- Real-time dashboards
-- S3
-- Redshift (data warehouse)
-- EMR
- Any application (one you create) can consume the stream’s data
- Kinesis keeps 24 hours of streaming
data stored by default, but can be configured to store up to 7 days
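To make the retention window and the "multiple consumers, same data" idea more concrete, here is a toy in-memory model. It is not the Kinesis API: the `ToyStream` class, its methods, and the sample records are invented for illustration, and time is just a number of hours.

```python
from dataclasses import dataclass, field

DEFAULT_RETENTION_HOURS = 24  # default retention from the notes above


@dataclass
class ToyStream:
    """Toy stand-in for a stream: keeps (arrival_hour, data) pairs in memory."""
    retention_hours: float = DEFAULT_RETENTION_HOURS
    records: list = field(default_factory=list)

    def put(self, arrival_hour: float, data: str) -> None:
        self.records.append((arrival_hour, data))

    def read(self, now_hour: float) -> list:
        """Return the data that is still inside the retention window at `now_hour`."""
        cutoff = now_hour - self.retention_hours
        return [data for hour, data in self.records if hour > cutoff]


stream = ToyStream()
stream.put(0, "click:home")
stream.put(10, "click:cart")
stream.put(30, "click:checkout")

# Two independent consumers read at hour 30 and see the same records;
# the record from hour 0 has aged out of the 24-hour window.
dashboard_view = stream.read(now_hour=30)
archive_view = stream.read(now_hour=30)
print(dashboard_view)
```

Creating the stream with `ToyStream(retention_hours=168)` would model the 7-day setting, in which case the hour-0 record would still be readable at hour 30.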
Elastic Map Reduce
Image: Architecture: Elastic Map Reduce
Elastic MapReduce Essentials:
- Amazon EMR is a service that deploys EC2 instances running the Hadoop big
data framework
- EMR is used to analyze and process
vast amounts of data
- EMR also supports other distributed
frameworks, such as:
-- Apache Spark
-- HBase
-- Presto
-- Flink
General EMR Workflow
- Data stored in S3, DynamoDB, or
Redshift is sent to EMR
- The data is mapped to a “cluster”
of Hadoop Master/Slave nodes for processing
- Computations (coded/created by the
developer) are used to process the data
- The processed data is then reduced
to a single output set of return information
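The map-then-reduce flow described above can be sketched on a single machine with a word count. This is plain Python, not EMR or Hadoop code: the function names and the three sample "chunks" are made up, and on a real cluster each chunk would be processed on a different node.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk: str) -> Counter:
    """Map step: count the words in one chunk of the input."""
    return Counter(chunk.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


# Hypothetical input, already split into chunks
chunks = ["the quick brown fox", "the lazy dog", "the end"]

partials = [map_chunk(c) for c in chunks]               # one partial result per chunk
result = reduce(reduce_counts, partials, Counter())     # single combined output set

print(result.most_common(1))
```

The map calls are independent of each other, which is what lets a cluster run them in parallel; only the reduce step needs to see all the partial results.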
Other Important EMR Facts
- You (the admin) have the ability to
access the underlying operating system
- You can add user data to EC2 instances
launched into the cluster via bootstrapping
- EMR takes advantage of parallel
processing for faster processing of data
- You can resize a running cluster at
any time, and you can deploy multiple clusters
EMR Master node:
- A node that manages the cluster by
running software components which coordinate the distribution of data and tasks
among other (slave) nodes for processing
- The master node tracks the status of
tasks and monitors the health of the cluster
EMR Slave Nodes:
There are two types of slave nodes:
- Core node:
-- A slave node that has software components which run tasks AND store data in
the Hadoop Distributed File System (HDFS) on your cluster
-- The core nodes do the “heavy lifting”
with the data
- Task node:
-- A slave node that has software components which only run tasks
-- Task nodes are optional
EMR Map Phase:
- Mapping is a function that defines the
process which splits the large data file for processing
- During the mapping phase, the data is
split into 128MB “chunks”
- The larger the instance size used in your EMR cluster, the more chunks you
can map and process at the same time
- If there are more chunks than
nodes/mappers, the chunks will queue for processing
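The chunking and queueing described above comes down to simple arithmetic. The sketch below assumes the 128MB chunk size from these notes and treats every mapper as handling one chunk at a time; the function name `map_plan` and the example file size and mapper count are made up.

```python
import math

CHUNK_MB = 128  # chunk size from the notes above


def map_plan(file_mb: float, mappers: int) -> tuple:
    """Return (number of chunks, number of processing rounds) for one input file."""
    chunks = math.ceil(file_mb / CHUNK_MB)
    # If there are more chunks than mappers, the extra chunks queue for a later round
    rounds = math.ceil(chunks / mappers)
    return chunks, rounds


# Hypothetical job: a 10 GB (10,240 MB) file on a cluster with 16 mappers
print(map_plan(10_240, 16))
```

With these numbers the file splits into 80 chunks, so 16 mappers need 5 rounds to get through them; doubling the mappers would roughly halve the rounds.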
EMR Reduce Phase:
- Reducing is a function that aggregates
the split data back into one data source
- Reduced data needs to be stored (in a
service like S3) as data processed by the EMR cluster is not persistent
Quiz: Analytics Quiz
Q: If you want to process data in
real-time, what AWS service should you use?
A: Kinesis
E: Kinesis is AWS's service for processing data in real-time and outputting it to
a dashboard or other AWS services.
T: In EMR, data is mapped to a cluster
of master/slave nodes for processing.
Q: If your Kinesis stream needs
additional processing power, what component will you need to add more of?
A: Shards
E: You can scale out a Kinesis stream by adding more "shards".
Q: In what two scenarios would you want
to use AWS Kinesis?
A1: Mobile data capture
A2: Capturing gaming data.
E: Kinesis is great for collecting gaming data, such as player actions, and
capturing data from IoT sensors and mobile devices.
T: EMR is a service which deploys out
EC2 instances based on the Hadoop
framework, and also supports Apache
Spark, HBase, Presto, and Flink.
T: A Kinesis consumer can include AWS
services such as Redshift and S3.
E: Consumers can include Redshift and S3, but also other services like DynamoDB or
a real-time dashboard/Kinesis enabled app.
Q: What is the purpose of a Kinesis
producer?
A: To collect and send data into a
Kinesis stream.
E: Kinesis producers include things like IoT sensors and mobile devices that
collect data and send it into the Kinesis stream.
T: EMR allows you to access the
underlying operating system.