Welcome back to Backspace Academy. Amazon Kinesis is the AWS service for handling streaming live data content. It consists of Kinesis Data Streams for pure data, Kinesis Video Streams for streaming video content from devices, Kinesis Data Firehose, which will direct streams of data to durable storage such as Amazon S3, and finally Kinesis Data Analytics, which enables the processing and analytics of streaming data content.

Kinesis Data Streams collect and process large streams of data in real time, generated continuously by up to thousands upon thousands of data sources. Your applications can process the data in that stream sequentially and incrementally on a record-by-record basis if you want to handle every single record, or over a sliding time window, for example if you just want to process the last five minutes of data. It is not available on the free tier, and it is billed on a per-shard basis; we'll talk more about what a shard is in the next slide. It has server-side encryption, and it's very simple to implement, similar to Amazon S3: you create a key in the AWS Key Management Service and associate it with your stream.

Kinesis has a lot of unique benefits. It processes continuously arriving live data and analyzes that data on a live basis. It's a managed service. It is elastic, so you can scale it up and scale it down very simply. It's highly durable, being replicated across three Availability Zones. It's great for implementing process and safety alarms where you want to be alerted based upon evidence of a problem. Multiple applications can consume data from the same stream, and you can retain that data from a default of one day up to a maximum of seven days, which is at an additional cost.
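Just as a rough sketch of what that looks like with the Python SDK, boto3 (the stream name and KMS key alias below are placeholders, not values from this lecture):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption using a KMS key you have already created
# (stream name and key alias are placeholders for illustration).
kinesis.start_stream_encryption(
    StreamName="my-data-stream",
    EncryptionType="KMS",
    KeyId="alias/my-kinesis-key",
)

# Extend retention from the default 24 hours up to the 7-day maximum
# mentioned above (billed at an additional cost).
kinesis.increase_stream_retention_period(
    StreamName="my-data-stream",
    RetentionPeriodHours=168,
)
```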
A Kinesis data stream implementation will consist firstly of a producer that will collect data and continually push that live data into the Kinesis data stream, and then, on the output of the stream, consumers will be continuously processing that data in real time. A stream is made up of shards, and shards are simply mini streams within that stream, so the more shards you have inside a stream, the greater the capacity of that stream. Each shard is uniquely identified and contains a continuous sequence of data records. Those data records will have a sequence number to identify where they are located in the shard, a partition key which helps to identify which shard within the stream that record is located in, and finally a data blob which contains the data that we're going to be streaming. A stream can be created using the management console or using the CreateStream software development kit or command-line interface operation.

On the left-hand side there we've got producers, and they could be an EC2 instance, a desktop client, or a mobile application, provided it has an application that can communicate with the Kinesis service, such as the AWS software development kit or one of the libraries that are available for Kinesis, and that will continuously push data to the Amazon Kinesis stream.
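Here is a minimal sketch of creating a stream with boto3; the stream name is a placeholder and two shards is just an arbitrary starting capacity:

```python
import boto3

kinesis = boto3.client("kinesis")

# Create a stream with two shards (an arbitrary starting capacity for this example).
kinesis.create_stream(StreamName="my-data-stream", ShardCount=2)

# Stream creation is asynchronous; wait until it becomes ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="my-data-stream")

# Describe the stream to see its shards and their unique IDs.
description = kinesis.describe_stream(StreamName="my-data-stream")
for shard in description["StreamDescription"]["Shards"]:
    print(shard["ShardId"])
```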
Now, a lot of people can be confused about the concept of a shard, and the reason is that a lot of the diagrams are a little bit confusing. A shard is not a piece of data, it's not a finite piece of data; it's a continuous stream of data, so it is in itself a mini stream. So there you can see you've got your Amazon Kinesis stream in blue, and inside that will be these shards, which are themselves mini streams, and that data will pass through those shards. So the more shards you have, the more data you can get through your Kinesis stream; the fewer shards you have, obviously, the more restricted you're going to be. Those records coming from the producer will be pushed into those shards. On the output of our Amazon Kinesis stream we have consumers, and they are applications that will be reading continuously from our Kinesis stream, and they will then be forwarding that information over to either durable storage such as Amazon S3 or to a processing service such as Amazon EMR, or we could also have Amazon Kinesis Firehose, which can direct that stream to other AWS services such as Amazon S3.

Writing to Kinesis streams. Each shard can support up to 1,000 put records per second, and it provides a capacity of one megabyte per second, so you need to take that into consideration depending on the amount of data that you're going to be putting through your Kinesis stream. The number of shards required is specified when a stream is created. Now, when a producer pushes data to a stream it will need to provide a partition key, and that's simply a string that you define in your producer application, so it could be xyz1, xyz2, whatever, but you organize your data based upon this partition key, and then the Kinesis data service will map that partition key to a specific shard. So the next time that you push a record with that same partition key, it will go to the same shard, which makes it easier for your application on the other end to retrieve that data.
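To make that concrete, here is a hedged boto3 sketch of a producer pushing records; the stream name, partition keys and payloads are purely illustrative:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Write a single record; Kinesis maps the partition key to a shard,
# so records sharing a partition key land in the same shard.
kinesis.put_record(
    StreamName="my-data-stream",
    Data=json.dumps({"sensor": "xyz1", "reading": 21.5}).encode("utf-8"),
    PartitionKey="xyz1",
)

# Or batch up to 500 records in one call with PutRecords.
kinesis.put_records(
    StreamName="my-data-stream",
    Records=[
        {"Data": b'{"sensor": "xyz1", "reading": 21.7}', "PartitionKey": "xyz1"},
        {"Data": b'{"sensor": "xyz2", "reading": 19.2}', "PartitionKey": "xyz2"},
    ],
)
```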
A sequence number is a unique record identifier, and it is assigned by Kinesis; it's not assigned by your producer, it's assigned by the Kinesis service when that record is created, and we create records normally with the SDK using a PutRecords command. And finally, a shard iterator is supplied by the Kinesis service, and it identifies the position in the shard from which to start reading data records. So the process for a producer is that it will collect live data and run a PutRecords command, and that will need to have the name of the stream, it will need to supply a partition key, and obviously it will need to supply the data that is going into the stream.

Reading Kinesis streams. Each shard can support only five read transactions per second, so basically we're reading a chunk of records at a time, and one shard provides a capacity of two megabytes per second of data reads. The shard iterator is required to read a stream, and it specifies the position in the shard from which to start reading those data records sequentially. We can specify different shard iterator types depending on where in the stream we want to get this data: we can specify it at a specific sequence number, or after a specific sequence number; we can specify a specific point in time with a timestamp; we can use the trim horizon, which will give us the oldest data in the stream; or we can use the latest data if we want to get, say, the last five minutes of the latest data. Now, that shard iterator will be passed to GetRecords, and that can be done through the software development kit or it can be done for you through the Kinesis Client Library. GetRecords will return the milliseconds behind the latest record, it will also give the next shard iterator, and it will finally give an array of the records, which will include the timestamp for the records, the actual data, and the sequence number of the records. So the process, if we're using one of the software development kits or the Kinesis Streams API, is that we first need to call GetShardIterator, which will give us that shard iterator; we need to supply the ID of the shard, the shard iterator type (remember, whether it's at a timestamp, or the latest, or the oldest, or whatever), and then the actual name of the stream itself. Once we've provided that information to GetShardIterator, we can use the shard iterator it returns with a GetRecords command to actually retrieve those records.
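Putting those two calls together, a minimal consumer sketch with boto3 could look like this (the stream name and shard ID are assumptions for illustration):

```python
import boto3

kinesis = boto3.client("kinesis")

# Ask Kinesis for an iterator pointing at the oldest data in one shard.
# Other ShardIteratorType values include AT_SEQUENCE_NUMBER,
# AFTER_SEQUENCE_NUMBER, AT_TIMESTAMP and LATEST.
iterator_response = kinesis.get_shard_iterator(
    StreamName="my-data-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)

# Use the iterator to pull a chunk of records.
records_response = kinesis.get_records(
    ShardIterator=iterator_response["ShardIterator"],
    Limit=100,
)

print(records_response["MillisBehindLatest"])          # how far behind the tip of the stream we are
next_iterator = records_response["NextShardIterator"]  # pass this to the next GetRecords call
for record in records_response["Records"]:
    print(record["SequenceNumber"], record["ApproximateArrivalTimestamp"], record["Data"])
```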
So GetRecords, again, will return the milliseconds behind the latest and then the next shard iterator, so we can use that to then go and get our next lot of records, and it of course will also supply an array of those records.

There are a number of tools that can help us to develop and implement our Kinesis applications. The first one there is the AWS software development kits. We also have the Kinesis Client Library, which is recommended by AWS and is available for Java, Python, Ruby, .NET and Node.js, and that is for developing Kinesis consumer applications. We have the Kinesis Connector Library, which integrates streams with DynamoDB, Redshift, Amazon S3 and Elasticsearch. We also have the Kinesis Producer Library, which simplifies the combining of records, so if we've got a lot of small records it can combine those into larger records so we get the maximum effectiveness and efficiency out of our stream, and then it can write those to our streams, so it does simplify a lot of that heavy lifting for us. There is also the Kinesis Agent, which is a pre-built Java application that can be used to collect and send data to a stream, so it's great for creating producer applications. And finally we have the Kinesis Data Generator, which is great if you want to test your consumer application, because it can produce test data to a stream for you.

Kinesis is a fully managed service and makes it really easy to scale up or scale down depending on our circumstances, and that process of scaling is called resharding. We do that with an UpdateShardCount command in the software development kit or the command-line interface, and that will increase or decrease the number of shards that are in our stream. There are two types of resharding processes: a shard split will increase the shard count, and a shard merge will decrease the shard count. Now, the Kinesis Client Library can't initiate a resharding process, but it can adapt when it occurs, so if the number of shards decreases or increases it can adapt to that. Kinesis will also handle the mapping of partition keys to that new shard count. It is possible to automate the process of scaling Kinesis if you implement an SDK application that monitors the stream using CloudWatch and uses a CloudWatch alarm to initiate that resharding process.
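That resharding call is a one-liner in boto3; here is a sketch with the stream name and target count as illustrative values (you could trigger the same call from a Lambda function attached to a CloudWatch alarm to automate it):

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the capacity of the stream by scaling from 2 to 4 shards.
# UNIFORM_SCALING lets Kinesis work out the shard splits and merges for you.
kinesis.update_shard_count(
    StreamName="my-data-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```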
During the resharding process a shard will go through a number of different states. Before resharding occurs the shard will be in the open state, so we can read from and write to that shard. After a resharding operation the shard will transition to the closed state, and we can no longer add records to that shard; any new records will be added to the new shards that have been created. After our stream's retention period has expired, for example after one day, that parent shard will no longer be accessible for reading or writing and it will not contain data. So it's very important, when you have a shard that has been closed, to make sure that it doesn't still have data left in it, if that data is important to you.

So this is how the process works. We start off with a shard, and it will be in the open state, so we can read from and write to that shard, not a problem. Then we go through the process of resharding by running the UpdateShardCount command. When that occurs, two new shards will be created, shard B and shard C there, and they will be in the open state, so that means they will be accepting records from producers. But our original shard, shard A, will now be closed, so we can no longer add any records to shard A, but we can still read records from shard A. So when we go through this process there may still be records inside shard A, and if those records are important to you, you need to continue to read from that closed parent until all of its records have been exhausted. And finally, after the retention period, be it one day or a maximum of seven days, whatever you have defined for your stream, shard A will actually no longer be accessible and you will have lost any data that was in shard A. So again, if that data is important to you, make sure that you get it before the retention period expires.
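As a hedged sketch of draining a closed parent shard with boto3: once a closed shard has been read to the end, GetRecords stops returning a next shard iterator, which is the exit condition for this loop (the stream name, shard ID and the process() handler are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Start at the oldest data still retained in the closed parent shard.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="my-data-stream",
    ShardId="shardId-000000000000",  # the closed parent shard (placeholder ID)
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Keep reading until the closed shard is exhausted; at that point Kinesis
# no longer returns a NextShardIterator for it.
while shard_iterator:
    response = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
    for record in response["Records"]:
        process(record)  # hypothetical handler for your application logic
    shard_iterator = response.get("NextShardIterator")
```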
Now, when we implement Kinesis data streams we need to first understand what capacity we need and how many shards we need for the Kinesis stream, and there is a very quick process we can go through to identify the number of shards required. First off, we need to identify our average record size, rounded up to the nearest one kilobyte. Next, we need to identify how many records per second will be going into our stream, and from there we can work out our incoming write bandwidth in kilobytes per second, which will be the average size of those records, rounded up to the nearest kilobyte, multiplied by the number of records per second. Next, we need to identify the number of Kinesis applications that will concurrently be consuming data from the output of that stream, and we'll call that the number of consumers, and from there we can identify our outgoing read bandwidth, which will simply be the number of consumer applications multiplied by our incoming write bandwidth. Now, if we remember from our previous slides, the capacity of a shard for writes is one megabyte per second and for reads is two megabytes per second, so the number of shards we require will be the maximum of either our incoming write bandwidth divided by 1,000 (being 1,000 kilobytes, or one megabyte) or our outgoing read bandwidth divided by 2,000.
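Here is that calculation as a small worked example in Python; the record size, record rate and consumer count are made-up numbers purely to show the arithmetic:

```python
import math

# Assumed workload figures, for illustration only.
average_record_size_kb = 2     # average record size rounded up to the nearest KB
records_per_second = 1500      # records arriving per second
number_of_consumers = 3        # applications reading the stream concurrently

# Incoming write bandwidth and outgoing read bandwidth, in KB per second.
incoming_write_bandwidth = average_record_size_kb * records_per_second    # 3,000 KB/s
outgoing_read_bandwidth = incoming_write_bandwidth * number_of_consumers  # 9,000 KB/s

# A shard supports 1 MB/s (1,000 KB/s) of writes and 2 MB/s (2,000 KB/s) of reads.
number_of_shards = math.ceil(max(
    incoming_write_bandwidth / 1000,   # 3 shards needed for writes
    outgoing_read_bandwidth / 2000,    # 4.5, so 5 shards needed for reads
))
print(number_of_shards)  # 5
```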
Kinesis Video Streams is another offering in the Kinesis service, and it streams live video from devices to the AWS cloud; and not only video, it can also stream non-video time-serialized data to the AWS cloud as well. It enables data encryption at rest, but it has a limitation in that it only supports the MKV video format.

Kinesis video streams are made up of fragments, and those fragments are self-contained sequences of video frames. We use a PutMedia command to write those fragments to the video stream; the Kinesis service will assign a fragment number and timestamps to our fragments for sequential organizing of those fragments in the video stream; and then we can use a GetMedia command to read those fragments from the stream at a specific start selector location, which we supply, and it will then return the fragments along with the timing information for those fragments. So here we can see there are a couple of applications we can use it with. First off there we've got a mobile device that can be streaming video data to Amazon Kinesis, and that video stream could be consumed by EC2 instances, either on a continuous basis or on a batch basis. In the second example there, we've got a CCTV camera on the public internet, and again it's producing data to the AWS Kinesis video stream, and from there it is being pushed out to the public internet with a batch or a continuous consumer application that could be running on someone's mobile device or on someone's server, but it is pushed out through to the public internet.
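The producer side normally goes through the producer libraries covered next, so this hedged boto3 sketch only shows the consuming side with GetMedia; the stream name and start selector are illustrative:

```python
import boto3

# GetMedia is served from a stream-specific endpoint, so first look that up.
kinesisvideo = boto3.client("kinesisvideo")
endpoint = kinesisvideo.get_data_endpoint(
    StreamName="my-video-stream",
    APIName="GET_MEDIA",
)["DataEndpoint"]

# Read fragments starting from the earliest fragment still in the stream.
media_client = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media_client.get_media(
    StreamName="my-video-stream",
    StartSelector={"StartSelectorType": "EARLIEST"},
)

# The payload is a streaming MKV blob containing the fragments
# plus their fragment numbers and timestamps.
mkv_chunk = response["Payload"].read(1024 * 1024)
```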
We have a number of tools that are available for Kinesis video streams, the first one being the producer client, which is for Java and Android applications. We also have the producer library, which is included within that Java and Android producer client. We have the C++ producer library, and we also have a parser library for reading MKV data from these video streams. Also, the AWS software development kits have a Kinesis Video Media class that can be used for reading the MKV data from a video stream, but it can only be used for getting media; it cannot be used for putting media into a stream.

Kinesis Data Firehose allows you to deliver your streaming data directly to a number of different destinations, including Amazon S3, Redshift, Elasticsearch and Splunk, so there is no need for you to write consumer applications to do that for you, or to manage the resources that are needed for that; you can simply set up a data firehose and it will direct that data through to that destination for you. Now, that data can also be transformed before it's delivered, using Lambda code that you supply, and if you do that, you can also back up the original source data to an Amazon S3 bucket. The incoming stream of data will be buffered before delivery, and you can define that in megabytes with a buffer size or in seconds of data with a buffer interval, and Firehose also supports server-side encryption. So here we can see we're going to be getting a stream of records coming into our Kinesis Firehose delivery stream, and then we can apply a Lambda function to transform those records, and that will go to our destination bucket; and if we apply a transformation we can also save our original source records to a backup S3 bucket, or if we're not transforming we can just send that data directly to the destination bucket. And remember, after being transformed it doesn't have to go to Amazon S3; it can go to another service, it can go to Redshift or Elasticsearch, it doesn't necessarily have to go to S3, but the backup will go to Amazon S3.
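As a small sketch with boto3, assuming a delivery stream with the placeholder name below already exists and points at an S3 destination:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Push a record into an existing delivery stream (name is a placeholder);
# Firehose buffers it by size/interval and delivers it to the configured
# destination, optionally transforming it with your Lambda function first.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"sensor": "xyz1", "reading": 21.5}) + "\n").encode("utf-8")},
)
```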
Kinesis Data Analytics will process and analyze streaming data using standard SQL statements, which is great because all of a sudden you can have analytics of live data and report on it instantaneously. It supports a number of different destinations, including Amazon Kinesis Data Firehose, so it can go to S3, Redshift or Elasticsearch, but it can also go to an AWS Lambda function, and it can also go to another Kinesis data stream as a destination. So here we can see we've got our inputs on the left there to our Amazon Kinesis analytics application. We have our streaming inputs, which can come from a Kinesis stream or from a Firehose delivery stream, and we can also have reference data located in an S3 bucket. Our application code, which will be our SQL statements, will process that and then deliver it out to an application output stream, and it will also have an application error stream, and those will be fed to our application output, which will be either another Amazon Kinesis stream or a Firehose delivery stream, which will then deliver that out to another destination such as an Amazon S3 bucket, Redshift or Elasticsearch. And so that brings us to the end of quite a big lecture; I hope you've learned a lot, and I look forward to seeing you in the next one.