1 00:00:00,710 --> 00:00:01,800 Welcome back. 2 00:00:01,800 --> 00:00:04,230 In this lesson, I'm going to be talking to you 3 00:00:04,230 --> 00:00:07,990 about how we design and implement incremental data loads. 4 00:00:07,990 --> 00:00:09,103 So let's get started. 5 00:00:10,060 --> 00:00:13,780 We'll start by digging deeper into Azure Data Factory. 6 00:00:13,780 --> 00:00:14,700 And in this lesson, 7 00:00:14,700 --> 00:00:15,800 we're going to be discussing 8 00:00:15,800 --> 00:00:19,030 what incremental data loading is. 9 00:00:19,030 --> 00:00:20,470 We're going to outline the importance 10 00:00:20,470 --> 00:00:23,225 of incremental data loading and why you should use it. 11 00:00:23,225 --> 00:00:26,940 We're going to talk about the concept of watermarks 12 00:00:26,940 --> 00:00:28,363 and why that's important. 13 00:00:29,440 --> 00:00:31,430 We're going to tie all of this together 14 00:00:31,430 --> 00:00:35,330 by examining incremental data loads in Data Factory. 15 00:00:35,330 --> 00:00:36,163 And then, of course, 16 00:00:36,163 --> 00:00:38,963 we're going to see this in action in the Azure portal. 17 00:00:41,380 --> 00:00:45,120 So let's start with the very first tip. 18 00:00:45,120 --> 00:00:46,870 Don't do this. 19 00:00:46,870 --> 00:00:49,015 This is a crazy car. 20 00:00:49,015 --> 00:00:51,490 We'll talk a little bit more about what this means, 21 00:00:51,490 --> 00:00:54,470 but let's start talking about full data loading 22 00:00:54,470 --> 00:00:56,930 versus incremental data loading. 23 00:00:56,930 --> 00:01:01,930 So full data loading is a dump of the entire dataset. 24 00:01:02,320 --> 00:01:04,910 So if we're going to load something new in, 25 00:01:04,910 --> 00:01:07,970 we're going to take our current dataset, dump it entirely, 26 00:01:07,970 --> 00:01:10,430 and we're going to completely replace it. 27 00:01:10,430 --> 00:01:11,263 That's it. 28 00:01:11,263 --> 00:01:13,280 There's no additional requirements. 29 00:01:13,280 --> 00:01:16,673 Dump what you have, completely start over, and rebuild. 30 00:01:17,790 --> 00:01:19,730 Incremental data loading on the other hand 31 00:01:19,730 --> 00:01:22,230 is more like this car over here on the right, 32 00:01:22,230 --> 00:01:23,950 don't dump anything. 33 00:01:23,950 --> 00:01:26,230 We're going to keep everything that we have, 34 00:01:26,230 --> 00:01:27,610 and we're going to look at the new stuff 35 00:01:27,610 --> 00:01:30,080 in comparison with what we currently have. 36 00:01:30,080 --> 00:01:33,140 And we're only going to add the difference. 37 00:01:33,140 --> 00:01:34,430 That's what we're going to be doing. 38 00:01:34,430 --> 00:01:37,350 So we're just going to keep piling more stuff on, 39 00:01:37,350 --> 00:01:40,480 but we're not going to replace all the stuff that we had. 40 00:01:40,480 --> 00:01:43,803 So that in a nutshell is what incremental data loading is. 41 00:01:45,410 --> 00:01:47,700 So why should we use incremental data loading 42 00:01:47,700 --> 00:01:49,603 and why shouldn't we use it? 43 00:01:50,610 --> 00:01:54,070 First, we should use it because it's faster. 44 00:01:54,070 --> 00:01:55,750 It's much, much faster. 45 00:01:55,750 --> 00:01:58,550 So think about just dumping everything and starting over 46 00:01:58,550 --> 00:02:01,210 every single time we have that full load 47 00:02:01,210 --> 00:02:03,530 versus looking at only adding a few rows 48 00:02:03,530 --> 00:02:06,833 or a few percent of the total database size. 49 00:02:08,100 --> 00:02:11,330 So why shouldn't we use incremental data loading? 50 00:02:11,330 --> 00:02:13,560 Well, it's more complicated. 51 00:02:13,560 --> 00:02:16,231 First off, it's more likely to create errors. 52 00:02:16,231 --> 00:02:18,925 Because we're only loading part of a database, 53 00:02:18,925 --> 00:02:22,660 there's a much greater chance that somewhere along the line, 54 00:02:22,660 --> 00:02:25,390 we don't capture all of the new data, or we cross over 55 00:02:25,390 --> 00:02:29,001 and we get duplicates, or it crashes in the middle. 56 00:02:29,001 --> 00:02:31,448 And if any of those things happen, 57 00:02:31,448 --> 00:02:34,610 it's very likely that we can't rerun 58 00:02:34,610 --> 00:02:37,120 the incremental data load with errors. 59 00:02:37,120 --> 00:02:39,870 So consider that if that happens 60 00:02:39,870 --> 00:02:40,703 and it crashes in the middle, 61 00:02:40,703 --> 00:02:43,180 there's a good chance you might have to just dump everything 62 00:02:43,180 --> 00:02:45,318 and start back over with a complete rebuild. 63 00:02:45,318 --> 00:02:48,620 And then finally, the timing can be troublesome 64 00:02:48,620 --> 00:02:51,558 because we have to look at exactly what we have 65 00:02:51,558 --> 00:02:54,136 and then the new stuff, figure out where that ends, 66 00:02:54,136 --> 00:02:58,325 and load that new data load in. 67 00:02:58,325 --> 00:03:01,450 Depending upon how fast the data is moving, 68 00:03:01,450 --> 00:03:03,888 depending upon how accurate we are with our timing 69 00:03:03,888 --> 00:03:07,860 and our timestamps, we could have issues there, 70 00:03:07,860 --> 00:03:10,960 which again, timing is more troublesome, of course, 71 00:03:10,960 --> 00:03:13,310 over just dumping everything and starting over. 72 00:03:15,050 --> 00:03:18,820 So now let's transition and talk about watermarks. 73 00:03:18,820 --> 00:03:21,730 And when we look at watermarks, 74 00:03:21,730 --> 00:03:24,080 a watermark is just simply a way 75 00:03:24,080 --> 00:03:27,760 that we can mark as our database changes. 76 00:03:27,760 --> 00:03:32,760 So it could be a timestamp, it could be an ID column. 77 00:03:33,220 --> 00:03:37,120 So we could look at the number of ID and as it increases, 78 00:03:37,120 --> 00:03:40,150 we could use that as a reference point or a watermark 79 00:03:40,150 --> 00:03:43,870 to determine what's changed for our incremental data loads, 80 00:03:43,870 --> 00:03:45,480 or it could be something like a timestamp. 81 00:03:45,480 --> 00:03:48,527 So every day we get a new entry, 82 00:03:48,527 --> 00:03:51,330 and we could use that as a watermark 83 00:03:51,330 --> 00:03:55,630 to determine what's changed for our incremental data loads. 84 00:03:55,630 --> 00:03:58,950 So in this lesson, I'm actually going to show you 85 00:03:58,950 --> 00:04:01,600 2 different ways to do incremental data loads. 86 00:04:01,600 --> 00:04:05,020 The first way is this way, the watermark way. 87 00:04:05,020 --> 00:04:05,853 And then the second one, 88 00:04:05,853 --> 00:04:09,350 I'm actually going to show you in Azure Data Factory 89 00:04:09,350 --> 00:04:13,140 how we can do incremental data loading using dates. 90 00:04:13,140 --> 00:04:16,100 So just keep that in mind, and keep in mind the watermark, 91 00:04:16,100 --> 00:04:18,400 and we'll talk about that more in this course, 92 00:04:18,400 --> 00:04:21,050 as we go through several different Azure services, 93 00:04:21,050 --> 00:04:24,850 but a watermark is just simply a reference point 94 00:04:24,850 --> 00:04:29,060 that we can use to determine what's changing with our data. 95 00:04:29,060 --> 00:04:30,370 So in this example, 96 00:04:30,370 --> 00:04:33,310 we would start off by selecting a watermark. 97 00:04:33,310 --> 00:04:36,030 So we would go into our SQL database 98 00:04:36,030 --> 00:04:37,850 or whatever database we're using, 99 00:04:37,850 --> 00:04:40,550 and we would select our watermark column. 100 00:04:40,550 --> 00:04:43,311 For instance, in this case, we could say it's customer ID. 101 00:04:43,311 --> 00:04:46,710 Then we would store that watermark value. 102 00:04:46,710 --> 00:04:49,285 And we would use that stored watermark value 103 00:04:49,285 --> 00:04:54,285 to figure out what's changed when we run our pipelines. 104 00:04:54,440 --> 00:04:55,273 Okay. 105 00:04:55,273 --> 00:04:58,077 So the process in Azure Data Factory would be, 106 00:04:58,077 --> 00:05:01,820 let's create 2 lookup activities. 107 00:05:01,820 --> 00:05:05,870 So we would have our last watermark over here on the left, 108 00:05:05,870 --> 00:05:08,700 and then we would have our new watermark over here 109 00:05:08,700 --> 00:05:10,230 on the left as well. 110 00:05:10,230 --> 00:05:12,890 So the last watermark would pull up the last time 111 00:05:12,890 --> 00:05:14,880 that we did an incremental data load, 112 00:05:14,880 --> 00:05:16,890 and then the new watermark would look 113 00:05:16,890 --> 00:05:20,850 and it would pull the most recent watermark that we have. 114 00:05:20,850 --> 00:05:24,750 From that, we would create a copy activity, 115 00:05:24,750 --> 00:05:27,120 and the copy activity would just simply look 116 00:05:27,120 --> 00:05:29,850 at the difference between the last and the new watermark. 117 00:05:29,850 --> 00:05:33,280 And it would load all of the data between those watermarks. 118 00:05:33,280 --> 00:05:36,871 And then finally, we would create a stored procedure. 119 00:05:36,871 --> 00:05:38,242 I talked about storing. 120 00:05:38,242 --> 00:05:40,440 We would create a stored procedure activity 121 00:05:40,440 --> 00:05:44,111 that would update our new, new watermark, right? 122 00:05:44,111 --> 00:05:46,138 So we had our old one, which is the last. 123 00:05:46,138 --> 00:05:47,423 We did some updates, 124 00:05:47,423 --> 00:05:50,350 and now we want to store our new watermark. 125 00:05:50,350 --> 00:05:51,980 So we would store that new watermark 126 00:05:51,980 --> 00:05:53,083 so the next time we ran it, 127 00:05:53,083 --> 00:05:55,610 we would be able to move forward 128 00:05:55,610 --> 00:05:57,030 and figure out what had been changed 129 00:05:57,030 --> 00:05:59,370 since this most current run. 130 00:05:59,370 --> 00:06:02,223 So that is one way that we can do incremental data loads. 131 00:06:02,223 --> 00:06:04,203 It's not wildly important 132 00:06:04,203 --> 00:06:06,560 that you understand all of the intricacies 133 00:06:06,560 --> 00:06:09,244 about how to create that copy activity at this point. 134 00:06:09,244 --> 00:06:12,252 It's more important that you understand what a watermark is, 135 00:06:12,252 --> 00:06:15,283 and you understand this process, 136 00:06:15,283 --> 00:06:17,580 that this is a possible way 137 00:06:17,580 --> 00:06:19,973 that we could do incremental data loads. 138 00:06:24,930 --> 00:06:28,590 So with that, we are actually going to pause this lesson 139 00:06:28,590 --> 00:06:31,387 because we are only about halfway through this journey. 140 00:06:31,387 --> 00:06:35,050 And then in the next part, I'm going to pick up 141 00:06:35,050 --> 00:06:37,350 and I'm going to walk you through the Azure portal 142 00:06:37,350 --> 00:06:41,790 to show you an example of how you can do an incremental load 143 00:06:41,790 --> 00:06:43,161 in the Azure portal. 144 00:06:43,161 --> 00:06:47,453 So take a break and I will see you in the next lesson.