1 00:00:00,700 --> 00:00:01,690 Welcome back. 2 00:00:02,120 --> 00:00:05,050 In this video, we'll embed each chunk of 3 00:00:05,060 --> 00:00:09,710 text into numeric vectors and insert them into a Pinecone index. 4 00:00:10,360 --> 00:00:15,410 I'm importing the necessary libraries and initializing the Pinecone client. 5 00:00:16,480 --> 00:00:27,420 Import pinecone, and from langchain_community.vectorstores import the Pinecone class 6 00:00:30,220 --> 00:00:32,950 and I'm initializing the client. 7 00:00:35,840 --> 00:00:40,470 Pay attention, there are two classes 8 00:00:40,480 --> 00:00:46,250 named Pinecone, one in the pinecone module and another one in 9 00:00:46,260 --> 00:00:48,490 langchain_community.vectorstores. 10 00:00:51,010 --> 00:00:53,680 The next step is to create a Pinecone index. 11 00:00:54,850 --> 00:00:58,300 You can also use an existing index if you want. 12 00:00:59,170 --> 00:01:01,400 Note that if you are currently using the 13 00:01:01,410 --> 00:01:06,980 free plan, you are limited to one index and one project, and if you already have 14 00:01:06,990 --> 00:01:10,360 one, you will get an error if you try to create a second one. 15 00:01:11,170 --> 00:01:16,260 So it's a good idea to delete all existing indexes before creating a new one. 16 00:01:16,890 --> 00:01:30,600 For i in pc.list_indexes().names(), pc.delete_index(i). 17 00:01:34,930 --> 00:01:40,180 And it's always a good idea to display a message for the user. 18 00:01:41,630 --> 00:01:49,450 Print Deleting all indexes, and then print Done. 19 00:01:51,810 --> 00:01:54,260 I'm also adding end equals empty string. 20 00:01:59,090 --> 00:01:59,300 Good. 21 00:01:59,950 --> 00:02:02,420 I'm creating a new index for these 22 00:02:02,430 --> 00:02:04,980 embeddings called ChurchillSpeech. 23 00:02:06,970 --> 00:02:10,320 index_name equals ChurchillSpeech. 
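The cleanup step dictated above can be sketched as a small helper. This is a minimal sketch, not the exact code typed in the video: the function name `delete_all_indexes` is hypothetical, and the client wiring in the trailing comments assumes the `pinecone` package (version 3 or later) with the API key in the `PINECONE_API_KEY` environment variable.

```python
def delete_all_indexes(pc):
    """Delete every index visible to the given Pinecone client.

    `pc` is expected to behave like pinecone.Pinecone: it must offer
    list_indexes().names() and delete_index(name).
    Returns the list of deleted index names.
    """
    # Copy the names first so we never iterate over a changing collection.
    names = list(pc.list_indexes().names())
    print("Deleting all indexes ... ", end="")
    for name in names:
        pc.delete_index(name)
    print("Done")
    return names


# Wiring with a real client (needs the pinecone package and an API key):
# import pinecone
# pc = pinecone.Pinecone()  # reads PINECONE_API_KEY from the environment
# delete_all_indexes(pc)
```

Deleting indexes is destructive, so this only makes sense on a throwaway or free-plan project like the one used in this video.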
24 00:02:16,420 --> 00:02:28,820 If index_name not in pc.list_indexes().names(), if the index does not exist, I will 25 00:02:28,830 --> 00:02:29,320 create it. 26 00:02:32,490 --> 00:02:39,020 Print Creating index and index_name in curly braces. 27 00:02:45,300 --> 00:02:56,860 And pc.create_index and the arguments are name equals index_name, dimension equals 28 00:02:56,870 --> 00:03:19,700 1536, metric equals cosine and the spec equals pinecone.PodSpec of environment 29 00:03:19,710 --> 00:03:23,360 equals gcp-starter. 30 00:03:28,120 --> 00:03:31,630 After creating the index, I am printing Done. 31 00:03:33,200 --> 00:03:34,410 I am running the code. 32 00:03:36,520 --> 00:03:39,870 Note that index names must consist of 33 00:03:39,880 --> 00:03:46,090 lowercase alphanumeric characters or dashes and must start and end with an 34 00:03:46,100 --> 00:03:47,170 alphanumeric character. 35 00:03:49,500 --> 00:03:52,590 Next, we will upload the vectors to 36 00:03:52,600 --> 00:03:54,090 Pinecone using LangChain. 37 00:03:55,060 --> 00:03:57,690 We will call Pinecone.from_documents. 38 00:04:00,500 --> 00:04:02,330 Pinecone.from_documents. 39 00:04:03,960 --> 00:04:07,110 This method takes three arguments, 40 00:04:07,960 --> 00:04:14,040 chunks, embeddings and the index_name. 41 00:04:17,600 --> 00:04:21,250 Chunks is a list of text documents that 42 00:04:21,260 --> 00:04:25,890 have been obtained by splitting the text with RecursiveCharacterTextSplitter. 43 00:04:26,320 --> 00:04:31,850 These smaller chunks will be indexed in Pinecone to make it easier to search and 44 00:04:31,860 --> 00:04:35,210 retrieve relevant information later on. 45 00:04:35,920 --> 00:04:40,110 Embeddings is an instance of the OpenAIEmbeddings 46 00:04:40,120 --> 00:04:48,050 class, which is responsible for converting text data into embeddings 47 00:04:48,060 --> 00:04:50,690 using OpenAI's embedding model. 
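The create-if-missing step and the naming rule stated above can be sketched as follows. The helper names `is_valid_index_name` and `ensure_index` are hypothetical, the regular expression encodes only the rule as spoken in the video (other limits may apply), and the index name `churchill-speech` in the comments is an assumed spelling of the name heard as ChurchillSpeech.

```python
import re

# Rule from the video: lowercase alphanumerics or dashes,
# starting and ending with an alphanumeric character.
_INDEX_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_index_name(name):
    """Return True if `name` satisfies the naming rule above."""
    return _INDEX_NAME_RE.fullmatch(name) is not None


def ensure_index(pc, index_name, spec, dimension=1536, metric="cosine"):
    """Create the index only if it does not exist yet.

    `pc` is expected to behave like pinecone.Pinecone.
    Returns True if an index was created, False if it already existed.
    """
    if not is_valid_index_name(index_name):
        raise ValueError(f"invalid index name: {index_name!r}")
    if index_name in pc.list_indexes().names():
        return False
    print(f"Creating index {index_name} ... ", end="")
    pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=spec)
    print("Done")
    return True


# Wiring with a real client, using the spec mentioned in the video:
# import pinecone
# pc = pinecone.Pinecone()
# ensure_index(pc, "churchill-speech", pinecone.PodSpec(environment="gcp-starter"))
```

The dimension of 1536 has to match the length of the vectors produced by the embedding model, otherwise the later upload will be rejected.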
48 00:04:51,360 --> 00:04:54,970 These embeddings will be stored in the 49 00:04:54,980 --> 00:04:57,750 Pinecone index and used for similarity search. 50 00:04:58,460 --> 00:05:00,950 And the index_name is a string 51 00:05:00,960 --> 00:05:04,270 representing the name of the Pinecone index. 52 00:05:04,840 --> 00:05:07,790 This name is used to identify the index 53 00:05:07,800 --> 00:05:11,490 in Pinecone's database, and the index must already exist. 54 00:05:12,380 --> 00:05:15,610 We have defined all these objects earlier. 55 00:05:18,030 --> 00:05:24,110 Chunks, embeddings and index_name. 56 00:05:25,340 --> 00:05:29,710 The Pinecone.from_documents method also 57 00:05:29,720 --> 00:05:34,670 returns a vector store object initialized from documents and embeddings. 58 00:05:37,600 --> 00:05:39,190 vector_store equals. 59 00:05:41,060 --> 00:05:44,690 In a nutshell, this method processes the 60 00:05:44,700 --> 00:05:50,930 input documents, generates embeddings using the provided OpenAIEmbeddings 61 00:05:50,940 --> 00:05:54,430 instance and returns a new Pinecone vector store. 62 00:05:54,960 --> 00:05:58,630 The resulting vector store object can 63 00:05:58,640 --> 00:06:03,810 perform similarity searches and retrieve relevant documents based on user queries. 64 00:06:05,720 --> 00:06:06,410 I am running it. 65 00:06:09,330 --> 00:06:12,620 OK, if you get this error, use a keyword 66 00:06:12,630 --> 00:06:14,160 argument for index_name. 67 00:06:14,510 --> 00:06:16,720 So, index_name equals index_name. 68 00:06:17,010 --> 00:06:20,420 I am running it again and there are no errors. 69 00:06:21,430 --> 00:06:23,960 Let's take a look at the Pinecone dashboard. 70 00:06:24,390 --> 00:06:29,780 We can see that the index contains 300 vectors. 71 00:06:31,670 --> 00:06:35,820 These vectors were inserted by this code. 72 00:06:36,730 --> 00:06:43,920 The 300 vectors are the numeric representations of the 300 chunks of text 73 00:06:43,930 --> 00:06:44,940 we saw earlier. 
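To make the idea of "one vector per chunk, then similarity search" concrete, here is a toy in-memory sketch. It is not how Pinecone or LangChain are implemented: `toy_embed`, `cosine` and `ToyVectorStore` are hypothetical stand-ins, and the letter-count "embedding" only imitates what a real embedding model does. The actual call from the video is shown in the trailing comment, with index_name passed as a keyword argument.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: 26 letter counts."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory imitation of a vector store, for illustration only."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # one (vector, text) pair per chunk

    @classmethod
    def from_documents(cls, chunks, embed):
        store = cls(embed)
        for text in chunks:
            store.rows.append((embed(text), text))
        return store

    def similarity_search(self, query, k=1):
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# The real upload from the video, with the keyword argument that avoids the error:
# vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
```

With 300 chunks, the toy store would hold 300 rows, which mirrors the 300 vectors shown in the Pinecone dashboard.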
74 00:06:46,750 --> 00:06:47,440 Take a look here. 75 00:06:47,930 --> 00:06:50,520 We have 300 chunks of text. 76 00:06:51,870 --> 00:06:54,940 If you want to use a specific 77 00:06:54,950 --> 00:07:00,020 embedding model, add the model argument to OpenAIEmbeddings. 78 00:07:00,710 --> 00:07:07,720 For example, the recommended embedding model is text-embedding-3-small or text-embedding-3-large. 79 00:07:13,890 --> 00:07:20,580 With this, we have successfully embedded the text into vectors and inserted them 80 00:07:20,590 --> 00:07:22,020 into a Pinecone index. 81 00:07:22,810 --> 00:07:26,280 You will do this only once, when you split the 82 00:07:26,290 --> 00:07:30,800 document into chunks and embed those chunks into Pinecone. 83 00:07:31,430 --> 00:07:37,720 Once you have populated your index with your embeddings, you will just query the index. 84 00:07:38,830 --> 00:07:45,300 If you call Pinecone.from_documents again, it will insert the vectors again, 85 00:07:45,310 --> 00:07:47,280 creating duplicate entries. 86 00:07:48,610 --> 00:07:51,580 To load the vector store from an existing 87 00:07:51,590 --> 00:08:04,480 index, do vector_store equals Pinecone.from_existing_index and the arguments are 88 00:08:04,490 --> 00:08:20,940 index_name equals the name of the index, ChurchillSpeech, and embedding equals embeddings. 89 00:08:27,060 --> 00:08:31,090 This is loading the vector store from an existing index. 90 00:08:37,490 --> 00:08:37,880 Very well. 91 00:08:38,410 --> 00:08:39,600 Now, let's take a break. 92 00:08:40,070 --> 00:08:44,800 In the next video, we will demonstrate how to use similarity search to ask 93 00:08:44,810 --> 00:08:48,200 questions about the content of this custom document.
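The "upload once, then only load" advice above can be sketched as a single helper that picks the right call. The function name `get_vector_store` and the `already_populated` flag are hypothetical; `store_cls` is expected to behave like the Pinecone class from langchain_community.vectorstores, and the wiring in the comments assumes the `langchain_openai` package.

```python
def get_vector_store(store_cls, embeddings, index_name, chunks, already_populated):
    """Return a vector store without inserting the same chunks twice.

    If the index already holds the embeddings, attach to it; otherwise
    embed and upload the chunks (this should happen only once).
    """
    if already_populated:
        # Load only: no new vectors are written to the index.
        return store_cls.from_existing_index(
            index_name=index_name, embedding=embeddings
        )
    # First run: embed every chunk and insert the vectors.
    return store_cls.from_documents(chunks, embeddings, index_name=index_name)


# Wiring with the real classes (needs API keys for OpenAI and Pinecone):
# from langchain_community.vectorstores import Pinecone
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# vector_store = get_vector_store(
#     Pinecone, embeddings, index_name, chunks, already_populated=True
# )
```

One design point to keep in mind: text-embedding-3-small produces 1536-dimensional vectors by default, which matches the index created earlier, while text-embedding-3-large defaults to 3072 dimensions and would need an index created with that dimension.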