1
00:00:00,700 --> 00:00:01,690
Welcome back.

2
00:00:02,120 --> 00:00:05,050
In this video, we'll embed each chunk of

3
00:00:05,060 --> 00:00:09,710
text into numeric vectors and insert them
into a Pinecone index.

4
00:00:10,360 --> 00:00:15,410
I'm importing the necessary libraries and
initializing the Pinecone client.

5
00:00:16,480 --> 00:00:27,420
import pinecone, and from langchain_community.vectorstores
import the Pinecone class,

6
00:00:30,220 --> 00:00:32,950
and I'm initializing the client.

7
00:00:35,840 --> 00:00:40,470
Pay attention, there are two classes

8
00:00:40,480 --> 00:00:46,250
named Pinecone: one in the pinecone
module and another one in

9
00:00:46,260 --> 00:00:48,490
langchain_community.vectorstores.

10
00:00:51,010 --> 00:00:53,680
The next step is to create a Pinecone index.

11
00:00:54,850 --> 00:00:58,300
You can also use an existing index if you want.

12
00:00:59,170 --> 00:01:01,400
Note that if you are currently using the

13
00:01:01,410 --> 00:01:06,980
free plan, you are limited to one index
and one project, and if you already have

14
00:01:06,990 --> 00:01:10,360
one, you will get an error if you try to
create a second one.

15
00:01:11,170 --> 00:01:16,260
So it's a good idea to delete all
existing indexes before creating a new one.

16
00:01:16,890 --> 00:01:30,600
for i in pc.list_indexes().names():
pc.delete_index(i).

17
00:01:34,930 --> 00:01:40,180
And it's always a good idea to display a
message for the user.

18
00:01:41,630 --> 00:01:49,450
Print "Deleting all indexes ..." and then print "Done".

19
00:01:51,810 --> 00:01:54,260
I'm also adding end='' so that "Done" appears on the same line.
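
As a rough sketch, the cleanup loop can be wrapped in a small helper. The function name delete_all_indexes is my own and is not part of the Pinecone library; it assumes pc is a client object that provides list_indexes().names() and delete_index(), as used in this video.

```python
def delete_all_indexes(pc):
    """Delete every index visible to the given client and return their names.

    `pc` is assumed to be an initialized Pinecone client, for example
    `pinecone.Pinecone()` with PINECONE_API_KEY set in the environment.
    """
    print("Deleting all indexes ... ", end="")
    deleted = []
    # Copy the names first so we never iterate over a changing collection.
    for name in list(pc.list_indexes().names()):
        pc.delete_index(name)
        deleted.append(name)
    print("Done")
    return deleted
```

Be careful with this on a shared account: it removes every index, not only the ones created in this course.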

20
00:01:59,090 --> 00:01:59,300
Good.

21
00:01:59,950 --> 00:02:02,420
I'm creating a new index for these

22
00:02:02,430 --> 00:02:04,980
embeddings called churchill-speech.

23
00:02:06,970 --> 00:02:10,320
index_name equals 'churchill-speech'.

24
00:02:16,420 --> 00:02:28,820
if index_name not in pc.list_indexes().names():
that is, if the index does not exist, I will

25
00:02:28,830 --> 00:02:29,320
create it.

26
00:02:32,490 --> 00:02:39,020
Print an f-string: "Creating index" and index_name in curly braces.

27
00:02:45,300 --> 00:02:56,860
Then pc.create_index, and the arguments are
name equals index_name, dimension equals

28
00:02:56,870 --> 00:03:19,700
1536, metric equals 'cosine', and spec
equals pinecone.PodSpec of environment

29
00:03:19,710 --> 00:03:23,360
equals 'gcp-starter'.

30
00:03:28,120 --> 00:03:31,630
After creating the index, I am printing done.
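
Put together, the create-if-missing step might look like the sketch below. The helper name ensure_index is hypothetical; the spec is passed in so the function itself does not need to import pinecone, and the defaults mirror the values dictated above.

```python
def ensure_index(pc, index_name, spec, dimension=1536, metric="cosine"):
    """Create the index only if it does not exist yet.

    Returns True if the index was created, False if it was already there.
    `pc` is assumed to be an initialized Pinecone client.
    """
    if index_name in pc.list_indexes().names():
        print(f"Index {index_name} already exists.")
        return False
    print(f"Creating index {index_name} ... ", end="")
    pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=spec)
    print("Done")
    return True

# Usage with the real client (assumes the pinecone package, v3 or later,
# and PINECONE_API_KEY set in the environment):
#   import pinecone
#   pc = pinecone.Pinecone()
#   ensure_index(pc, index_name, pinecone.PodSpec(environment="gcp-starter"))
```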

31
00:03:33,200 --> 00:03:34,410
I am running the code.

32
00:03:36,520 --> 00:03:39,870
Note that index names must consist of

33
00:03:39,880 --> 00:03:46,090
lowercase alphanumeric characters or
dashes and must start and end with an

34
00:03:46,100 --> 00:03:47,170
alphanumeric character.
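
If you want to catch a bad name before calling the API, the rule just stated can be checked locally. This is only a sketch of that one rule; is_valid_index_name is a made-up helper, and Pinecone may enforce further limits (for example on length) that are not checked here.

```python
import re

# One lowercase letter or digit, optionally followed by any mix of lowercase
# letters, digits, and dashes that ends in a letter or digit.
_INDEX_NAME = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")

def is_valid_index_name(name: str) -> bool:
    """Return True if `name` follows the naming rule described above."""
    return _INDEX_NAME.fullmatch(name) is not None
```

With this check, a name such as my-index passes, while MyIndex, -my-index, and my-index- are rejected.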

35
00:03:49,500 --> 00:03:52,590
Next, we will upload the vectors to

36
00:03:52,600 --> 00:03:54,090
Pinecone using LangChain.

37
00:03:55,060 --> 00:03:57,690
We will call Pinecone.from_documents.

38
00:04:00,500 --> 00:04:02,330
Pinecone.from_documents.

39
00:04:03,960 --> 00:04:07,110
This method takes three arguments,

40
00:04:07,960 --> 00:04:14,040
chunks, embeddings, and index_name.

41
00:04:17,600 --> 00:04:21,250
Chunks is a list of text documents that

42
00:04:21,260 --> 00:04:25,890
have been obtained by splitting the text with
RecursiveCharacterTextSplitter.

43
00:04:26,320 --> 00:04:31,850
These smaller chunks will be indexed in
Pinecone to make it easier to search and

44
00:04:31,860 --> 00:04:35,210
retrieve relevant information later on.

45
00:04:35,920 --> 00:04:40,110
embeddings is an instance of the

46
00:04:40,120 --> 00:04:48,050
OpenAIEmbeddings class, which is responsible for
converting text data into embeddings

47
00:04:48,060 --> 00:04:50,690
using OpenAI's embedding model.

48
00:04:51,360 --> 00:04:54,970
These embeddings will be stored in the

49
00:04:54,980 --> 00:04:57,750
Pinecone index and used for similarity search.
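
To make "similarity search" a bit more concrete: the index was created with metric equals 'cosine', so vectors are compared by the cosine of the angle between them. The plain-Python sketch below only illustrates the formula on tiny vectors; Pinecone computes this on its side, and real embeddings here have 1536 components.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1, and perpendicular vectors score 0, regardless of their lengths.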

50
00:04:58,460 --> 00:05:00,950
And index_name is a string

51
00:05:00,960 --> 00:05:04,270
representing the name of the Pinecone index.

52
00:05:04,840 --> 00:05:07,790
This name is used to identify the index

53
00:05:07,800 --> 00:05:11,490
in Pinecone's database, and the index must already exist.

54
00:05:12,380 --> 00:05:15,610
We have defined all these objects earlier.

55
00:05:18,030 --> 00:05:24,110
chunks, embeddings, and index_name.

56
00:05:25,340 --> 00:05:29,710
The Pinecone.from_documents method also

57
00:05:29,720 --> 00:05:34,670
returns a vector store object initialized
from documents and embeddings.

58
00:05:37,600 --> 00:05:39,190
vector_store equals.

59
00:05:41,060 --> 00:05:44,690
In a nutshell, this method processes the

60
00:05:44,700 --> 00:05:50,930
input documents, generates embeddings
using the provided OpenAIEmbeddings

61
00:05:50,940 --> 00:05:54,430
instance, inserts them into the index, and returns a new Pinecone vector store.

62
00:05:54,960 --> 00:05:58,630
The resulting vector store object can

63
00:05:58,640 --> 00:06:03,810
perform similarity searches and retrieve
relevant documents based on user queries.

64
00:06:05,720 --> 00:06:06,410
I am running it.

65
00:06:09,330 --> 00:06:12,620
Ok, if you get this error, use a keyword

66
00:06:12,630 --> 00:06:14,160
argument for index_name.

67
00:06:14,510 --> 00:06:16,720
So, index_name equals index_name.

68
00:06:17,010 --> 00:06:20,420
I am running it again and there are no errors.

69
00:06:21,430 --> 00:06:23,960
Let's take a look at the Pinecone dashboard.

70
00:06:24,390 --> 00:06:29,780
We can see that the index contains 300 vectors.

71
00:06:31,670 --> 00:06:35,820
These vectors were inserted by this code.

72
00:06:36,730 --> 00:06:43,920
The 300 vectors are the numeric
representations of the 300 chunks of text

73
00:06:43,930 --> 00:06:44,940
we saw earlier.

74
00:06:46,750 --> 00:06:47,440
Take a look here.

75
00:06:47,930 --> 00:06:50,520
We have 300 chunks of text.

76
00:06:51,870 --> 00:06:54,940
If you want to use a specific

77
00:06:54,950 --> 00:07:00,020
embedding model, add the model argument
to OpenAIEmbeddings.

78
00:07:00,710 --> 00:07:07,720
For example, the recommended embedding
model is text-embedding-3-small or text-embedding-3-large.
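
One thing to watch when you switch models: the index dimension has to match the size of the vectors the model produces. The sketch below lists the default output sizes as I know them for three OpenAI models; the table and the check_dimension helper are my own illustration and not part of any library.

```python
# Default embedding sizes for some OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def check_dimension(model: str, index_dimension: int) -> bool:
    """Return True if the model's default vector size fits the index."""
    return EMBEDDING_DIMENSIONS[model] == index_dimension

# The model itself would be selected like this (assumes langchain_openai):
#   embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```

So with the 1536-dimension index created earlier, text-embedding-3-small fits as is, while text-embedding-3-large at its default size would need an index with dimension 3072.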

79
00:07:13,890 --> 00:07:20,580
With this, we have successfully embedded
the text into vectors and inserted them

80
00:07:20,590 --> 00:07:22,020
into a Pinecone index.

81
00:07:22,810 --> 00:07:26,280
You will do this only once: when you split the

82
00:07:26,290 --> 00:07:30,800
document into chunks and embed those
chunks into Pinecone.

83
00:07:31,430 --> 00:07:37,720
Once you have populated your index with your
embeddings, you will just query the index.

84
00:07:38,830 --> 00:07:45,300
If you call Pinecone.from_documents
again, it will insert the vectors again

85
00:07:45,310 --> 00:07:47,280
and create duplicate entries.

86
00:07:48,610 --> 00:07:51,580
To load the vector store from an existing

87
00:07:51,590 --> 00:08:04,480
index, write vector_store equals
Pinecone.from_existing_index, and the arguments are

88
00:08:04,490 --> 00:08:20,940
index_name equals the name of the index,
'churchill-speech', and embedding equals embeddings.

89
00:08:27,060 --> 00:08:31,090
This is loading the vector store from an
existing index.
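
One possible way to avoid inserting the same vectors twice is to look at the index statistics first and only call from_documents when the index is empty. This is a sketch under assumptions: get_vector_store is a hypothetical helper, store_cls stands for the Pinecone class from langchain_community.vectorstores, and pc is the Pinecone client. The reported count can lag briefly after an insert, so treat it as a guard, not a guarantee.

```python
def get_vector_store(store_cls, pc, chunks, embeddings, index_name):
    """Load the vector store if the index has vectors, otherwise populate it."""
    stats = pc.Index(index_name).describe_index_stats()
    if stats.total_vector_count > 0:
        # The index is already populated: just connect to it.
        return store_cls.from_existing_index(
            index_name=index_name, embedding=embeddings
        )
    # Empty index: embed the chunks and insert them once.
    return store_cls.from_documents(chunks, embeddings, index_name=index_name)
```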

90
00:08:37,490 --> 00:08:37,880
Very well.

91
00:08:38,410 --> 00:08:39,600
Now, let's take a break.

92
00:08:40,070 --> 00:08:44,800
In the next video, we will demonstrate
how to use similarity search to ask

93
00:08:44,810 --> 00:08:48,200
questions about the content of this
custom document.