VectorStore/QA, learn more¶
NOTE: this uses Cassandra's "Vector Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
In the previous Quickstart, you created the index and added the corpus of text to it in a single step.
In most cases these two operations happen at different times; moreover, new documents often keep being ingested afterwards.
This notebook demonstrates further interactions you can have with a Cassandra Vector Store.
It is assumed you have run the "VectorStore/QA, Quickstart" notebook (so that the vector store is not empty).
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
The setup is similar to the one you saw:
from langchain.vectorstores.cassandra import Cassandra
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = 'astra_db' # 'astra_db'/'local'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)
Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity.
import os
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)
if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM+embeddings from Vertex AI
Re-use an existing Vector Store¶
Creating this Cassandra vector store re-connects it to the data already on the DB.
In practice, you are loading an existing, pre-populated vector store for further usage.
(Make sure you use the very same embedding function every time! This is why there is a separate table for each embedding function, i.e. for each llmProvider.)
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name='vs_test1_' + llmProvider,
)
You can then re-instantiate the index from the vector store with:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)
and use it as you saw in the quickstart:
query = "Who is Luchesi?"
index.query(query, llm=llm)
'Luchesi is a wine critic.'
Further usage of the vector store¶
These are some of the ways you can query the store:
myCassandraVStore.similarity_search_with_score(
    "Does anyone have a coughing fit?",
    k=1,
)
[(Document(page_content='"Nitre," I replied. "How long have you had that cough?"\n\n"Ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh!\nugh! ugh!"\n\nMy poor friend found it impossible to reply for many minutes.\n\n"It is nothing," he said, at last.', metadata={'source': 'texts/amontillado.txt'}), 0.8610012756291656)]
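To give an intuition for the score returned next to each document, here is a minimal, self-contained sketch of a top-k similarity search over toy vectors. It assumes a cosine-style similarity, which is an assumption and not something this notebook states about the store. The vectors, the labels and the helper names cosine_similarity and top_k are hypothetical and not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, entries, k=1):
    """Return the k (text, score) pairs with the highest similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in entries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, three dimensions for readability)
toy_entries = [
    ("coughing passage", [0.9, 0.1, 0.0]),
    ("wine passage", [0.1, 0.9, 0.2]),
    ("vault passage", [0.0, 0.2, 0.9]),
]
best = top_k([1.0, 0.0, 0.0], toy_entries, k=1)
print(best)
```

A real store embeds the query text first and searches an index rather than scanning every row, but the shape of the result, a list of (document, score) pairs, is the same.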
Adding new documents¶
Start with a very off-topic question, to demonstrate that no relevant documents are found (yet).
Note: depending on the embedding function, you might still see some results at this stage, even though they are off-topic in practice. In a full end-to-end QA session, however, the LLM would likely discard them and presumably end up saying, "I don't know".
SPIDER_QUESTION = 'Compare Agelenidae and Lycosidae'
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=1,
    score_threshold=0.8,
)
[(Document(page_content='"As you are engaged, I am on my way to Luchesi. If any one has a\ncritical turn, it is he. He will tell me--"\n\n"Luchesi cannot tell Amontillado from Sherry."\n\n"And yet some fools will have it that his taste is a match for your\nown."\n\n"Come, let us go."\n\n"Whither?"\n\n"To your vaults."\n\n"My friend, no; I will not impose upon your good nature. I perceive\nyou have an engagement. Luchesi--"', metadata={'source': 'texts/amontillado.txt'}), 0.8116434095595186)]
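The score_threshold parameter can be pictured as a plain filter applied to the scored results, which is why an off-topic passage scoring about 0.81 still gets through a 0.8 threshold. The sketch below assumes a keep-if-greater-or-equal rule, which may differ from the library's exact comparison. The function filter_by_threshold and the sample labels and scores are hypothetical.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep only the (doc, score) pairs whose score reaches the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]


# Hypothetical scored results: (label, relevance score)
scored = [
    ("Luchesi passage", 0.81),
    ("vault passage", 0.75),
    ("carnival passage", 0.62),
]
kept = filter_by_threshold(scored, score_threshold=0.8)
print(kept)
```

Raising the threshold trades recall for precision: a stricter value would have filtered out the off-topic passage too, at the risk of dropping borderline but relevant ones.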
You can add a couple of relevant paragraphs to the index, using the add_texts primitive:
spiderFacts = [
"""
The Agelenidae are a large family of spiders in the suborder Araneomorphae.
The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,
while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,
such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.
Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow
somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually
patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal
surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,
which assists in informally distinguishing it from similar-looking species.
""",
"""
Jumping spiders are a group of spiders that constitute the family Salticidae.
As of 2019, this family contained over 600 described genera and over 6,000 described species,
making it the largest family of spiders at 13% of all species.
Jumping spiders have some of the best vision among arthropods and use it
in courtship, hunting, and navigation.
Although they normally move unobtrusively and fairly slowly,
most species are capable of very agile jumps, notably when hunting,
but sometimes in response to sudden threats or crossing long gaps.
Both their book lungs and tracheal system are well-developed,
and they use both systems (bimodal breathing).
Jumping spiders are generally recognized by their eye pattern.
All jumping spiders have four pairs of eyes, with the anterior median pair
being particularly large.
""",
]
spiderMetadatas = [
{'source': 'wikipedia/Agelenidae'},
{'source': 'wikipedia/Salticidae'},
]
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_texts(
        spiderFacts,
        spiderMetadatas,
    )
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for spFact, spMetadata in zip(spiderFacts, spiderMetadatas):
        thisId = myCassandraVStore.add_texts(
            [spFact],
            [spMetadata],
        )[0]
        print(thisId)
c35b450d84e94cef37de6a934da51860
03dcc418d50ee4c61bebaa92f6ee8005
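The one-item-at-a-time loop used for Azure above is a special case of sending inputs in batches no larger than what the embedding endpoint accepts. Here is a minimal sketch of that idea. The helpers batched and add_in_batches, and the stand-in fake_add_texts, are hypothetical names introduced for illustration; add_fn is assumed to take a list of texts and a list of metadata dicts and to return a list of ids, as add_texts does in this notebook.

```python
def batched(items, max_batch_size):
    """Yield consecutive slices of items, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


def add_in_batches(add_fn, texts, metadatas, max_batch_size=1):
    """Call add_fn once per batch and collect the returned ids in order."""
    ids = []
    for text_batch, meta_batch in zip(
        batched(texts, max_batch_size), batched(metadatas, max_batch_size)
    ):
        ids.extend(add_fn(text_batch, meta_batch))
    return ids


# Hypothetical stand-in for a store's add_texts: records calls, returns fake ids.
calls = []


def fake_add_texts(texts, metadatas):
    calls.append(len(texts))
    return [f"id-{text}" for text in texts]


all_ids = add_in_batches(fake_add_texts, ["a", "b", "c"], [{}, {}, {}], max_batch_size=1)
print(all_ids)
```

With the store from this notebook, add_fn would be myCassandraVStore.add_texts, and max_batch_size=1 reproduces the Azure mitigation.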
Another way is to add a text through LangChain's Document abstraction.
Note that LangChain's splitters turn long input documents into digestible (possibly overlapping) chunks without much boilerplate:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
lycoText = """
Wolf spiders are members of the family Lycosidae.
They are robust and agile hunters with excellent eyesight.
They live mostly in solitude, hunt alone, and usually do not spin webs.
Some are opportunistic hunters, pouncing upon prey as they
find it or chasing it over short distances;
others wait for passing prey in or near the mouth of a burrow.
Wolf spiders resemble nursery web spiders (family Pisauridae),
but wolf spiders carry their egg sacs by attaching them to their spinnerets,
while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.
Two of the wolf spider's eight eyes are large and prominent;
this distinguishes them from nursery web spiders,
whose eyes are all of roughly equal size.
This can also help distinguish them from the similar-looking grass spiders.
"""
lycoDocument = Document(
    page_content=lycoText,
    metadata={'source': 'wikipedia/Lycosidae'}
)
Use the splitter to "shred" the input document:
lycoDocs = mySplitter.transform_documents([lycoDocument])
lycoDocs
[Document(page_content='Wolf spiders are members of the family Lycosidae.\nThey are robust and agile hunters with excellent eyesight.\nThey live mostly in solitude, hunt alone, and usually do not spin webs.\nSome are opportunistic hunters, pouncing upon prey as they', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Some are opportunistic hunters, pouncing upon prey as they\nfind it or chasing it over short distances;\nothers wait for passing prey in or near the mouth of a burrow.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='this distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.\nThis can also help distinguish them from the similar-looking grass spiders.', metadata={'source': 'wikipedia/Lycosidae'})]
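To see what chunk_size and chunk_overlap mean, here is a deliberately simplified sketch: a fixed-size sliding window over characters. This is not how RecursiveCharacterTextSplitter works internally (as the output above shows, it breaks at separators such as newlines, so chunks end on line boundaries); the function sliding_window_chunks is a hypothetical helper that only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size < 1 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij"
chunks = sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each window repeats the last chunk_overlap characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.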
These are ready to be added to the index:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs:
        thisId = myCassandraVStore.add_documents([lycoDoc])[0]
        print(thisId)
078fb9e67d9ed9415d9ef7d1779f7e5d
32acf980292dac94d9e0cdab6a1f05b5
c0a279086e100f559b2fc59213312076
4a6305de53b6adec0f5e164c1f3856a0
902d340a12bfbb5756c15296d4a7bb49
Querying the store again¶
Time to repeat the question:
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=3,
    score_threshold=0.8,
)
[(Document(page_content='\n The Agelenidae are a large family of spiders in the suborder Araneomorphae.\n The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,\n while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,\n such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.\n Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow\n somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually\n patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal\n surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,\n which assists in informally distinguishing it from similar-looking species.\n ', metadata={'source': 'wikipedia/Agelenidae'}), 0.8497665037896931), (Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), 0.8337309247681222), (Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), 0.8216857071410616)]
Item removal and expiration¶
Time-To-Live (TTL)¶
If you provide a TTL value when creating the store, every entry will expire a fixed time after its insertion:
myCassandraVStoreWithTTL = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name='vs_test1_shortlived_' + llmProvider,
    ttl_seconds=120,
)
The following two documents will be available for two minutes.
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStoreWithTTL.add_documents(lycoDocs[0:2])
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[0:2]:
        thisId = myCassandraVStoreWithTTL.add_documents([lycoDoc])[0]
        print(thisId)
078fb9e67d9ed9415d9ef7d1779f7e5d
32acf980292dac94d9e0cdab6a1f05b5
Alternatively, for finer control over the time-to-live, you can specify it at insertion time; this value takes precedence over the store-level setting. So, these documents will survive for twenty seconds:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs[2:], ttl_seconds=20)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would fail with:
    #     "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[2:]:
        thisId = myCassandraVStore.add_documents([lycoDoc], ttl_seconds=20)[0]
        print(thisId)
c0a279086e100f559b2fc59213312076
4a6305de53b6adec0f5e164c1f3856a0
902d340a12bfbb5756c15296d4a7bb49
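The precedence rule described above (a TTL given at insertion time wins over the store-level one) can be modelled with a small in-memory stand-in. This is only a sketch of the semantics: in Cassandra the expiration happens on the database side, and the class ToyTTLStore, its methods and the fake clock are hypothetical names that are not part of LangChain or Cassandra.

```python
import time


class ToyTTLStore:
    """In-memory stand-in that mimics store-level and per-insert TTLs."""

    def __init__(self, ttl_seconds=None, clock=time.monotonic):
        self._default_ttl = ttl_seconds
        self._clock = clock
        self._rows = {}  # doc_id -> (text, expiry time or None)

    def add(self, doc_id, text, ttl_seconds=None):
        # A TTL given at insertion time wins over the store-level default.
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        expires_at = None if ttl is None else self._clock() + ttl
        self._rows[doc_id] = (text, expires_at)

    def get(self, doc_id):
        row = self._rows.get(doc_id)
        if row is None:
            return None
        text, expires_at = row
        if expires_at is not None and self._clock() >= expires_at:
            del self._rows[doc_id]  # expired: drop it lazily on read
            return None
        return text


# A fake clock makes the demo deterministic (no real waiting).
now = [0.0]
store = ToyTTLStore(ttl_seconds=120, clock=lambda: now[0])
store.add("a", "uses the store-level TTL")
store.add("b", "uses its own TTL", ttl_seconds=20)

now[0] = 30.0
after_30s = (store.get("a"), store.get("b"))
now[0] = 121.0
after_121s = store.get("a")
print(after_30s, after_121s)
```

After 30 simulated seconds only the entry with the 20-second TTL is gone; after 121 seconds the entry relying on the 120-second store default has expired as well.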
Manual removal of entries¶
You can delete individual documents from the store.
However, you first need their identifiers, which you can retrieve with a similarity search. The following method returns a list of matching 3-tuples, the last item of each being the id of the document:
spiderDocIds = []
for doc, score, docId in myCassandraVStore.similarity_search_with_score_id('Compare Agelenidae and Lycosidae'):
    print(f' * [{score:.3f}] "{doc.page_content[:32].strip()}..." ({docId})')
    spiderDocIds.append(docId)
 * [0.850] "The Agelenidae are a large..." (c35b450d84e94cef37de6a934da51860)
 * [0.834] "while the Pisauridae carry their..." (4a6305de53b6adec0f5e164c1f3856a0)
 * [0.822] "Wolf spiders resemble nursery we..." (c0a279086e100f559b2fc59213312076)
 * [0.813] "Wolf spiders are members of the..." (078fb9e67d9ed9415d9ef7d1779f7e5d)
At this point you can perform the actual document deletion:
for spiderDocId in spiderDocIds:
    myCassandraVStore.delete_by_document_id(spiderDocId)
The last method to remove entries from the store is demonstrated next.
Cleanup¶
You're done.
To leave the index empty for the next demo run, you may want to clear it (i.e. empty the table on the DB).
Just don't take this operation lightly in production!
myCassandraVStore.clear()