Pebblo SafeRetriever provides visibility and enforcement for the Semantic, Entity, and Identity aspects of context retrieval in RAG applications.
Identity Enforcement needs two parts:
- Identity-aware data retrieval and vector DB ingestion
- Identity enforcement on the retrieval chain
Here is the sample code for GoogleDriveLoader with the load_auth parameter set to True.
    self.loader = PebbloSafeLoader(
        GoogleDriveLoader(
            folder_id=folder_id,
            token_path="./google_token.json",
            recursive=True,
            file_loader_cls=UnstructuredFileIOLoader,
            file_loader_kwargs={"mode": "elements"},
            load_auth=True,  # Also load the identities authorized to access each file
        ),
        name=self.app_name,  # App name (Mandatory)
        owner="Joe Smith",  # Owner (Optional)
        description="Identity enabled SafeLoader and SafeRetriever app using Pebblo",  # Description (Optional)
    )
Here is the sample code for Pebblo SafeRetriever, with the authorized_identities of the user accessing the RAG application passed in auth_context.
    from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput

    retrieval_chain = PebbloRetrievalQA.from_chain_type(
        llm=self.llm,
        app_name=self.app_name,
        owner="Joe Smith",
        description="Identity and Semantic filtering using PebbloSafeLoader, and PebbloRetrievalQA",
        chain_type="stuff",
        retriever=self.vectordb.as_retriever(),
        verbose=True,
    )

    # Identities (user and groups) of the person asking the question
    auth_context = {
        "authorized_identities": [
            "joe@acme.io",
            "hr-group@acme.io",
            "us-employees-group@acme.io",
        ]
    }
    auth_context = AuthContext(**auth_context)
    chain_input = ChainInput(query=question, auth_context=auth_context)
    answer = retrieval_chain.invoke(chain_input.dict())
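To make the idea of identity enforcement concrete, here is a minimal, self-contained sketch of the kind of check it implies: a document is visible only if it shares at least one identity with the user. This is not Pebblo's implementation. The `Doc` class, the `filter_by_identity` helper, and the `authorized_identities` metadata key are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document chunk (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_identity(docs, authorized_identities):
    """Keep only documents that share at least one identity with the user."""
    user_ids = {identity.lower() for identity in authorized_identities}
    allowed = []
    for doc in docs:
        # Assumed metadata key; documents without it are denied by default.
        doc_ids = {i.lower() for i in doc.metadata.get("authorized_identities", [])}
        if user_ids & doc_ids:
            allowed.append(doc)
    return allowed


docs = [
    Doc("Salary bands", {"authorized_identities": ["hr-group@acme.io"]}),
    Doc("Board minutes", {"authorized_identities": ["execs@acme.io"]}),
    Doc("Untagged note"),
]
visible = filter_by_identity(docs, ["joe@acme.io", "hr-group@acme.io"])
print([d.text for d in visible])  # ['Salary bands']
```

Denying documents that carry no identity metadata is a deliberate choice in this sketch: it fails closed rather than exposing untagged content.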
This solution requires the following two private LangChain packages:
- langchain
- langchain-community
These two packages, which provide GoogleDriveLoader with authorized identities and the PebbloRetrievalQA chain with identity enforcement, can be installed using the following steps.
    $ git clone -b pebblo_identity_saferetriever https://github.com/daxa-ai/langchain.git
    $ cd langchain
    # Install updated langchain-community package that has all document loaders,
    # including GoogleDriveLoader with authorized-identities feature
    $ (cd libs/community; pip install .)
    # Install updated langchain package that has the new PebbloRetrievalQA chain
    $ (cd libs/langchain; pip install .)
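After installing, it can help to confirm that both distributions are visible to the interpreter you plan to run the app with. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the check only reports version strings, so it cannot tell whether the fork or the upstream release is installed.

```python
from importlib import metadata


def installed_versions(packages=("langchain", "langchain-community")):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```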
Here are the two corresponding PRs in LangChain for this feature:
- community: add authorization identities to GoogleDriveLoader (langchain-ai/langchain#18813)
- langchain: add PebbloRetrievalQA chain with Identity & Semantic enforcement (langchain-ai/langchain#20641)
GoogleDriveLoader comes with some prerequisites. Please refer to this section or follow the steps below:
- Create a Google Cloud project or use an existing project.
- Enable the Google Drive API.
- Refer to Authorize credentials for desktop app, or follow the section below, to download credentials.json.
  a. Save the credentials.json file from the step above at the ~/.credentials/credentials.json path.
  b. Put the absolute path of the credentials.json file in the GOOGLE_APPLICATION_CREDENTIALS environment variable:
     export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.credentials/credentials.json"
- Customize the OAuth consent screen for your project: https://console.cloud.google.com/apis/credentials/consent
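Once credentials.json is in place, a quick check like the one below can catch a missing or mistyped GOOGLE_APPLICATION_CREDENTIALS value before the loader runs. This is only a sketch: `resolve_credentials_path` is a hypothetical helper, not part of LangChain or Pebblo, and it verifies that the path exists, not that the credentials are valid.

```python
import os
from pathlib import Path


def resolve_credentials_path(env=None):
    """Return the path stored in GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_absolute():
        raise RuntimeError(f"Expected an absolute path, got: {value}")
    if not path.is_file():
        raise RuntimeError(f"No file found at: {path}")
    return path


if __name__ == "__main__":
    try:
        print(f"Using credentials at {resolve_credentials_path()}")
    except RuntimeError as err:
        print(f"Credentials check failed: {err}")
```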
To authenticate end users and access user data in your app, you need to create one or more OAuth 2.0 Client IDs. A client ID is used to identify a single app to Google's OAuth servers. If your app runs on multiple platforms, you must create a separate client ID for each platform.
- In the Google Cloud console, go to Menu > APIs & Services > Credentials.
- Click Create Credentials > OAuth client ID.
- Click Application type > Desktop app.
- In the Name field, type a name for the credential. This name is only shown in the Google Cloud console.
- Click Create. The OAuth client created screen appears, showing your new Client ID and Client secret.
- Click OK. The newly created credential appears under OAuth 2.0 Client IDs.
- Save the downloaded JSON file as credentials.json, and move the file to the ~/.credentials/credentials.json path.
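A common mistake at this point is downloading the client file for a different application type. As a hedged sketch, the check below assumes the desktop-app client file keeps its settings under a top-level `installed` key containing `client_id` and `client_secret`. `is_desktop_client_config` is a hypothetical helper and the sample values are made up.

```python
import json


def is_desktop_client_config(raw_json):
    """Return True if the JSON text looks like an OAuth desktop-app client file."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    # Assumption: desktop-app clients use the "installed" top-level key.
    installed = data.get("installed") if isinstance(data, dict) else None
    if not isinstance(installed, dict):
        return False
    return all(installed.get(key) for key in ("client_id", "client_secret"))


sample = json.dumps(
    {"installed": {"client_id": "example-id", "client_secret": "example-secret"}}
)
print(is_desktop_client_config(sample))         # True
print(is_desktop_client_config('{"web": {}}'))  # False
```

To check a real file, read its contents from ~/.credentials/credentials.json and pass the text to the function.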
Download and save credentials.json for your GCP project at ~/.credentials/credentials.json
- Set up a virtual environment and install the dependencies listed in langchain/identity-rag/requirements.txt:
      pip install -r langchain/identity-rag/requirements.txt
- Run the application:
      python3 langchain/identity-rag/pebblo_identity_rag.py
- It will need the following inputs:
  - For the ingestion user:
    - Admin email address: used for listing groups to determine the identity.
    - service-account.json path: service account credentials file for your Google account with sufficient permissions.
    - Folder ID: ID of the folder where the documents to be loaded are stored.
  - End user email address, against which the identity will be matched.
  - Prompt from the end user.
Based on these inputs, the application loads the data from the given Google Drive folder and answers the input prompt according to the end user's permissions.
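The identity-matching step can be pictured as turning the end user's email address into the list of identities passed in auth_context. The sketch below is illustrative only: it assumes the group memberships have already been fetched into a plain dictionary, and both `resolve_identities` and the sample data are made up rather than taken from the application.

```python
def resolve_identities(user_email, group_members):
    """Return the user's email plus every group that lists the user as a member."""
    user = user_email.lower()
    identities = [user]
    # Sort group names so the result is deterministic.
    for group, members in sorted(group_members.items()):
        if user in {member.lower() for member in members}:
            identities.append(group.lower())
    return identities


# Made-up group memberships, keyed by group email address.
group_members = {
    "hr-group@acme.io": ["joe@acme.io", "ann@acme.io"],
    "us-employees-group@acme.io": ["joe@acme.io"],
    "finance-group@acme.io": ["ann@acme.io"],
}
auth_context = {
    "authorized_identities": resolve_identities("joe@acme.io", group_members)
}
print(auth_context)
```

This sketch only considers direct membership; nested groups would need an extra expansion step.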