Doctran: language translation
Comparing documents through embeddings can work across multiple languages when the embedding model is multilingual. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they mean the same thing.
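To make "similar positions in the vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are hypothetical toy values invented for illustration; they are not the output of any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (made up for this example, not real model output).
english = [0.81, 0.12, 0.55]    # stands in for "Harrison says hello"
spanish = [0.79, 0.15, 0.57]    # stands in for "Harrison dice hola"
unrelated = [0.05, 0.93, 0.10]  # stands in for an unrelated sentence

print("english vs spanish:  ", cosine_similarity(english, spanish))
print("english vs unrelated:", cosine_similarity(english, unrelated))
```

With a real multilingual embedding model, the first score would be expected to be noticeably higher than the second; how large the gap is depends on the model and the languages involved.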
However, it can still be useful to use an LLM to translate documents into other languages before vectorizing them. This is especially helpful when users are expected to query the knowledge base in different languages, or when state-of-the-art embedding models are not available for a given language.
We can accomplish this with the Doctran library, which relies on OpenAI's function calling feature to translate documents between languages.
%pip install --upgrade --quiet doctran
Note: you may need to restart the kernel to use updated packages.
from langchain_community.document_transformers import DoctranTextTranslator
from langchain_core.documents import Document
from dotenv import load_dotenv
load_dotenv()
True
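With the environment loaded, a translation call might look like the sketch below. It assumes that `DoctranTextTranslator` accepts a `language` argument and exposes a synchronous `transform_documents` method, as in recent `langchain_community` releases, and that the OpenAI key is read from the `OPENAI_API_KEY` environment variable. The helper names `has_openai_key` and `translate_to` are hypothetical and exist only for this example.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the mapping.

    Hypothetical helper: defaults to the process environment, which
    load_dotenv() populates from a local .env file.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


def translate_to(text, language="spanish"):
    """Translate a single string with Doctran and return the translated text.

    Hypothetical wrapper. It assumes DoctranTextTranslator(language=...) and
    transform_documents(...) behave as in recent langchain_community versions.
    Calling it makes a network request to OpenAI.
    """
    # Deferred imports, so the key check above works without these packages.
    from langchain_community.document_transformers import DoctranTextTranslator
    from langchain_core.documents import Document

    if not has_openai_key():
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")

    translator = DoctranTextTranslator(language=language)
    translated = translator.transform_documents([Document(page_content=text)])
    return translated[0].page_content
```

Calling `translate_to("Harrison says hello", language="spanish")` would then send the text to OpenAI and return the translation. Checking for the key first gives a clearer error than a failure inside the library.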