As more businesses and individuals embrace digital documentation, the need for tools that facilitate communication with digital documents has risen. In this blog post, we delve into the inner workings of these innovative tools, uncovering the technology that powers them.
Mike is an experienced Product Manager who focuses on all the “non-development” areas of My AskAI, from finance and customer success to product design, copywriting, testing and more.
Digital document interaction tools have become increasingly popular recently, offering users the ability to ask questions and receive contextually relevant answers directly from the documents themselves. To help you understand the technology and processes behind these tools, we'll walk you through each step, from text extraction to presenting the final answer.
1. Text Extraction: Laying the Foundation
When you upload a document to one of these tools, the first thing it does is extract all the text. This can be done by scraping the text from a website or by extracting it directly from a PDF or other text file. Once all the textual content has been extracted, the tool can analyze and process it in the later steps.
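To make the extraction step more concrete, here is a minimal sketch for the web page case, using only Python's standard library. The `extract_text` function and the `_TextCollector` class are hypothetical names for illustration, not the code any particular tool uses, and PDFs would need a dedicated library instead.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML string as one space-separated line."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


page = "<h1>Title</h1><p>Hello world</p><script>var x = 1;</script>"
print(extract_text(page))
```

Joining the pieces with single spaces throws away layout, which is usually acceptable here because the later steps only work with the raw text.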
2. Chunking: Dividing the Text
The next step is to divide the extracted text into smaller segments, known as "chunks." Each chunk typically contains a predefined number of characters, such as 200 or 1,500. To maintain context between adjacent chunks, the tool will intentionally overlap some characters (usually 100 or 200) between them. The right chunk length depends on your use case, but it is generally comparable to the length of the inputs and responses you expect. For example, if you are building a chatbot, you may want to use shorter chunks to better reflect the way people use chats.
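As a rough illustration of chunking with overlap, here is a short sketch of a character-based splitter. The function name `chunk_text` and its default values are assumptions chosen to match the numbers mentioned above; real tools may split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so the last `overlap` characters of one chunk are repeated at the start of the next.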
3. Embedding: Generating Descriptors
Once the text is divided into chunks, embedding takes place. Embedding (we use OpenAI’s Embeddings API) is the process of generating many numerical ‘descriptors’ for each chunk, which together provide a comprehensive ‘understanding’ of the text within it.
While we don’t know what each descriptor really represents, you could imagine, for example, a single descriptor for a chunk being a measure of how funny it is and another being a measure of formality, and so on (until you have described that chunk in 1,500 or more different ways).
Every chunk is analyzed using the same set of descriptors, so you end up with a ‘vector’ (a long list of numbers) for each chunk that acts a bit like a fingerprint for its text.
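To show what "text in, list of numbers out" looks like, here is a toy stand-in for an embedding model. It is not how OpenAI's embeddings work: `toy_embed` is a hypothetical function that simply counts words in hashed buckets, so it captures word overlap rather than meaning. It is only meant to make the shape of the data concrete.

```python
import hashlib
import math
import re


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Map text to a fixed-length vector by counting words in hashed buckets."""
    vector = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash, so the same word always lands in the same bucket.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dims
        vector[index] += 1.0

    # Scale to unit length so vectors for long and short texts are comparable.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


vector = toy_embed("Refunds are accepted within 30 days.")
print(len(vector))
# → 64
```

A real embedding model returns the same kind of object, just with far more dimensions and with values that reflect meaning rather than exact word matches.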
4. Vector Database: Cataloging the Data
The descriptors (vectors) generated for each chunk are then logged into a ‘vector database’ so they can be searched at a later date. So effectively all of your chunks of data are stored
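Conceptually, a vector database keeps each chunk's text next to its vector. The sketch below is a deliberately simple in-memory version to illustrate that idea; `VectorRecord` and `InMemoryVectorStore` are hypothetical names, and a production vector database adds persistence and fast indexing on top.

```python
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    chunk_id: int
    text: str
    vector: list[float]


@dataclass
class InMemoryVectorStore:
    dims: int  # every stored vector must have this length
    records: list[VectorRecord] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> int:
        """Store a chunk with its vector and return the id assigned to it."""
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dimensions, got {len(vector)}")
        chunk_id = len(self.records)
        self.records.append(VectorRecord(chunk_id, text, list(vector)))
        return chunk_id

    def get(self, chunk_id: int) -> VectorRecord:
        """Look up a stored chunk by its id."""
        return self.records[chunk_id]


store = InMemoryVectorStore(dims=3)
first_id = store.add("Refunds are accepted within 30 days.", [0.9, 0.1, 0.0])
print(store.get(first_id).text)
```

Checking the vector length on insert catches a common mistake, which is mixing vectors produced by different embedding models in one store.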
5. Asking a Question: Input Matters
When you pose a question to the tool, it is converted into a vector (a set of descriptors) in the same way as the text chunks. This ensures that the question can be compared directly with the chunk vectors already stored in the database.
6. Semantic Search: Finding the Perfect Match
With the question properly formatted, a ‘semantic search’ is conducted to find text chunks with characteristics similar to the asked question. By comparing the question's vector to those of the chunks, the tool identifies the most closely related text segments, which can then be retrieved for further processing.
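One common way to compare vectors is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below assumes that measure (the exact one a given tool uses may differ) and uses tiny hand-written vectors in place of real embeddings. The names `cosine_similarity` and `top_k` are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(question_vector, indexed_chunks, k=3):
    """Return the k (score, text) pairs whose vectors best match the question."""
    scored = [
        (cosine_similarity(question_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up 2-dimensional vectors, purely for illustration.
chunks = [
    ("Refund policy", [1.0, 0.0]),
    ("Shipping times", [0.0, 1.0]),
    ("Returns process", [0.9, 0.1]),
]
for score, text in top_k([1.0, 0.0], chunks, k=2):
    print(round(score, 3), text)
```

This version compares the question against every chunk, which is fine for small documents; dedicated vector databases use indexes to avoid scanning everything.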
7. Generating a Prompt: Crafting the Answer
The retrieved text chunks are used as part of a prompt that guides the tool in crafting an answer to the question. The prompt essentially instructs the tool to provide an answer using the context derived from the related text chunks. Additional instructions can be included, such as an emphasis on truthfulness. Some pre-processing may also be applied to the question to improve search results and enhance answer accuracy.
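Here is a small sketch of how such a prompt might be assembled. The wording of the instructions and the `build_prompt` function are assumptions for illustration, not the prompt My AskAI actually uses.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into a single prompt."""
    # Number the chunks so an answer can point back to its sources.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


print(build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
))
```

The instruction to admit when the context has no answer is one example of the "additional instructions" mentioned above, aimed at discouraging made-up answers.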
8. Presenting the Answer: The Final Result
Finally, the tool generates an answer based on the prompt and presents it to the user. In some cases (like with My AskAI), the tool may also display the text chunks or references that contributed to the answer, providing transparency and allowing users to better understand the context and sources behind the information provided.
Conclusion
‘ChatGPT with your data’ tools leverage advanced technology and a series of carefully orchestrated steps to provide users with accurate and contextually relevant answers to their questions. We hope that by understanding the processes behind these tools, including text extraction, chunking, embedding, semantic search, and more, you can better appreciate their capabilities and use them to their full potential. We expect that as these tools continue to develop and improve, their ability to streamline and enhance our interactions with digital documents will only grow stronger.