How All These ‘ChatGPT With Your Data’ Tools Work

Created time

Oct 4, 2023 08:03 AM

Author

1. Text Extraction: Laying the Foundation

When you upload a document to one of these tools, the first thing it does is extract all the text. This can mean scraping the text from a website or extracting it directly from a PDF or other text file. With all the textual content extracted, the tool can analyze and process the information in the later steps.
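As a rough illustration of the website case, the sketch below pulls the visible text out of an HTML string using only Python's standard library. The names `extract_text` and `_TextExtractor` are made up for this example, and real tools likely use more robust parsers (and separate libraries for PDFs), so treat it as a minimal sketch rather than how any particular product does it.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text(page_html)` on a downloaded page gives one long string of text, which is the input the chunking step expects.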

2. Chunking: Dividing the Text

The next step is to divide the extracted text into smaller segments, known as "chunks." Each chunk typically contains a predefined number of characters, such as 200 or 1,500. To maintain context between adjacent chunks, the tool intentionally overlaps some characters between them (usually 100 or 200, and always fewer than the chunk size). The right chunk length depends on your use case, but it is generally comparable to the length of the inputs and responses you expect. For example, if you are building a chatbot, you may want shorter chunks to reflect the shorter messages people tend to write in chats.
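Chunking with overlap can be sketched in a few lines. The function name `chunk_text` and its default sizes are assumptions for this example (the defaults simply reuse the figures mentioned above), and splitting purely by character count is the simplest possible approach, not necessarily what any given tool does.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=2`, the string `"abcdefghij"` becomes `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the one before it.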

3. Embedding: Generating Descriptors

Once the text is divided into chunks, embedding takes place. Embedding (we use OpenAI’s Embedding API) is the process of generating a set of numerical ‘descriptors’ for each chunk, which together capture an ‘understanding’ of the text within it.

While we don’t know what each descriptor really represents, you could imagine, for example, a single descriptor for a chunk being a measure of how funny it is and another being a measure of formality, and so on (until you have described that chunk in 1,500 or more different ways).

Every chunk is run through the same process and measured with the same descriptors, so you end up with a ‘vector’ (a long list of numbers) for each chunk that acts like a fingerprint for its text.
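To make the "text in, list of numbers out" idea concrete without calling a paid API, here is a deliberately crude stand-in. It is not OpenAI's Embedding API and not a learned model: the hypothetical `toy_embed` function just hashes words into a fixed number of buckets, so its numbers carry no real meaning beyond word overlap. It only illustrates the shape of the data that a real embedding step produces.

```python
import hashlib
import math
import re


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Map text to a fixed-length unit vector by hashing its words into buckets.

    This is NOT how a learned embedding model works; it only mimics the
    interface (text in, fixed-length list of numbers out).
    """
    vector = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives a hash that is stable across runs, unlike built-in hash()
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vector[bucket] += 1.0

    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return vector  # no words found: leave it as the zero vector
    return [x / length for x in vector]
```

Every chunk passed through `toy_embed` comes back as a list of the same length, which is the property that makes chunks comparable to one another later on.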

4. Vector Database: Cataloging the Data

The descriptors (vectors) generated for each chunk are then logged into a ‘vector database’ so they can be searched at a later date. In effect, all of your chunks of data are stored in a searchable form.
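A real vector database adds indexing and persistence, but the core idea of keeping each chunk next to its vector can be sketched with a plain in-memory list. The `Record` and `InMemoryVectorStore` names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    chunk_id: int
    text: str
    vector: list[float]


@dataclass
class InMemoryVectorStore:
    """Keeps every chunk next to its vector so both can be looked up later."""

    records: list[Record] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> int:
        """Store one chunk with its vector and return the id assigned to it."""
        if self.records and len(vector) != len(self.records[0].vector):
            raise ValueError("all vectors must have the same length")
        chunk_id = len(self.records)
        self.records.append(Record(chunk_id, text, list(vector)))
        return chunk_id

    def get(self, chunk_id: int) -> Record:
        """Look up a stored chunk by the id that add() returned."""
        return self.records[chunk_id]
```

The length check in `add` reflects the point made earlier: every chunk is described with the same set of descriptors, so every stored vector has the same number of entries.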

5. Asking a Question: Input Matters

When you pose a question to the tool, the question is converted into a vector of descriptors in the same way as the text chunks. This means the question can be compared directly against the chunk vectors already stored in the database.

6. Semantic Search: Finding the Perfect Match

With the question converted into a vector, a ‘semantic search’ is conducted to find the text chunks whose vectors are most similar to it. By comparing the question's vector to those of the chunks, the tool identifies the most closely related text segments, which are then retrieved for further processing.
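The comparison itself can be sketched as a brute-force search. This example assumes cosine similarity as the measure of closeness, which is a common choice but is not stated in this article, and the function names `cosine_similarity` and `top_k` are made up for illustration. Production vector databases typically use approximate indexes rather than scoring every chunk.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    chunks: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, text), best first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Here `query` is the question's vector and `chunks` is a list of `(text, vector)` pairs; the texts that come back are the ones passed on to the prompt step.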

7. Generating a Prompt: Crafting the Answer

The retrieved text chunks are used as part of a prompt that guides the tool in crafting an answer to the question. The prompt essentially instructs the tool to answer using the context provided by the related text chunks. Additional instructions can be included, such as an emphasis on truthfulness, and some pre-processing may be applied to the question to improve search results and answer accuracy.
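A prompt of this kind might be assembled as in the sketch below. The wording of the instructions and the `build_prompt` name are invented for this example and are not the prompt used by any specific tool; the numbered context entries are just one way to make chunks easy to reference.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one prompt string."""
    # Number each chunk so the answer can point back to its source
    context = "\n\n".join(
        f"[{number}] {chunk.strip()}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )
```

The second instruction sentence is an example of the "emphasis on truthfulness" mentioned above: it asks for an admission of ignorance instead of a guess when the retrieved chunks do not cover the question.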

8. Presenting the Answer: The Final Result

Finally, the tool generates an answer based on the prompt and presents it to the user. In some cases (like with My AskAI), the tool may also display the text chunks or references that contributed to the answer, providing transparency and allowing users to better understand the context and sources behind the information provided.

Conclusion

‘ChatGPT with your data’ tools combine a series of carefully orchestrated steps to provide users with accurate and contextually relevant answers to their questions. By understanding the processes behind these tools, including text extraction, chunking, embedding, semantic search, and prompt generation, you can hopefully better appreciate what they are capable of and use them to their full potential. We expect that as these tools continue to develop and improve, their ability to streamline and enhance our interactions with digital documents will only grow stronger.
