Training ChatGPT on Custom Data: How to Make This Magic AI Tool Even More Powerful
ChatGPT is one of the most exciting artificial intelligence tools available today. Imagine having a versatile AI companion at your fingertips, one capable of crafting captivating blog posts, engaging social media content, and even eloquent product descriptions for your website. But ChatGPT's prowess extends far beyond the realm of text generation: it's a true business catalyst.
ChatGPT doesn't halt at prose and strategy. It has a knack for coding too, effortlessly sketching out the backbone of a web application tailored to your services. The possibilities seem limitless, right?
Almost. As with any star, even ChatGPT has its boundaries. Post-September 2021 developments remain uncharted waters, and when it comes to hyper-specific queries tied intricately to your enterprise, ChatGPT might not be your oracle by itself.
But worry not, for within the confines of this article lies the key to unlock ChatGPT's full potential. We unveil a solution that empowers you to infuse ChatGPT with bespoke information, making it a powerhouse of industry-specific wisdom. Harness the fusion of AI brilliance and tailored insights as ChatGPT evolves from a tool to your indispensable partner in navigating business challenges.
Discover how to supercharge ChatGPT's capabilities and make it an intrinsic part of your business success story. Let's explore the journey of enhancing ChatGPT's knowledge base and transforming it into your ultimate business ally.
What is ChatGPT?
ChatGPT is a chatbot based on a large language model created by OpenAI and released in November 2022. A large language model is a deep learning model designed to process human language.
For a machine, text is nothing more than a sequence of symbols, but with the help of such models machines can analyze texts, summarize them, and draw conclusions from them.
There are many different large language models. Some of them are AlexaTM (released by Amazon), Minerva (created by Google), LLaMA (released by Meta) and, of course, the GPT models (GPT-2, GPT-3, GPT-4). The ChatGPT chatbot uses the GPT-3.5 model. The source of data for this model is websites, textbooks, and articles that existed before September 2021. This information was used to train the model, and as a result it can answer many kinds of human questions.
What can ChatGPT do?
ChatGPT is capable of performing many different tasks.
Writing programming code
This AI tool is good at writing simple code in many existing languages: Java, JavaScript, Python, SQL, etc. The only thing to do is to formulate your requirements as precisely as you can. Moreover, the chatbot can describe steps for creating simple web applications or even debug existing programming code.
Writing any text content
This artificial helper is able to produce articles or content for blogs. You don’t need me anymore to create this article; ChatGPT can do it easily. You should just specify the topic, language style, length, etc. It also can create posts for social media and emails for your friends or customers. Even writing a poem or a song is not a problem for this magic artificial genius. P.S.: Don’t worry, this article was not written by ChatGPT.
Analyzing, summarizing, retelling texts
If you are bored reading a big book for tomorrow's exam, just keep calm, because you have a magic assistant. You can ask for a summary of the necessary content with the main points underlined.
Improving your marketing setup
ChatGPT is not a bad marketer (sorry, Integrio marketing team). It can generate ideas for your future promotion campaign, create slogans and content, and plan the campaign itself. SEO-optimized text is not a problem for this AI tool either. It can also write product descriptions and product titles for an online marketplace.
Personal assistance
Of course ChatGPT is not a doctor, psychologist, or fitness trainer, so you must be careful implementing its ideas, but still it can give very useful basic pieces of advice pertaining to your health, training program, or improving your mood.
Many more
ChatGPT can solve mathematical problems, play games, help you find a job by writing a CV and cover letters, explain difficult topics, suggest suitable places to visit and books to read, and so on.
Isn't it too good to be true?
Sure, it is not a real person, so the answers sometimes are flawed. In general the main problems are:
Security and privacy issues
Unfortunately, nobody can guarantee that the information provided will be known just by you and your friend ChatGPT, so you should be very careful when providing private information.
Wrong answers
Like all of us, ChatGPT is not perfect and sometimes can generate wrong answers. So you should be careful relying on its answers.
Absence of specific information about your service or company
As mentioned above, ChatGPT was trained on websites, textbooks, and articles, so it cannot help with questions that are specific to your business. Fortunately, this problem can be solved, and we will describe an approach that supplements ChatGPT's data with the necessary specific information.
Feed ChatGPT your own data
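In this section we will use the Python programming language and the langchain library to make the model answer from our specific data. Strictly speaking, the model itself is not retrained: the most relevant pieces of our text are found and passed to it together with the question. This process consists of several steps: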
Writing a text with information to utilize
In this step, we should provide detailed data to be used. It can be created in different formats, such as .txt, .pdf, .html, etc. Let’s assume that all our texts are in the .txt format and that they are located in the folder “data.” Then we can load the information by executing this code:
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('data', glob='**/*.txt')
documents = loader.load()
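To make the loading step more tangible, here is a minimal standard-library sketch of what "read every .txt file under a folder" amounts to. It is not how DirectoryLoader is implemented; the helper name load_text_files and the dictionary layout are assumptions made purely for illustration.

```python
from pathlib import Path


def load_text_files(folder):
    """Read every .txt file under `folder` (recursively), sorted by path.

    Hypothetical helper for illustration; it is not part of langchain.
    """
    documents = []
    for path in sorted(Path(folder).rglob("*.txt")):
        documents.append(
            {"source": str(path), "text": path.read_text(encoding="utf-8")}
        )
    return documents
```

Calling load_text_files('data') would return one entry per text file, each carrying the file path and its contents.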
Splitting your data into smaller pieces
This step makes the process more efficient, because the model will be given not the whole text but only the pieces most relevant to the question. The maximum number of characters in one piece is set by the parameter chunk_size, which equals 1000 in our case. The splitting can be done with the following code:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
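The roles of chunk_size and chunk_overlap can be seen in a simplified sketch. This is only an illustration that cuts at fixed character positions; the real RecursiveCharacterTextSplitter prefers to break at natural separators such as paragraphs, lines, and spaces. The function name split_into_chunks is an assumption.

```python
def split_into_chunks(text, chunk_size=1000, chunk_overlap=0):
    """Cut `text` into pieces of at most `chunk_size` characters.

    Consecutive pieces share `chunk_overlap` characters. Simplified
    illustration only: it cuts at fixed positions and ignores
    paragraph, line, and word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new piece starts after the last
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

With the settings used in this article (chunk_size=1000, chunk_overlap=0), a text of 2500 characters would be cut into pieces of 1000, 1000, and 500 characters.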
Text vectorization (embedding), creating a vector store, retrieving appropriate embeddings
To find the pieces relevant to a question, all text data is converted into vectors (embeddings), so that texts with similar meaning end up close to each other. In our case we use the OpenAIEmbeddings and FAISS classes to convert text to vectors and create a vector store; other tools can be used instead. We then call the as_retriever() method to obtain a retriever, which will fetch the pieces most appropriate for the answer from the vector store. These steps can be written as
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# key holds your OpenAI API key
embeddings = OpenAIEmbeddings(openai_api_key=key)
# embed every piece and index the vectors
docsearch = FAISS.from_documents(texts, embeddings)
retriever = docsearch.as_retriever()
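The idea behind vector search can be shown with a toy example. The sketch below is an assumption-laden illustration, not what OpenAIEmbeddings or FAISS do internally: it replaces learned embeddings with simple word counts and compares them by cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts instead of a learned dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k chunks whose vectors are closest to the question's."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]
```

A real embedding model also places texts close together when they share meaning but not words, which word counts cannot do; the ranking step, however, follows the same principle.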
Large language model selection
In this step, we select a large language model to use. In this example we use the text-davinci-003 model (the default) from the OpenAI class of the langchain library. The corresponding code is
from langchain import OpenAI
# temperature is explained in the parameter list further down
llm = OpenAI(openai_api_key=key, temperature=temperature)
Answering a question
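The final step is to build a question-answering chain from the model and the retriever and to call it with the question:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
answer = qa({"query": question})['result']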
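The "stuff" chain type means that all retrieved pieces are simply placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows; the function name build_stuff_prompt and the wording of the template are assumptions for illustration, and langchain's own template is worded differently.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Put all retrieved chunks into one prompt, followed by the question.

    The template wording is an assumption made for this illustration.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

This also shows why splitting matters: everything in the prompt has to fit into the model's input limit, so only a few relevant pieces can be included.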
All steps in one function
Congratulations! The model can now draw on our specific data and answer questions about everything we want it to know. We just need to ask a question to receive a specific answer. Enjoy it!
For your convenience all steps are gathered in one function create_gpt_output(question, temperature, key). It takes three parameters:
question is a question to answer.
temperature is a parameter that controls how random the model's wording is. The OpenAI API accepts values from 0 to 2: the lower the value, the more deterministic and focused the answers; the higher the value, the more varied they are. Temperature does not switch between your text and the model's built-in knowledge, because the retrieved pieces are passed to the model regardless of it. For our example the value is 0.01, since we want answers that stick closely to the supplied text.
key is an OpenAI API key, which can be generated on the website https://platform.openai.com in the section API keys (https://platform.openai.com/account/api-keys). You need to register an account to obtain a key. At the time of writing, each new user received $5 of free credit.
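The effect of temperature can be illustrated with a small numerical sketch. It assumes the commonly described mechanism in which the model's raw scores for candidate words are divided by the temperature before being turned into probabilities; the function name and the example scores are made up for illustration.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate words.
scores = [2.0, 1.0, 0.5]
low = softmax_with_temperature(scores, 0.01)   # almost all weight on the top word
high = softmax_with_temperature(scores, 1.0)   # weight spread across the words
```

At a temperature of 0.01 the top-scoring word is chosen practically every time, which is why answers become nearly deterministic.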
So, have fun: write your text and play with ChatGPT. The code is below.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader

openai_api_key = "Put your OpenAI key here"


def create_gpt_output(question, temperature=0.01, key=openai_api_key):
    # Load all .txt files from the "data" folder
    loader = DirectoryLoader('data', glob='**/*.txt')
    documents = loader.load()
    # Split the documents into pieces of at most 1000 characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    # Embed the pieces and store the vectors for similarity search
    embeddings = OpenAIEmbeddings(openai_api_key=key)
    docsearch = FAISS.from_documents(texts, embeddings)
    retriever = docsearch.as_retriever()
    # Select the model and build the question-answering chain
    llm = OpenAI(openai_api_key=key, temperature=temperature)
    qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    # Ask the question and return only the answer text
    query = str(question)
    result = qa({"query": query})
    return result['result']
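Since the key gives access to a paid account, you may prefer not to keep it in the source file. One possible approach, sketched below, is to read it from an environment variable. The helper name load_api_key is hypothetical; OPENAI_API_KEY is the variable name conventionally used for OpenAI keys.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; it is not part of langchain.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The result could then be passed as the key argument, for example create_gpt_output(question, key=load_api_key()).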
Information about innovative technologies and practices can give you a competitive advantage, and personalized solutions will meet your unique needs.