Written by: Atharva Kulkarni
Date: 27/11/23
As all you AI enthusiasts saw at the recent OpenAI DevDay 2023, OpenAI now lets you make custom assistants, or GPTs, trained on data you provide to the assistant. I took advantage of this and solved a major hurdle.
If you wish to read my entire journey of how I cracked this, then continue reading.
I think I am telling you the story all wrong, so let’s start from the beginning. I’ve been working on a module at my workplace for a long time. Can’t reveal much due to the hundreds of NDAs that I have signed. The important bit is that the module required us to integrate a third-party API written in old SOAP, with a WSDL in XML, whose documentation ran to more than 500 pages.
We attempted to integrate that API last year, but we couldn't because of how elements in the documentation reference one another. The only way to crack it was to go through all 500 pages and keep track of everything at once.
After India’s loss to Australia, which I promise I won’t be mentioning again, I picked this module up as a way to get myself to “move on” and start focusing on work.
This time around, unlike last year, I had a lot of AI apps at hand, which I tried to use to build an expert on the documentation of that third-party API. None of the LLMs or AI-based tools worked, due to one simple problem: they only had the context of the single page or single API that I entered into them. I needed something with the complete context of the entire documentation. That’s when I decided to crack this problem once and for all.
I went to the main table-of-contents page of the documentation, clicked the view-page-source button, and took out all the links from that webpage manually, like a dumb, hardworking horse.
Then I created a .txt file and dropped all the links there, borrowed a friend’s Windows computer, ran a PowerShell script to convert all those links into PDFs, and used iLovePDF.com to merge those into a single PDF.
Then I uploaded the PDF to OpenAI Assistants and created an assistant that answered all my queries about that documentation like an expert.
But then I found a better way to do it: crawl the pages automatically, make a JSON out of the entire site, and train the model using the JSON instead of the PDFs. Apparently, it does a better job.
Conclusion: no matter how much you try to make your job simpler, there will be an even simpler way to do it, which you will eventually learn.
Now, tutorial time.
If you have a sitemap of the documentation/platform/site that you wish to make a GPT out of, then follow this method:
Clone the repo using: git clone https://github.com/builderio/gpt-crawler
Install all the dependencies: npm i
Open the config.ts file
Configure it as per your needs:
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
Change url to whatever documentation URL you wish to crawl and set the match criteria for the sitemap. selector identifies the content to pull during the crawl, and maxPagesToCrawl limits the number of pages. Hit save!
Now run your crawler: npm start. Everything happens headlessly.
Tada! Now you have the entire doc in a JSON format in your output.json file
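For reference, each record in output.json on my run looked roughly like the sketch below; treat the field names as an assumption and verify against your own file, since gpt-crawler’s output format may change.

// Rough shape of one crawled page in output.json (assumption based on my run)
interface CrawledPage {
  title: string; // the page's <title>
  url: string;   // the URL that was crawled
  html: string;  // the text pulled from the configured selector
}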
If you do not have a sitemap, and your reference material is a set of random or unrelated web pages, then follow this method instead:
Get a Windows machine first, or be smart enough to transpile my PowerShell into a bash or zsh script (a rough cross-platform Node sketch follows these steps if you want a head start).
Create a .txt file of all the links that you wish to crawl or make a GPT out of. Each line must contain exactly one valid URL.
Now open PowerShell with administrator rights
Enter the command: Set-ExecutionPolicy Unrestricted
Save the following script on your computer with the extension .ps1:
$sourceFile = " " # the source file containing the URLs you want to convert
$destFolder = " " # converted PDFs will be saved here; the folder has to exist, and the path must end with "/"
$num = 1
foreach ($link in [System.IO.File]::ReadLines($sourceFile)) {
    $outfile = $num.ToString() + '.pdf'
    $outputPath = Join-Path -Path $destFolder -ChildPath $outfile
    # Use headless Chrome to print the page straight to a PDF
    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' --headless --print-to-pdf="$outputPath" "$link"
    # Give Chrome a moment to finish writing before moving on
    Start-Sleep -Seconds 3
    $num++
}
Now set $sourceFile to the path of the .txt file you made with all your links earlier, and set $destFolder to an empty folder; this is where all your PDFs will drop. Hit save!
Make sure your computer has Chrome installed at the same path as the one mentioned in the PowerShell script.
Now run it by navigating your PowerShell to the folder where you saved the script and running the command: ./YOUR_POWERSHELL_SCRIPT_FILENAME.ps1. Everything will happen headlessly.
Tada! Now you have all the web pages as PDFs in the destination folder.
Merge all the PDFs using iLovePDF.com
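If you cannot get hold of a Windows machine, here is a minimal Node/TypeScript sketch of the same trick, under a few assumptions: Chrome is installed (adjust chromePath for your OS), the destination folder already exists, and links.txt is the same one-URL-per-line file described above. It relies on the same Chrome flags as the PowerShell script.

import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { join } from "node:path";

// Assumptions: chromePath points at your Chrome binary and destFolder already exists.
const chromePath = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome";
const sourceFile = "links.txt"; // one valid URL per line
const destFolder = "pdfs";

const links = readFileSync(sourceFile, "utf8")
  .split("\n")
  .map((l) => l.trim())
  .filter(Boolean);

links.forEach((link, i) => {
  const outputPath = join(destFolder, `${i + 1}.pdf`);
  // Same idea as the PowerShell script: headless Chrome prints the page to a PDF and exits.
  execFileSync(chromePath, ["--headless", `--print-to-pdf=${outputPath}`, link]);
});

Run it with ts-node (or compile with tsc), then merge the resulting PDFs exactly as above.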
Now coming to the most important step: training your model.
You could do this in ChatGPT and make a custom GPT by:
Going to: https://chat.openai.com/
Clicking on your profile photo in the bottom-left corner of your screen
Clicking on "My GPTs" in the menu there
Clicking on "Create a GPT"
Clicking on the "Configure" button
Then, under the "Knowledge" section, clicking on "Upload a file" and uploading the file (PDF or JSON, whichever you created)
Or you could make an assistant on the OpenAI Platform by:
Going to: https://platform.openai.com/assistants
Clicking on the "Create" button
And finally, clicking on the "Upload" button and uploading the file (PDF or JSON, whichever you created)
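If you prefer code over clicking, here is a minimal sketch using the official openai Node SDK (v4, as it existed in late 2023); treat the model name, the "Docs Expert" name, and the exact SDK surface as assumptions to double-check against the current API reference.

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  // Upload the crawled documentation (output.json, or your merged PDF)
  const file = await openai.files.create({
    file: fs.createReadStream("output.json"),
    purpose: "assistants",
  });

  // Create an assistant with the retrieval tool so it can search the uploaded file
  const assistant = await openai.beta.assistants.create({
    name: "Docs Expert", // hypothetical name, pick your own
    instructions: "Answer questions using only the attached documentation.",
    model: "gpt-4-1106-preview", // assumption: the model available at the time of writing
    tools: [{ type: "retrieval" }],
    file_ids: [file.id],
  });

  console.log(`Assistant created: ${assistant.id}`);
}

main();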
Bingo, now you have a Custom GPT trained on the site/document that you were looking for. Ask as many questions as you like and chances are that it has all the answers. It worked for me. Maybe it will work for you as well.
I have mentioned the sources and references from where I gained this knowledge. Feel free to go to their sites and show support for the solutions they have introduced to the world. Also, let me know if this helps you in any way, shape or form; I’ll send you my address with a request to send me a small KitKat!
I hope this helps you!
Sources/References:
Let Love, Peace and Respect Prevail; this is ServerLord signing out!