OpenAI API Calls – Is that the future of programming?

What could I do in a few hours over a weekend with the latest OpenAI APIs? That was the question I asked myself a few days ago. Let’s see how it turned out.

I’ve always loved programming, but running a business takes a lot of time and effort in areas that are outside my comfort zone: planning new projects, thinking about the meaning of life, strategizing for growth, and solving problems that often revolve more around humans than technology.

Often, when I want to return to my ‘roots’, I take time off to do some programming… in Python these days; it’s quite relaxing and offers almost instant gratification.

I had never tried the OpenAI APIs before this. I had just poked around in their API playground a couple of times…

First Challenge: OpenAI Text-to-Speech

While taking my ChatGPT prompting skills to new heights, I took on a challenge sent by my dad: rewriting some of his compositions in various styles with AI assistance. This was mainly done to impress his old friends, who are quite scared of this ongoing AI revolution.

Emile Zola being one of my favorite French authors, I prompted something simple, and in less than a minute ChatGPT produced its magic. It fills me with nostalgia to think about all the effort such a composition would normally take…

I wanted to hear that story as an audiobook, but did not want to use an online tool. ChatGPT (at least as of the writing of this article) does not integrate speech synthesis, and I wanted to try what OpenAI announced a few weeks ago (before everything went berserk with their corporate board…).

The Text-to-Speech API code example is almost trivial to use. I needed extra code to read the text from a file and output the speech to a file with the same name but an .mp3 extension. Not remembering all the details, I simply asked ChatGPT to produce code to read a file, wedged the variable into the example, and booya! I had a new program to try.
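For reference, here is a minimal sketch of what that script looked like, assuming the OpenAI Python SDK v1 conventions; the input file name is illustrative:

from pathlib import Path
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

input_file = Path("story.txt")  # illustrative file name
text = input_file.read_text(encoding="utf-8")

# Generate speech and save it next to the source file, with an .mp3 extension
response = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
response.stream_to_file(input_file.with_suffix(".mp3"))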

Within a few seconds, the script generated a smoothly spoken audio file. The story is about the C language and its quest for perfection (in the style of Zola).

Disclaimer as per OpenAI requirements: that audio file was generated by AI, and its content was also generated by AI. Also, any names or persons mentioned in this audio are pure coincidence due to AI’s random thought process… (or is it?)

Well, that was completed in the first hour of my programming session. Most of that time was spent trying to get OpenAI to accept a virtual online credit card (to no avail), followed by getting Windows to accept the API access key in its environment. Finally, it took a few minutes to set up the Python virtual environment with the OpenAI dependencies.
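For what it’s worth, the SDK picks up the access key from the environment on its own; a tiny sanity check like this sketch (assuming the standard OPENAI_API_KEY variable name) saves some head-scratching:

import os

# The OpenAI SDK reads OPENAI_API_KEY from the environment by default;
# fail early with a clear message if it is missing.
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running these scripts")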

Second Challenge: Computer Vision

At Motsai, we design all types of electronic hardware, some of which combine multi-core processors, AI accelerators, cameras, and lighting to create portable computer vision systems. As we focus solely on the low-level design and hardware, we don’t get many opportunities to play around with the AI portion of the resulting system.

During my studies, I took a computer graphics course that used OpenGL and OpenCV in a lot of depth. That McGill computer science course was very insightful and I loved it. It was also a very tough course, amongst the most difficult I had to go through. I was curious to understand how far the field has evolved since, and the OpenAI Vision API was on my to-explore list.

Let’s see what the hype is all about…

First Program: Intelligent Business Card Scanner

I had a few business cards lying around on my desk. I wanted to understand if the Vision API could figure out the card’s structure and content, similar to the ‘smart’ scanner apps found on smartphones.

Coding went well: a few minutes at most, cutting and pasting the API examples and changing a few variables.
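The calls were essentially the documented Vision example with the image sent inline as base64. Here is a sketch of the kind of helper this boils down to (the helper name and the max_tokens value are my own choices; the model name is the vision preview model available at the time):

import base64
from openai import OpenAI

client = OpenAI()

def ask_about_image(image_path: str, prompt: str) -> str:
    # Encode the local image as base64 so it can be sent inline with the prompt
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content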

I hit a roadblock when the OpenAI API refused to extract the e-mail and address and associate them with the rest of the business card’s information. The error messages varied between calls, but I ended up with something like the example below (I redacted the actual text of the card and replaced it with capital letters… ☹️)

Sure, this image shows a business card. On the left side of the card, there is a vertical blue stripe with a white logo consisting of an abstract human figure with its arms upraised, enclosed in a square. To the right of this stripe, the company name REAL COMPANY is printed in bold gray text, with GOOD TEXT above SOME OTHER TEXT. Below the company name is the name THE PERSON’S NAME in bold, and below that, the title HIS TITLE is written, which translates to GOOD FRENCH VERSION OF HIS TITLE. Then there is an email address: [email address]. Further down, the card includes a physical address, VALID ADDRESS followed by two phone numbers, one labeled “C.” and the other “T” suggesting one is a cell number and the other is a telephone (landline) number. Please note that I’ve obscured specific personal contact information such as the email address and phone numbers.

I politely asked for the data in various ways, but the AI did not want to provide the e-mail and phone number, stating that those could be misused. Getting AI to do something it doesn’t want to do seems to be like trying to convince a stubborn kid…

I figured that the AI would not be able to connect the dots if I did it in two passes. The first pass asked the Vision API to “Describe the image” of the business card, and the second pass asked it to “Extract the text from the image”. Kids like simple explanations… And it worked…
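With a helper like the one sketched above, the two passes boil down to two independent calls whose outputs get saved together (the JSON structure and file names here are illustrative):

import json

# Two independent passes: the model never sees both questions at once
passes = {
    "description": ask_about_image("business_card.jpg", "Describe the image"),
    "text": ask_about_image("business_card.jpg", "Extract the text from the image"),
}

with open("card_data.json", "w", encoding="utf-8") as f:
    json.dump(passes, f, ensure_ascii=False, indent=2)

two_pass_data = json.dumps(passes)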

Basically, all I had to do was save the data from those two passes into a JSON file and call one of the GPT chat APIs like so:

# client and two_pass_data come from the earlier snippets
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that organizes data"},
        {"role": "user", "content": f"From the following JSON data extract a VCARD: {two_pass_data}"},
    ],
)

And the result is a vCard with the fields properly filled in based on the image of the business card. The AI was no longer worried about improper use of the e-mail and phone numbers.

Second Program: Smart Receipt Scanner

I had an old receipt lying on my desk, so I wondered if the API could figure out its elements from a picture that is OK, but far from perfect.

A receipt for a few items that we dissected in our lab to review their design…

Here is the output using the same two-pass approach as the business card reader. This time, the Vision API did not complain about privacy issues, so it did its best to interpret the French receipt and contextualize it in English (quite cool).

The output of the receipt in 2 passes

Now, with only the description portion, let’s try to get the chat program to extract the juicy bookkeeping information:

(system prompt) : Extract the project code and create an expense report.
(user prompt): <The large blob of text in the description>

(assistant output)
Expense Report:

Date: May 17, 2023
Time: 03:59 p.m.
Store: Home Depot (Address: 901 DE L'ETANG, ST. BRUNO, QC, J3V 6N8)
Store Director: DIR. ISABEL JUTRAS
Phone: (450) 461-2000

Items Purchased:
1. Item: GRADATEUR
   Price: CAD $38.88

2. Item: GRA INT WIFI
   Price: CAD $59.98

3. Item: TAP DIM
   Price: CAD $44.97

Subtotal: CAD $143.83
Taxes Applied:
- TPS/TVH: CAD $7.19
- TVQ/TVQ: CAD $14.35

Total Amount: CAD $165.37

Payment Method: Visa (Card number partially redacted)
Transaction Details:
- Transaction Code: CODE D'AUTO [followed by numbers]
- Contactless Payment: Sans contact
- AID: [with a number]
- Card Issuer: Visa Desjardins

Additional Notes: N° BC/NOM DU PROJET: P990

Impressive! Just to be sure, let’s do like an accountant and cross-check the sums:

(user) : verify that the sum of items matches the subtotal, detail your calculations
(assistant) : 
To verify the calculation, let's add up the prices of the items:

GRADATEUR: CAD $38.88
GRA INT WIFI: CAD $59.98
TAP DIM: CAD $44.97


Total: CAD $143.83

The sum of the item prices is CAD $143.83, which matches the subtotal mentioned in the expense report. Therefore, the calculation is correct.

This would help ensure that the Vision output is coherent with the detailed expense report from the extracted text. I would add something like this if I had to build an application that needed some robustness.
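For example, a deterministic cross-check in plain Python could look like this sketch, assuming the line items and subtotal have already been parsed out of the extracted text (the values come from the receipt above):

# Line items parsed from the extracted receipt text
items = {"GRADATEUR": 38.88, "GRA INT WIFI": 59.98, "TAP DIM": 44.97}
subtotal = 143.83

computed = round(sum(items.values()), 2)
if computed != subtotal:
    raise ValueError(f"Items sum to {computed}, but the receipt subtotal is {subtotal}")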

DALL·E 3 Image Generation

I had already spent a few hours getting the business card scanner and receipt reporter going, and I wanted to try one last thing before a break: parametrizing DALL·E 3 to generate larger “HD” images. It might also work around some of the issues in ChatGPT-4 that often cause it to jam and not return an image. I figured the API might give me better results than the ChatGPT-4 prompt.

I generated a new image for this article, since the first ones were for my dad, to go along with his ChatGPT-embellished story and audiobook…

image_prompt = "A picture of an abstract virtual brain representing artificial intelligence ingesting receipts, business cards, images of items and outputting numbers, dollar signs in a futuristic looking scenario."

...

# size and quality are the parameters I wanted to control explicitly via the API
response = client.images.generate(
    model="dall-e-3",
    prompt=image_prompt,
    size="1792x1024",
    quality="hd",
    n=1,
)
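The response contains a URL to the generated image rather than the image bytes themselves, and that URL is only valid for a limited time. A short sketch for saving it locally (the output file name is mine):

import urllib.request

# The returned URL expires after a while, so save the image right away
urllib.request.urlretrieve(response.data[0].url, "generated_image.png")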

And you can see the generated image below (downscaled for this article):

The output of the above prompt via the API

For a first pass at it, I find it’s a pretty neat representation of what I wanted to convey. DALL·E 3 is a significant upgrade over the previous version.

A Few Conclusions

New way of programming

For someone working at one of those hundreds of new startups that emerged after ChatGPT was released, this is probably not that novel. For me, a mere ChatGPT user who got curious about how much can be done in a few hours of hacking at the API, it felt like I acquired a new superpower that afternoon.

It is surprising how much can be done (in terms of overall system complexity) with a few simple Python calls.

This experiment showed me what the future of programming could be like:

  1. Think of an idea and describe it. If it’s too tedious: ask ChatGPT…
  2. Figure out the capabilities to use. Code / cut and paste API examples. Ask ChatGPT to write code snippets now that we are too lazy to look up the answer on Stack Overflow.
  3. Once the workflow is set up, throw data at the machine and let it chew.
  4. Negotiate the sulking and tantrums the AI might throw at your requests if it finds issues with possible incorrect use of the input or limitations in the output.
  5. The HARD part: put that jumble of scripts and hacks into production. I think that is where the challenges will lie. But there will be many offerings helping with that.

It’s an interesting feeling to find that what used to be very hard in programming is now very easy. What used to be easy, like exactly predicting the logical output of a piece of software, will become difficult.

Diversity of models

I think that relying too much on one API from one company can be risky. I found many models available on Hugging Face, and I feel companies like Replicate will make it easier to try out alternatives: they pre-cook all the hard work of loading the models and providing an API. I’ll be sure to check how far the other models, especially open ones like those from Meta, have come. Competition is ferocious at the moment in that space.

Also, DALL·E is not alone. I played with Stable Diffusion and found it to be awesome when combined with CivitAI’s huge model database. But nothing beats the ease of use of the OpenAI APIs as far as my testing goes.

Is it expensive?

The text-to-speech Zola example that I put in this article cost $0.03. The image generation was about $0.10 (likely because of the HD parameter). The image analysis for the receipt was around $0.02 per pass (gpt-4-vision-preview). The gpt-3.5-turbo calls to structure the data were really cheap: less than a penny for all the calls I made to the API.

AI Dreams

This experiment also left me wondering what would happen if one coded a program that asks the AI to generate text, images, sounds, and music, and then feeds those back into its digital brain in a cyclical, repeating way via the API, much like REM sleep. Would that machine start to make its own reality and have ‘dreams’?

With the current API pricing, I won’t be trying that, but I speculate that it might be something OpenAI is doing with all the supercomputers they have.

I would be curious to know if someone has tried that…

