Ever wondered how well ChatGPT can interpret drawings? I recently put it to the test in a captivating side project that combines Unreal Engine 5, the GPT-4 Vision API, and the Spotify API. Let’s break it down into the steps that bring this experience to life.
This project is a collaborative effort involving a Python script, OBS, and Unreal Engine. OBS captures the webcam input, aligns the frame, and outputs it as a new feed. The Python script then saves each frame to disk in real time, and Unreal Engine reads that file as a texture. When you press a button in Unreal Engine, a small window opens showing the webcam feed. You draw your artwork and scan it; on scan, Unreal Engine retrieves the current frame from disk and passes it to GPT-4 Vision via Blueprints.
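The article doesn't show the Python script itself, but one detail worth sketching is how the frame can be saved so that Unreal Engine, which keeps reading the same file as a texture, never picks up a half-written image. A minimal sketch using only the standard library (the function name and paths are my illustration, not the project's actual code):

```python
import os
import tempfile

def save_frame_atomically(frame_bytes: bytes, target_path: str) -> None:
    """Write the newest webcam frame so a concurrent reader (here:
    Unreal Engine polling this file as a texture) never sees a
    half-written image. Write to a temp file in the same directory,
    then swap it in with os.replace, which replaces the old frame
    in a single atomic step on both POSIX and Windows."""
    directory = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".jpg")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(frame_bytes)
        os.replace(tmp_path, target_path)  # atomic rename over the old frame
    except Exception:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise
```

The temp file lives in the same directory as the target because an atomic rename is only guaranteed within one filesystem.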

ChatGPT reads the image, finds a matching popular song, and provides a short description of why it fits the drawing. The song information is sent back to the Python script, which queries the Spotify API for a 30-second audio sample and cover art. Both are imported into Unreal Engine, where the audio sample fades in, the cover art adorns a spinning record, and the description is placed in 3D space. The scene’s lighting is influenced by the cover art, adding a subtle colored glow.
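As a sketch of the Spotify step: the dict shape below matches the Web API's `/v1/search` response, from which the 30-second preview URL and the cover art URL can be pulled. The helper is my illustration; authentication and the actual HTTP call are omitted:

```python
def extract_sample_and_art(search_response: dict) -> tuple:
    """From a Spotify /v1/search response, return (preview_url, cover_url)
    for the first matching track, or (None, None) if nothing usable
    came back."""
    items = search_response.get("tracks", {}).get("items", [])
    if not items:
        return (None, None)
    track = items[0]
    preview_url = track.get("preview_url")  # 30-second MP3; can be null
    images = track.get("album", {}).get("images", [])
    # Pick the largest cover image so the spinning record stays sharp.
    cover_url = (max(images, key=lambda i: i.get("width") or 0)["url"]
                 if images else None)
    return (preview_url, cover_url)
```

Note that `preview_url` being null for some tracks is exactly the failure case discussed below, so the caller should handle `(None, None)` gracefully.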

While this prototype runs smoothly, it’s not without challenges. Take the Spotify API: it’s a great API with loads of interesting features, like returning a song’s genre, a popularity score, a loudness score, the BPM, or how much of a track consists of spoken words. Still, it sometimes struggles with certain requests, possibly due to unavailable audio samples or missing cover art for certain tracks.
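For reference, the loudness, BPM, and spoken-word features mentioned above come from the Web API's `/v1/audio-features` endpoint as `loudness` (dB), `tempo`, and `speechiness` (0–1), while popularity sits on the track object and genres on the artist. A tiny sketch of reading those fields from such a response (the helper is mine, and the sample values are made up for illustration):

```python
def summarize_audio_features(features: dict) -> str:
    """Format a few fields from Spotify's /v1/audio-features response
    into a readable one-liner."""
    return (f"{features['tempo']:.0f} BPM, "
            f"{features['loudness']:.1f} dB, "
            f"speechiness {features['speechiness']:.2f}")

# Example with made-up values:
print(summarize_audio_features({"tempo": 117.0, "loudness": -5.9, "speechiness": 0.04}))
# prints: 117 BPM, -5.9 dB, speechiness 0.04
```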
GPT-4 Vision, while excellent with simple sketches, can struggle with more complex images or objects outside its scope, such as selfies or images containing people’s faces. Additionally, its processing isn’t instantaneous: scanning an image takes a couple of seconds, though that’s still fairly quick in my experience.
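For context on the scanning step: a vision request embeds the saved frame as a base64 data URL inside the chat messages. A minimal sketch following OpenAI's documented message shape (the prompt text and function name are my placeholders; the actual `chat.completions.create` call is omitted):

```python
import base64

def build_vision_messages(image_path: str, prompt: str) -> list:
    """Build the messages payload for a GPT-4 Vision chat completion,
    embedding the scanned drawing as a base64-encoded data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]
```

Because the whole image travels inside the request, larger frames mean longer round trips, which is part of why the scan takes a couple of seconds.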

What’s next? I’m eager to experiment more with GPT-4 Vision and the Spotify API. I’m considering merging this concept with a previous Unreal Engine project that reacts to live audio, creating synchronized visuals. It would be like a virtual DJ and VJ in one: it can see, find matching tracks, play them, and create stunning visuals.
If you have any questions about this project, don’t hesitate to reach out at nils@nilsbakker.nl. I’d be more than happy to share more insights!