Re: GPU for audio processing. It isn't "around the corner" nor ever coming, I'm afraid, as the G stands for Graphics. The dispatching of data to a GPU occurs in batches (high throughput but high and indeterminate latency, the latter being a deal-breaker for audio) and is significantly faster in one direction than the other (you have to mmap your GPU memory to write back to the CPU, and even with the fastest Open MPI techniques it's not really a bounded-latency thing you can count on; you can only read/write in complete pages, and your data alignment will not make efficient use of this).
TL;DR: GPUs handle big chunks of data, typically in a one-way direction, before writing to the memory holding the 2-dimensional screen pixel data every 1/frame-rate seconds, and that actually drives the screen with pictures. It's an output device, not typically an I/O device, and its hardware is specialized for its main tasks in a highly competitive market.
If you want an APU, they exist, but typically not on silicon (anymore) due to market forces and software.
A DSP or FPGA is suitable hardware for doing audio processing, and it's a matter of writing an I/O driver (trivial, LOL...), which is what your UAD and Waves SoundGrid processing use, respectively. While a DSP will generically execute code, you also get things like lots of Multiply-And-Accumulate (MAC) instructions for filtering operations, and data widths that support double-precision floating point, etc. (even old-school TMS320-type stuff)
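To make the MAC thing concrete, here's a minimal sketch in plain C of the kind of inner loop a DSP eats for breakfast: a direct-form FIR filter, one multiply-and-accumulate per tap per output sample. (Names like fir_process and NUM_TAPS are just mine for illustration, not from any DSP SDK.)

    #include <stddef.h>

    #define NUM_TAPS 64

    /* y[n] = sum_k coeffs[k] * x[n - k]: one multiply-and-accumulate
       per tap per output sample. On a DSP (TMS320 and friends) each
       MAC is a single-cycle instruction; a CPU grinds through it. */
    void fir_process(const float *x, float *y, size_t n,
                     const float coeffs[NUM_TAPS])
    {
        /* start where a full history of NUM_TAPS inputs exists */
        for (size_t i = NUM_TAPS - 1; i < n; ++i) {
            float acc = 0.0f;
            for (size_t k = 0; k < NUM_TAPS; ++k)
                acc += coeffs[k] * x[i - k];   /* the MAC */
            y[i] = acc;
        }
    }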
So, your computer probably already has DSP-style capability baked in: special CPU instructions you use instead of plain scalar code for certain operations.
(on old-school PCs it was SSE, and on ARM it's NEON)
SIMD is "single instruction, multiple data" for doing the same operation on lots of data in parallel, "vector extensions" are also a means of processing vectors or arrays, etc.
So, the issue here is that a DAW or VST instrument designer will be loath to use anything that's not standard, universal, old, and boring, versus limiting their customers to specific devices. Performance will vary across implementations, and profiling shows this, etc. (it must be an issue for all the macOS devs having to rewrite parts of their plugins, since even reasonable emulation breaks on specific algorithms, etc., now that they've moved from x86 to ARM all the way)
Digital audio requires a precisely clocked stream of sample values for both accurate recording and playback (hence the market in studio master clock sources and the desire to avoid clock skew between systems, etc. Think of a bad playback clock as a realtime pitch-shifting algorithm: a 44.1 kHz recording played back on a clock running at 44.2 kHz comes out roughly 0.23% sharp, about 4 cents)
A typical audio driver arranges for the computer's DMA engine to repetitively transfer data from the audio chip's serial audio interface to RAM, and/or vice versa for playback.
It then raises an interrupt informing the computer that a buffer-full of audio is ready or needed.
Your DAW will typically register its callback function with the driver, and this callback function gets called to supply or offload a buffer full of sample values.
(buffer as in the setting you set, like 128 samples (2.9 ms at 44.1 kHz), or 256, or 1024, or whatever; it's directly related to latency.)
Your VST/AU plugin code will have either a "generate a buffer full of samples" function or a "process a buffer full of samples" function, and it does its primary work on this "thread" of execution. If you drill down through it all, it's just a wrapper for that same DMA interrupt saying "hey folks, gimme or take, like NOW, not later, or you will have a glitch," in some sense...
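If it helps, a rough sketch of the shape of that "process a buffer full of samples" function (hypothetical names, not any specific plugin or driver API): the host calls it once per buffer, on the audio thread, and it has to return before the next DMA interrupt fires.

    #include <stddef.h>

    #define BUFFER_FRAMES 128   /* ~2.9 ms at 44.1 kHz */

    /* Called by the host once per buffer. Finish before the next
       DMA interrupt or the listener gets a glitch. */
    void process_block(const float *in, float *out, size_t frames)
    {
        for (size_t i = 0; i < frames; ++i)
            out[i] = in[i] * 0.5f;   /* trivial per-sample work: -6 dB */
    }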
Sorry for the random lecture, my mind wandered.... this vaporizer works, I see...
> Re: GPU for audio processing. It isn't "around the corner" nor ever coming, I'm afraid, as the G stands for Graphics. The dispatching of data to a GPU occurs in batches (high throughput but high and indeterminate latency, the latter being a deal-breaker for audio) and is significantly faster in one direction than the other (you have to mmap your GPU memory to write back to the CPU, and even with the fastest Open MPI techniques it's not really a bounded-latency thing you can count on; you can only read/write in complete pages, and your data alignment will not make efficient use of this).
> TL;DR: GPUs handle big chunks of data, typically in a one-way direction, before writing to the memory holding the 2-dimensional screen pixel data every 1/frame-rate seconds, and that actually drives the screen with pictures. It's an output device, not typically an I/O device, and its hardware is specialized for its main tasks in a highly competitive market.
That's just not true....

Real-Time Noise Suppression Using Deep Learning | NVIDIA Developer Blog (developer.nvidia.com)
Much more is to come. The sheer amount of power available in a GPU is ridiculous compared to the CPU (as they have the advantage of not having to be general purpose).
> If you want an APU, they exist, but typically not on silicon (anymore) due to market forces and software.
APUs are just GPUs physically attached to the main CPU... it's a bit cheaper to swap data between the CPU and APU than CPU to GPU because they share the same memory, but it's much slower than VRAM. So, horses for courses.
> A DSP or FPGA is suitable hardware for doing audio processing, and it's a matter of writing an I/O driver (trivial, LOL...), which is what your UAD and Waves SoundGrid processing use, respectively. While a DSP will generically execute code, you also get things like lots of Multiply-And-Accumulate (MAC) instructions for filtering operations, and data widths that support double-precision floating point, etc. (even old-school TMS320-type stuff)
FPGAs are just not the sort of things you'll find on commodity hardware... nor will you find much software support.
PCs don't tend to have DSPs, as the SIMD instructions, as you point out below, contain most of the functionality.
> So, your computer probably already has DSP-style capability baked in: special CPU instructions you use instead of plain scalar code for certain operations.
> (on old-school PCs it was SSE, and on ARM it's NEON)
> SIMD is "single instruction, multiple data" for doing the same operation on lots of data in parallel; "vector extensions" are also a means of processing vectors or arrays, etc.
> So, the issue here is that a DAW or VST instrument designer will be loath to use anything that's not standard, universal, old, and boring, versus limiting their customers to specific devices. Performance will vary across implementations, and profiling shows this, etc. (it must be an issue for all the macOS devs having to rewrite parts of their plugins, since even reasonable emulation breaks on specific algorithms, etc., now that they've moved from x86 to ARM all the way)
These are strictly just DSP-like instructions... Qualcomm, for example, includes a separate DSP core with their Snapdragon processors, which all support NEON.
Almost all floating-point processing is done by SSE these days... just sometimes (often, indeed) they only use the scalar instructions and don't take advantage of the parallel ones (denoted in the assembly as instructions ending in ss or ps).
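A tiny sketch of that scalar-vs-packed distinction, using real SSE intrinsics (the function names are just mine): the scalar form maps to mulss and touches one float, the packed form maps to mulps and does four multiplies in one go.

    #include <xmmintrin.h>

    float scalar_mul(float a, float b)
    {
        /* compiles down to a single mulss (scalar single) */
        __m128 r = _mm_mul_ss(_mm_set_ss(a), _mm_set_ss(b));
        return _mm_cvtss_f32(r);
    }

    void packed_mul(const float a[4], const float b[4], float out[4])
    {
        /* one mulps (packed single) does all four multiplies */
        _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }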