
Riffusion – Stable Diffusion fine-tuned to generate music

source link: https://news.ycombinator.com/item?id=33999162


Other author here! This got posted a little earlier than we intended so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!

Meanwhile, please read our about page http://riffusion.com/about

It’s all open source and the code lives at https://github.com/hmartiro/riffusion-app --> if you have a GPU you can run it yourself

This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
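For anyone who wants to poke at the latent-space-interpolation idea mentioned above, here is a minimal sketch using the diffusers library; the checkpoint id, prompt, and seeds are placeholders rather than the app's exact pipeline (which additionally layers in img2img conditioning):

```python
# Hypothetical sketch: smoothly interpolate between two starting latents so that
# successive spectrogram images (and hence audio clips) morph into each other.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

def slerp(a, b, t):
    """Spherical interpolation between two latent tensors."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

shape = (1, pipe.unet.config.in_channels, 64, 64)  # 512x512 image -> 64x64 latent
gen = torch.Generator("cuda")
lat_a = torch.randn(shape, generator=gen.manual_seed(0), device="cuda", dtype=torch.float16)
lat_b = torch.randn(shape, generator=gen.manual_seed(1), device="cuda", dtype=torch.float16)

for i, t in enumerate(torch.linspace(0, 1, 5)):
    image = pipe("funk bassline with a jazzy saxophone solo",  # example prompt
                 latents=slerp(lat_a, lat_b, float(t))).images[0]
    image.save(f"spectrogram_{i}.png")  # each PNG is a spectrogram to convert to audio
```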

Wow, I am blown away. Some of these clips are really good! I love the Arabic Gospel one. John and George would have loved this so much. And the fact that you can make things that sound good by going through visual space feels to me like the discovery of a Deep Truth, one that goes beyond even the Fourier transform because it somehow connects the aesthetics of the two domains.
I can simultaneously burst a bubble and provide fuel for more -- the alignment of the intrinsic manifolds of different domains has been an interesting research topic in zero-shot learning for a few years. I remember seeing at CVPR 2018 the first zero-shot...classifier, I think? Which, if I recall correctly, trained on two domains that were automatically aligned with each other well enough to provide very good zero-shot accuracy.

Calling it a Deep Truth might be a bit of an emotional marketing spin but the concept is very exciting nonetheless I believe.

It is a Deep Truth in that the universe is predictable and can be represented (at least the parts we interact with) mathematically. Matrix algebra is a helluva drug. I could imagine someone developing the ability to listen to spectrograms by looking at them.
There is a whole piece in Gödel, Escher, Bach where they look at vinyl records, as all the sound data is in there.
I can't listen to them, but I can certainly point out different instruments, background noise sources and the like, and get an idea of the tone of a piece. This is easy. The hard part is distilling texture, timbre etc. of each sound.
Well it's no surprise that it kinda sorta works. Neural networks are very good at learning the underlying structure of things and working with suboptimally represented inputs. But if working with images of spectrograms works better than just samples in time domain, that is a valid and non-obvious finding.
My characterization of it as a Deep Truth might just be a reflection of my ignorance of the current state of the art in AI. But it's still pretty frickin' cool nonetheless.
Alright so this is a pretty amazing new development. I want to tell you something about what the state of the art is in AI. When you wrote that it is a deep truth it was before I actually listened to the pieces. I had just read the descriptions. At the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains; it's not like vision where originally we are interpreting the world and that's where art comes from. Music is purely sort of abstract or artistic. However, after I listened to the pieces, I realised that they really sound exactly like the instruments that are making the physical noises. For example it really sounds exactly like a physical piano. So I don't know about a deep truth, but it does seem that there is a physical sense that the music represents which it can successfully mimic using this essentially image generating capability. One thing about all of these amazing AI developments is that I still make some long comments by dictating to Google. When it first got to the point that it was able to catch almost everything that I was saying I was absolutely blown away. However, it's really not that good at taking dictation, and I have to go back and replace each and every individual comma and period with the corresponding punctuation mark. Seeing such amazing developments happening month after month, year after year makes me feel like we are really approaching what some people have called the singularity. When I read about net positive fusion being announced my first instinct was to think oh of course, it's now that ChatGPT is available, of course announcing a major fusion breakthrough would happen within days to weeks; it just makes perfect sense that AIs can solve problems that have confounded scientists for decades. To see just how far we still have to go take a look at how this comment read before I manually corrected it to what I had actually said.

-- [I copied and pasted the below to the above and then corrected it. Below is the original version. This is how I dictate to Google sometimes, on Android. Normally I would have further edited the above but in this case I wanted to show how far basic things like dictation still have to go. By the way I dictated in a completely quiet room. I can't wait for more advanced AI like ChatGPT to take my dictation.]

Alright so this is a pretty amazing our new development period I want to tell you something about out why the state of the heart is is in a i period when you wrote that it is a deep truth it was before I actually listen to The Pieces, I have just read the descriptions period at the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains it's not like vision where originally we are interpreting the world and that Where Art comes from music is purely so dove abstract or artistic period however, after I listen to the pieces, I realise that they really sound exactly like the instruments that are making the physical noises period for example it really sounds exactly like a physical piano period so I don't know about out a deep truth karma but it does seem that there is a physical sense that the music are represents which it can successfully mimic using this essentially image generating capability period one thing about all of these amazing AI development, is that I still make some long comments by dictating to Google. When it first got to the point that it was able to catch almost everything then was saying I was absolutely blown away period however, it's really not that good at taking dictation karma and I have to go back and replace each and every individual, and period with with the corresponding punctuation mark period seeing such an amazing developments happening month after month year after year ear makes me feel like we are really approaching what some people have called the singularity period when I read about out net positive fusion being announced my first Instinct was to think oh of course it's now that that chat GPT is available of course announcing a major fusion breakthrough would happen within in days to weeks it just makes perfect sense DJ eyes can solve problems that have have confounded scientists for decades period to see just how far we still have to go take a look at how this comment red before I manually corrected it to what I had actually set

As one of the meatsacks whose job you're about to kill... eh, I got nothin, it's damn impressive. It's gonna hit electronic music like a nuclear bomb, I'd wager.
As a listener, I think you're probably still safe. Can you use this to help you though? Maybe.

It's impressive what it produces, but I think it probably lacks substance in the same way the visual AI art stuff does. For the most part, it passes what I call the at-a-glanceness test. It's little better than apophenia (the same thing that makes you see shapes in clouds, faces in rocks, or think you've recognised a familiar word in a foreign language; the last one can happen more often though).

So, I think these tools will be used to do background work (i.e. for visuals, maybe helping with background tasks in CGI or faraway textures in games). I know less about audio, but I assume it could maybe help a DJ create a transition between two segments they want to combine, as opposed to making the whole composition for them, but idk if that example makes sense.

Now, onto a more human point: I think that people often listen to music because it means something to them. Similar for people who appreciate visual art.

I also love interactive and light art, and I love talking to other artists at light festivals who make them because of the stories and journeys behind their art too. Humans and art are a package deal, IMO.

Edit: typos and to add: Also, I think prompt authorship is an art unto itself. I'm amazed what people can craft with it, but I'm more impressed by the craft itself than the outputs. Don't get me wrong, the outputs are darn cool, but not if you look closer. And it's impossible to look beneath the surface altogether, as there is nothing in the output but the pixels.

I think this type of generative stuff opens up entirely new possibilities. For the longest time I've wanted to host a rowing or treadmill competition, where contestants submit a music track. The tracks are mashed up with weighting based on who is in the lead and by how much.

I don't know of existing tech that can generate actual good mashups in realtime given arbitrary mp3s, but this has promise!

It's not too hard these days with open source BPM detection and stem separation libraries: https://github.com/deezer/spleeter
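If anyone wants to try the stem-separation half of that, Spleeter's documented two-stem model is only a few lines (the file paths here are placeholders):

```python
# Split a track into vocals + accompaniment with Spleeter's pretrained 2-stem model.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")
# Writes stems/track/vocals.wav and stems/track/accompaniment.wav
separator.separate_to_file("track.mp3", "stems/")
```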
No, because it is a function ("AI") that generates an image of a spectrogram given text.

Neither a set of MP3s nor a set of spectrograms from MP3s supplies the function arguments,

nor a connection to a path that uses that function.

It says all StableDiffusion capabilities work, so you can prompt it with an image (either "img2img" or "textual inversion"). Their UI just doesn't expose it.
In general all this stuff is chopping the bottom off the market. AI art, code, writing, music, etc. can all generate passable "filler" content, which will decimate all human employment generating same.

I don't think this stuff is a threat to genuinely innovative, thoughtful, meaningful work, but that's the top of the market.

That being said the bottom of the market is how a lot of artists make their living, so this is going to deeply impact all forms of art as a profession. It might soon impact programming too because while GPT-type systems can't do advanced high level reasoning they will chop the bottom off the market and create a glut of employees that will drive wages down across the board.

Basic income or revolution. That's going to be our choice.

> Basic income or revolution. That's going to be our choice.

Evolution.

We have such vast wealth and our historic methods for trying to make sure most people are taken care of are failing us. Those methods were rooted in the nuclear family with a head of household earning most of the money and jobs designed with an assumption that he had a full-time homemaker wife buying the groceries, cooking the meals etc so he could focus on his job.

We need jobs to evolve. In the US at least, we need to move away from tying all benefits (such as medical benefits and retirement) to a primary earner. We need to make it possible to live a comfortable life without a vehicle. We need to make it possible for small households to find small homes that make sense for them, both financially and in terms of lifestyle.

There is a lot we can do to make this not a disaster and make it possible for some people to survive on very little while pursuing their bliss so that we stop this trend of pitting The Haves against The Have Nots and make the current Have Nots a group that has real hope of creating their own brilliant tech or such someday while not being utterly miserable if they aren't currently wealthy.

Those making the decisions can very well just say "WE and the 10-20% still needed just need to live comfortably, and the remaining 80% can live in slums on the edge of town".
That sounds like the "revolution" option.
Sadly, if we look at human history, it usually resolves to that.
But not always. And education and communication are some of the forces that can help avoid that.

Knowledge is power.

education is controlled by government and communication is controlled by corporations
> Basic income or revolution. That's going to be our choice.

So many menial jobs are kind of like basic income anyway - you put in 2 hours of actual work to pad out the entire day at some shitty low end job, knowing all the time that your contribution isn't valued and that if your employer ever got their shit together your job wouldn't even be needed, and the robots are coming for it anyway. You get paid a small amount for doing nothing much useful.

The rich today are rich largely because they or their ancestors were plunderers. Perhaps they plundered the planet, exploiting the cheap energy that fossil fuels provide. Perhaps they plundered our social cohesion building skinner boxes that manipulate the minds of millions just to gain eyeballs and clicks.

Why should the bill for these past excesses fall on those who never benefited from them? In previous times, a young person of average intellect could get a job on a farm or factory and be a valued contributor. What happens when automation removes the last of these jobs - do we really expect people to put up with more and more menial and slavish existences?

Basic income is, like carbon taxes, an obvious solution. Maybe it will take off when a tipping point arrives - when the rich class decides that their repugnance to giving someone a "free ride" is overtaken by their need to have the masses dulled and stupefied, sitting at home with blinds drawn in front of their playstations, so they don't revolt at the obvious unfairness of the world.

Paying people to stay home and play on social media led to the mass explosion of conspiracy theories at the beginning of covid. Ultimately, unemployment and covid checks go a long way toward explaining the January 6th insurrection. Which is just to say that giving a free ride and narcotic forms of entertainment to the masses isn't necessarily the safety valve for "the rich" that it's made out to be.

Work gives people dignity. And idle hands are the devil's plaything. Put that together and UBI would be a disaster. Also, it's not "the rich class" who decides whether or not to bestow such a lifestyle on the masses... that in itself is a conspiratorial line of thought. Right down that road is the thought "hey, this UBI isn't enough!"

Jobs disappear. Other jobs replace them. Often, jobs are not fun, and often they feel meaningless, but working is still much more dignified than not working. Raising generations who've never worked and simply take their UBI and breed - what would even be the point of educating such people? Eventually they'd just be totally disposable and, no doubt, be disposed of.

Plunderers; well put. Capitalists can't lie their way to infinite growth forecasts and suck all the wealth into 401ks that do nothing but rob everyone else's grandchildren. It's a cycle that has been going on since existence itself, an ebb and flow that accelerates, crashes, and takes off again, leaving in its wake humanity as we know it.
Chopping the bottom off makes things higher up the ladder more accessible though. The original Zelda took six people multiple years to build, but one person could develop something similar but much better looking in a few weeks with Unity and AI generated assets. It obviously won't be a AAA title, but people have shown that they're happy to play slightly rough, retro games if they're fun. All this holds true for writing, music, art and other areas as well.

The big problem is that it's hard to filter through the huge amount of content being produced by humans to find things that you'll like, so we rely on kingmakers curating the culture. This means a few huge winners taking all and a lot of great creative work at the same level going unrewarded. If we can solve the content discovery problem in a more personalized and fair way and make it easier for people to support creators they like that would go a long way towards cushioning the job losses that AI will create.

> Basic income or revolution. That's going to be our choice.

Third option. Mandatory 4 day weeks.

Although I'd specify it as no one can work more than X hours a week.

And then adjust X down - or up in short timescales but likely down overall - as needed.

The competition is for "work". If AI is taking large chunks of "work" off the table. Spread the rest of it around.

Now notionally people will tell you that there is no finite "work" limit. You are effectively limiting competition.

To which I say - good. The rat race IS the competition. Don't we all want to slow it down a little? If F1 can put limits on a race, we should too for humanity.

Work smarter, not harder.

The only thing that affects whether you have a job is the Federal Reserve, not how good productivity tools are. You always have comparative advantage vs an AI, so you always have the qualifications for an entry level job.

There will never be a revolution and there's no such thing as late capitalism. Well, not if the Fed does their job.

I see a lot of AI naysayers neglecting the comparative advantage part.

If AI completely eliminates low skill art labour from the job pool, it's not like those affected by it are gonna disintegrate, riot, and restructure society. They have the choice of filling an art niche an AI can't or they can spend that time learning other, more in-demand skills. This also ignores the fact that some companies would rather reallocate you to more profitable projects even if your art skills don't change.

Selling a product with relative value like a painting or a sculpture will always be an uphill battle. Now that there's more competition from AI, it just gives artists/businesses incentive to find what people want that an AI can't deliver. Worst case scenario, employment rates in this sector are rough while the market recalibrates. Interested to see how these technologies develop.

That seems a bit like wishful thinking.

People don't have unlimited ability to learn new skills. Training takes time, and someone who spent several years honing their craft won't be able to pick up a new skill overnight.

On top of that, people have preferences regarding their work – even if someone has the ability to do different work, they might find it less meaningful and less satisfying.

Finally, don't ignore the speed at which AI capabilities improve. Compare GPT-1 with the current model, and how quickly we got here. Eventually we'll get to a point where humans just won't be able to catch up quickly enough.

Agree 100%. When I was young and idealistic I believed in the "learn new skills" mantra, but learning completely new skills would look a lot different at 50 than it did at 20. When career choices were being made 30 years ago it would have been hard to predict the current & upcoming AI-driven destruction of lower-end "thinking" jobs. Attempting to retrain after ~30 years puts you at a massive disadvantage vs a new graduate (I mentor some of our company's graduates & trainees & I've been assigned a guy in his mid-40s; after a few months I just don't see how he'll get to a point where he's adding value). Not really a personal whinge, as my skillset isn't under immediate threat from any AI I've heard about, though the rate of change in the field is something to behold.
I agree that intentional retraining doesn’t really work, but I don’t think it matters. As I said, all that matters for whether you have a job is the Federal Reserve. If you hire random people to do a computer job, some of them will just turn out to magically learn everything on the job.
I think specifically in the area of creative "products" such as art and music you have to think about the customer as well. I have zero interest in AI-created art or music. None. The value of art is its humanity; its expression of the artist's message, vision, and passion. AI doesn't have that, so it's not of any interest to me.

I don't know how many customers feel the same way, but I won't be purchasing any AI art or music or knowingly giving it any of my attention.

The AI is a tool the human used to make it. Sometimes clumsily, but sometimes they write poems as text prompts and it's an illustration, or things like that. If an AI is making and selling art by itself, it has probably become sentient, and not patronizing it would be speciesism.

Although personally, I think using "AI art" to create impossible photographs is more interesting and doesn't compete with illustrators as much.

I think it's an interesting perspective but I will be very surprised if it's one that is common when this becomes more of a real choice. If there's two mp3s and one of them is more enjoyable to listen to, very few people will stick with the song they enjoy less because it's not AI generated.

Maybe a parallel would be furniture; there are people who buy hand crafted furniture but it's kind of a luxury. Most people just have Ikea and wouldn't pay more for the same (or have less good furniture) just to get some artisanal dinner table chair.

How would you even know? Vast majority of art you don't purchase directly, but as a part of some product. At most you get a line in the credits and what's stopping anyone from inventing a pseudonym for AI.
> I have zero interest in AI-created art or music.

I'm afraid that in the near future we will all be bombarded with AI-created music, art and text whether we want it or not.

The top of the market started at the bottom. Entry level is requiring higher and higher skills and capabilities.
> basic income or revolution

I’ve been trying to play through the scenario in my head. At least in terms of software developers being replaced by AI, I think we’re going to first see AI doing work in parallel with, or monitored by, humans. Basically, Google will take AI and send it off to do work that they lack the staff to do. Now, on the other hand, they could also play it out over time, where first they feign an inability to staff people due to finances so there are layoffs/terminations, and then maybe a quarter later they replace those people with low cost AI compute time that is orders of magnitude more productive.

In any case, AI disrupting people’s ability to feed, shelter, and clothe themselves is sure to trigger a pretty brutal and hostile response, which would be grounds for legislation and perhaps a class war.

The weird part is that if the potential of AI is truly orders of magnitude expansion beyond what we already have, then the longterm surely has room for a tiny little mankind fief. But, in order to get to the long term our hyper-competitive technocratic overlords may strangle out part of or all of the rest of us while justifying accelerating through the near-term window to achieve AI-dominance.

If the only people who can have meaningful good paying jobs are thoughtful geniuses we're in a lot of trouble as a society still.
> Basic income or revolution. That's going to be our choice.

I fear you are right. But neither of those is going to be an easy transition, if only because the effects of all this innovation are felt disproportionately by people in countries where such a revolution will not do anything to give solace.

Basic income assumes that the funds to do this are available and revolution assumes that the powers that be are the parties that are in the way of a more equitable division of the spoils. Neither of those are necessarily true for all locations affected.

> Basic income or revolution. That's going to be our choice.

Basic income sounds good in theory in some imaginary futuristic society of harmony and grace.

In real life, it's a way for the masses to be controlled down to their very subsistence by the state. Where the state is basically an intermediary for big private interests and lobbies.

It gets better very quickly and we have no idea where its limitations are. In other words, we have no idea when the development will slow down significantly and how much of the bottom it will have chopped off by then, whether that's 10% or maybe 100%.

> Basic income or revolution. That's going to be our choice.

I'm definitely pro basic income, but I heard an interesting remark a few weeks ago. And that's that COVID was kind of a UBI experiment (in the US), albeit very limited, and it turned out that if people don't have to worry about making a living and don't have a job to work in then they'll start doing stupid things on the internet. Like make up stupid conspiracy theories about vaccines. I can't remember who said this; it was one of the guests on Lex Fridman's podcast. I'm also not sure if it's a valid analogy but it reminds me of Vonnegut's Player Piano.

I think this is a good point. To make this useful for music creators, and to make music creation more generally accessible, the output needs to be more useful. We are working on that at https://neptunely.com
As a musician and listener I'm inclined to agree. There were a couple of cool examples I bumped into, but some prompts generate results that don't represent any single word or combination of words that were presented to the AI.

What this means for the future is maybe a little more unsettling however.

I fully agree with what you wrote. This AI-generated music, while a great achievement, still sounds soulless. It's one thing to look at AI-generated pictures for a few seconds, but listening to this music with its gibberish "lyrics" for minutes really creeps me out - it's the "uncanny valley" all over again, I guess.

Regarding "can you use this to help you through?" - yeah, you could probably use it as a source of inspiration, but at the risk of getting sued by someone whos music you didn't even know you were copying...

Yea, it's the uncanny valley, sure. For now.

With Stable Diffusion and similar generative systems we have seen a leap in generative art/media, with significant improvements arriving within just a few months. What makes you think this was the last or only leap in the next 5 to 10 years? As if progress would just stop here? Huh?!

Do you think we hit a ceiling where progress is only tangential? A line which is impossible to cross? Otherwise I don't get this mindset in the face of these modern generative AI systems popping up left and right.

Potentially it could be used as temporary atmospheric music for pre-viz video shots.
These are tools. Don't think of them as replacements, they aren't. But as tools that will help us be creative. As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us but we need to adapt to a new reality.
I hear this a lot (in relation to various jobs) and I still don't get it. Yes, it is a tool. Yes, if it can, it will replace humans. That's the whole point.

For some reason people tend to think that these tools/AI/ML systems will never be good enough to do their job (or a specific job). This argument can take different forms, sometimes stating that it will just do the boring part of the work (e.g. with programming) or that it will still need human creativity (maybe, but not necessarily and that's not the point) or that it will just replace low level, unskilled or mediocre professionals. And somehow everyone thinks they are not mediocre (i.e. average). But even these assumptions are unfounded. Why would anyone think that these systems will top out below their skill levels? Why would anyone think that they can't become superhuman?

They did in chess, go, and I think poker too. Not to mention protein folding. And without much of a hitch between mediocre/good enough and superhuman. Because that difference is just interesting for us, but doesn't necessarily mean that there is a huge step, that the system needs to undergo serious development or that it would take a long time. (Like decades or so.) People thought that was the case when AlphaGo beat Fan Hui, saying that Lee Sedol was on a completely different level. Which, of course, he is. Still, it just took DeepMind half a year to improve AlphaGo to that level.

So yeah, you can be pretty sure that if this track (no pun intended), this solution, is good enough then it will quickly evolve into something that will replace some music creators.

> As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us

Well, they will, if the AI plus 1 human deciding "where and how to use them" can replace producers and musicians playing...

It already is a replacement. You can make a visual novel video game with AI generated character art, backgrounds, music, run your dialogs through AI if you can't write well yourself - and your game will have higher production values than 90% of the competition. All those artists you would normally hire or commission the above stuff from are now out of the process if you want. Sure, it's not a particularly high bar, but it's only going to rise from here.
Kurt Vonnegut and Player Piano has a message for you.
These will be full replacements in no time, give or take 10 years.
10 years‽ StableDiffusion was released August 22nd of this year!
I love simple generative approaches to get ideas, and go from there. This seems like an extension of that (well, it's what I'm going to try - sample the output, make stems, pull MIDI etc). Will make the creative process more interesting for me, not less.

Having said that, it's not my job, and I can see where the issues lie there.

I can't think of a genre that would embrace it faster. The pay-for-knock-off rap beat market will feel more pressure from this kind of tool, especially as loop-oriented as it already is.
Why do you think this will kill your job? To me this looks like an extension of the hip-hop genre.
All the AI music I’ve heard so far has a really unpleasant resonant quality to it. Why is that? Can it be removed?
I've done some work on AI audio synthesis and the artifacts you're hearing in these clips are coming from the algorithm that is used to go from the synthesized spectrogram to the audio (the Griffin-Lim algorithm).

Audio spectrograms have two components: the magnitude and the phase. Most of the information and structure is in the magnitude spectrogram so neural nets generally only synthesize that. If you were to look at a phase spectrogram it looks completely random and neural nets have a very, very difficult time learning how to generate good phases.

When you go from a spectrogram to audio you need both the magnitudes and phases, but if the neural net only generates the magnitudes you have a problem. This is where the Griffin-Lim algorithm comes in. It tries to find a set of phases that works with the magnitudes so that you can generate the audio. It generally works pretty well, but tends to produce that sort of resonant artifact that you're noticing, especially when the magnitude spectrogram is synthesized (and therefore doesn't necessarily have a consistent set of phases).

There are other ways of using neural nets to synthesize the audio directly (Wavenet being the earliest big success), but they tend to be much more expensive than Griffin-Lim. Raw audio data is hard for neural nets to work with because the context size is so large.
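A quick way to hear this effect for yourself is to do the magnitude-only round trip with librosa; this is just a sketch of the Griffin-Lim step, not Riffusion's actual code (librosa.ex fetches a small example clip):

```python
# Discard the phase of a real recording, then let Griffin-Lim estimate a
# compatible set of phases and listen to the artifacts it introduces.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.ex("trumpet"))
S_mag = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # magnitudes only

y_rec = librosa.griffinlim(S_mag, n_iter=32, hop_length=512)
sf.write("griffinlim_reconstruction.wav", y_rec, sr)
```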

Phase is critical for pitch. Here is why. The spectral transformation breaks up the signal into frequency bins. The frequency bins are not accurate enough to convey pitch properly. When a periodic signal is put through an FFT, it will land in a particular frequency bin. Say that the frequency of the signal is right in the middle of that bin. If you vary its pitch a little bit, it will still land in the same bin. Knowing the amplitude of the bin doesn't give you the exact pitch. The phase information will not give it to you either. However, between successive FFT samples, the phase will rotate. The more off-center the frequency is, the more the phase rotates. If the signal is dead center, then each successive FFT frame will show the same phase. When it is off center, the waveform shifts relative to the window, and so the phase changes for every sample. From the rotating phase, you can determine the pitch of that signal with great accuracy.
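Here is a small numpy illustration of that rotating phase, under the assumption of a single pure tone; the residual phase advance between two hops pins the frequency down far more precisely than the bin spacing:

```python
# Estimate the true frequency of a 201 Hz tone from the phase rotation between
# two FFT frames one hop apart (the classic phase-vocoder trick).
import numpy as np

sr, n_fft, hop = 22050, 2048, 512
f_true = 201.0                                   # sits off the bin centre
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * f_true * t)

w = np.hanning(n_fft)
frame0 = np.fft.rfft(w * x[:n_fft])
frame1 = np.fft.rfft(w * x[hop:hop + n_fft])

k = int(round(f_true * n_fft / sr))              # nearest frequency bin
bin_freq = k * sr / n_fft                        # its centre, ~204.6 Hz here

expected = 2 * np.pi * bin_freq * hop / sr       # phase advance if the tone sat at the bin centre
delta = np.angle(frame1[k]) - np.angle(frame0[k]) - expected
delta = (delta + np.pi) % (2 * np.pi) - np.pi    # wrap into (-pi, pi]

f_est = bin_freq + delta * sr / (2 * np.pi * hop)
print(bin_freq, f_est)                           # f_est comes out very close to 201 Hz
```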
Yes, this is exactly right and is why Griffin-Lim generated audio often has a sort of warbly quality. If you use a large FFT you can mitigate the issues with pitch because the frequency resolution in your spectrogram is higher, so the phase isn't so critical to getting the right pitch. But the trade-off of a bigger FFT is that the pitches now have to be stationary for longer.

The other place where phase is critical is in impulse sounds like drum beats. A short impulse is essentially just energy over a broad range of frequencies, but the phases have been chosen such that all the frequencies cancel each other out everywhere except for one short duration where they all add constructively. Without the right phases, these kinds of sounds get smeared out in time and sound sort of flat and muffled. The typing example on their demo page is actually a good example of this.
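A tiny numpy experiment makes the transient point concrete: keep a click's magnitudes, scramble its phases, and the impulse smears out into noise:

```python
# Same magnitude spectrum, random phases: the sharp click loses its peak.
import numpy as np

x = np.zeros(1024)
x[512] = 1.0                                   # a single-sample click
X = np.fft.rfft(x)

phase = np.exp(2j * np.pi * np.random.rand(X.size))
phase[0] = phase[-1] = 1.0                     # keep DC and Nyquist bins real
y = np.fft.irfft(np.abs(X) * phase)

print(x.max(), y.max())                        # the peak collapses once alignment is lost
```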

So what is phase? From dabbling with waveforms in audio editors, sampling, and later learning a little bit about complex numbers, phase seems eventually equivalent to what would sound like changing pitch, modulating the frequency of a periodic signal.

The simplest demonstration of it is the doppler shift. But it's not at all that simple because moving relative to the source the sound pressure and thus the perceived loudness also change, distorting the wave form, thereby introducing resonant frequencies. Now imagine that the transducer is always moving, eg. a plucked string.

The ideal harmonic pendulum swings periodically, only losing energy to attenuation. But the resonant transducer picks up reflections of its own signal, like coupled pendulums, which are intractable according to the three body problem.

On top of that, our hearing is fine tuned to voices and qualities of noise.

Phase is the offset in time. The functions sin(θ) and sin(θ + c), for arbitrary real c, represent the same frequency signal; they are offset from each other horizontally by c, and that c is a phase difference. It has an interpretation as an angle, when the full cycle of the wave is regarded as degrees around a circle; and that's what I mean by rotating phase.

When you take a window of samples of a signal, and run the FFT on it, for every frequency bin, the calculation determines what is the amplitude and phase of the signal. If you have a frequency bin whose center is 200 Hz, and there is a 200 Hz signal, then what you get for that frequency bin is a complex number. The complex number's magnitude ("modulus") is the amplitude of that signal, and its angle ("argument") is the phase.

If the signal is exactly 200 Hz, and if the successive FFT windows move by a multiple of 1/200th of a second, then the phase will be the same in successive FFT windows.

But suppose that the signal is actually 201 Hz: a little faster. Then with each successive FFT window, the phase will not line up any more with the previous window; it will advance a little bit. We will see a rotating complex value: same modulus, but the angle advancing.

From how fast the angle advances relative to the time step between FFT windows, we can deduce that we are capturing a 201 Hz signal in that bin (on the hypothesis that we have a pure, periodic signal in there).

How is the phase determined in the frequency bin? It's basically a vector correlation: a dot product. The samples are a vector which is dot-producted with a complex unit vector. The complex unit vector in the 200 Hz bin is essentially a 200 Hz sine and cosine wave, rolled into a single vector with the help of complex numbers. Sine and cosine are 90 degrees apart in phase, so they form a rectilinear basis (coordinate system). The calculation projects the signal, expressing it as a sum of the sine and cosine vectors. How much of one versus the other is the phase. A signal that is 100% correlated with the sine will have a phase angle of 0 degrees or possibly 180. If it correlates with the cosine component, it will be 90 or 270. Or some mixture thereof.

Because a complex number is two real numbers rolled into one, it simplifies the calculation: instead of doing a dot product with a sine and cosine vector to separately correlate the signal to the two coordinate bases, the complex numbers do it in one dot product operation. When we go around the unit circle, each position on the circle is cos(θ) + i·sin(θ). These complex values give us samples of both functions. Exactly such values are stuffed into the rows of the DFT matrix: complex values from the unit circle divided into equal divisions.

If you look here at the definition of the ω (omega) parameter:

https://en.wikipedia.org/wiki/DFT_matrix

It is the N-th complex root of unity. But what that really means is that it is a 1/Nth step of the way around the unit circle. For instance if N happened to be 360, then ω is the complex number whose |ω| = 1 (unit vector), and whose argument is 1 degree: one degree around the circle. The second row of the DFT matrix has 1, ω, ω², ω³, ... the second row represents the lowest frequency (after zero, which is the first row). It captures a single cycle of a sine and cosine waveform, in N samples. The values in that row step around the unit circle in the smallest increment, so they go around the circle exactly once. The subsequent rows go around the circle in skipped steps, yielding higher frequencies: 1, ω², ω⁴ for twice around the circle; 1, ω³, ω⁶ for three times, ... we get all the harmonics up to our N resolution.
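The DFT-matrix description above is easy to sanity-check in a few lines of numpy (using the same sign convention as np.fft.fft):

```python
# Build the DFT matrix from the N-th root of unity and compare against numpy's FFT.
import numpy as np

N = 8
omega = np.exp(-2j * np.pi / N)                       # one 1/N step around the unit circle
W = omega ** np.outer(np.arange(N), np.arange(N))     # W[j, k] = omega^(j*k)

x = np.random.randn(N)
assert np.allclose(W @ x, np.fft.fft(x))

bin1 = (W @ x)[1]                                     # the "once around the circle" row
print(abs(bin1), np.angle(bin1))                      # magnitude and phase of that bin
```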

> on the hypothesis that we have a pure, periodic signal in there

That pure sine wouldn't generate any artefacts. It would result in a 200Hz output from the AI if it throws the phase information out. You wouldn't hear a difference unless its an (aptly so called) complex signal. Eg. 200 and 201 Hz layered is an impure signal with a period below 1Hz, far outside the scope. Eventually the signals will cancel out completely. [1]

The important point is, I think, that FFT doesn't simply look at the offset aka phase. Rather, 201 Hz looks like a 200 Hz that is moving. So it encodes phase-shift in the delta of the offset between two windows. For a sum of 200 and 201 Hz it has to assume that the magnitude is also changing, which I find entirely counterintuitive.

From the mathematical perspective, this seems like boring homework, far detached from acoustics. So, I don't know. The funny thing is that rotation is very real in the movement of strings. If the orbit in one point is elliptic, that's like two sinusoids at different magnitudes offset by some 90 degrees, in a simplified model. But it has nearly infinite coupled points along its axis. As they excite each other, and each point has a different distance to the receiver, that's where phase shift happens.

> If you look here at the definition of the ω (omega) parameter

I wasn't going to make drone, but I will take a look.

1: https://graphtoy.com/?f1(x,t)=100*sin(x)&v1=true&f2(x,t)=100...

I wonder if this could be improved by using the Hartley transform instead of the Fourier transform.
Considering Stable Diffusion generates 3-channel (RGB) images, maybe it would be possible to train it on amplitude and phase data as two different channels?
People have tried that, but the model essentially learns to discard the phase channel because it is too hard for it to learn any useful information from it.
Got any citations... that sounds like a fascinating thing to read about.
We took a look at encoding phase, but it is very chaotic and looks like Gaussian noise. The lack of spatial patterns is very hard for the model to generate. I think there are tons of promising avenues to improve quality though.
Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up like it should across frequencies at transients. Maybe something the model could grab hold of better is phase discontinuity (deviation from the expected phase based on the previous slices) or relative phase between peaks, encoded as colour?

But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.

That makes a lot of sense, I would be keen to see attempts at that.
I'm curious why, instead of using magnitude and phase, you wouldn't use real and imaginary parts?
There have been some attempts at doing this, some of which have been moderately successful. But fundamentally you still have the problem that from the NN's perspective, it's relatively easy for it to learn the magnitude but very hard for it to learn the phase. So it'll guess rough sizes for the real and imaginary parts, but it'll have a hard time learning the correct ratio between the two.

Models which operate directly on the time domain have generally had a lot more success than models that operate on spectrograms. But because time-domain models essentially have to learn their own filterbank, they end up being larger and more expensive to train.

I wonder if there might be room for a hybrid approach, with a time-domain model taking machine-generated spectrograms as input and turning them into sound. (Just a thought, no idea whether it actually makes sense.)
would it be an approach to use separate color channels for the freq amplitude and freq phase in the same picture? Maybe the network then has a better way of learning the relationships and there would be no need for the postprocessing to generate a phase.
RAVE attacks the phase issue by using a second step of training. I don't completely understand it, but it uses a GAN architecture to make the outputs of a VAE sound better.
Griffin-Lim is slow and is almost certainly not being used.

A neural vocoder such as Hifi-Gan [1] can convert spectra to audio - not just for voices. Spectral inversion works well for any audio domain signal. It's faster and produces much higher quality results.

[1] https://github.com/jik876/hifi-gan

If you check their about page they do say they're using Griffin-Lim.

It's definitely a useful approach as an early stage in a project since Griffin-Lim is so easy to implement. But I agree that these days there are other techniques that are as fast or faster and produce higher quality audio. They're just a lot more complicated to run than Griffin-Lim.

Author here: Indeed we are using Griffin-Lim. Would be exciting to swap it out with something faster and better though. In the real-time app we are running the conversion from spectrogram to audio on the GPU as well because it is a nontrivial part of the time it takes to generate a new audio clip. Any speed up there is helpful.
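For anyone curious what running that step on the GPU can look like, here is a hedged sketch using torchaudio's built-in transform; the n_fft, hop and iteration counts are placeholders, not the app's real settings:

```python
# Sketch: Griffin-Lim phase reconstruction as a torchaudio module living on the GPU.
import torch
import torchaudio

griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=2048, hop_length=512, n_iter=32, power=1.0  # power=1.0 -> magnitude input
).to("cuda")

mag = torch.rand(1, 1025, 400, device="cuda")  # stand-in spectrogram: (batch, n_fft//2 + 1, frames)
audio = griffin_lim(mag)                        # (batch, num_samples), stays on the GPU
```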
I think this is because the generation is done in the frequency domain. Phase retrieval is based on heuristics and not perfect, so it leads to this "compressed audio" feel. I think it should be improvable
The link is down now, so I don't know about this one. But most generated music is generated in the note domain, rather than the audio domain. Any unpleasant resonance would be introduced in the audio synthesis step. And audio synthesis from note data is a very solved problem for any kind of timbre you can conceive of, and some you can't.
You're probably talking about the artifacts of converting a low resolution spectrogram to audio.
Can the spectrogram image be AI upscaled before transforming back to the time domain?
Yes it exists: https://ccrma.stanford.edu/~juhan/super_spec.html

But the issue is not that the spectrogram is low quality.

The issue is that the spectrogram only contains the amplitude information. You also need phase information for generating audio from the spectrogram.

Interesting, can't you quantize and snap to a phase that makes sense to create the most musical resonance?
What happens if you run one of the spectrogram pictures through an upscaler for images like ESRGAN ?
The first ever recordings had people shouting to get anything to register. They sounded like tin. Fast forward to today.

Looking back at image generation just a year or two ago and people would have said similar things.

Not hard to imagine the trajectory of synthesized audio taking a similar path.

It sounds kind of like the visual artifacts that are generated by resampling in two dimensions. Since the whole model is based on compressing image content, whatever it's doing DSP-wise is more-or-less "baked in", and a probable fix would lie in doing it in a less hacky way.
Presumably for similar reasons that the vast majority of AI generated art and text is off-puttingly hideous or bland. For every stunning example that gets passed around the internet, thousands of others sucked. Generating art that is aesthetically pleasing to humans seems like the Mt. Everest of AI challenges to me.
I think your comment is off-topic to the post you are replying to. That wasn't asking about the general aesthetic quality - more about a specific audio artifact.

> For every stunning example that gets passed around the internet, thousands of others sucked.

From personal experience this is simply untrue. I don't want to debate it because you seem to have strong feelings about the topic.

Even if you remove the artifact, the exact same comment applies. It generates a somewhat less interesting version of elevator music. This is not to crap on what they did. As I said, the underlying problem is extremely difficult and nobody has managed to solve it.

I don't feel strongly about this topic at all.

> It generates a somewhat less interesting version of elevator music.

This iteration does, but that's an artifact of how it's being generated: small spectrograms that mutate without emotional direction (by which I mean we expect things like chord changes and intervals in melodies that we associate with emotional expressions - elevator music also stays in the neutral zone by design).

I expect with some further work, someone could add a layer on top of this that could translate emotional expressions into harmonic and melodic direction for the spectrogram generator. But maybe that would also require more training to get the spectrogram generator to reliably produce results that followed those directions?

The vast majority of human generated art is hideous or bland. Artists throw away bad ideas or sketches that didn't work all the time. Plus you should see most of the stuff that gets pasted up on the walls at an average middle school.
Hard disagree. The average middle school picture will have certain aspects exaggerated, giving you insights into the mind's eye of the creator, how they see the world, what details they focus on. There is no such mind's eye behind AI art so it's incredibly boring and mundane, no matter how good a filter you apply on top of its fundamental lack of soul or anything interesting to observe in the picture beyond surface level. It's great for making art assets for businesses to use, it's almost a perfect match, as they are looking to have no controversial soul to the assets they use, but lots of pretty bubblegum polish.
Perhaps most of the AI art out there (that honestly represents itself as such) is boring and mundane, but after many hours exploring latent space, I assure you that diffusion models can be wielded with creativity and vision.

Prompting is an art and a science in its own right, not to speak of all the ways these tools can be strung together.

In any case, everything is a remix.

I have to agree, the act of coming up with a prompt is one and the same with providing "insights into the mind's eye of the creator, how they see the world, what details they focus on" - two people will describe the same scene with completely different prompts.
And the vast majority of professionally produced artwork is for business use. It’s packaging design or illustration or corporate graphics or logos or whatever.

I don’t get the objection.

> For every stunning example that gets passed around the internet, thousands of others sucked

…implying there may be an art to AI art. Hmm.

Meanwhile, the degree to which it is off-puttingly hideous in general can be seen in the popularity of Midjourney — which is to observe millions of folks (of perhaps dubious aesthetic taste) find the results quite pleasing.

Not sure about this. Models like Midjourney seem to put out very consistently good images.
I've compiled/run a dozen different image to sound programs and none of them produce an acceptable sound. This bit of your code alone would be a great application by itself.

It'd be really cool if you could implement an MS paint style spectrum painting or image upload into the web app for more "manual" sound generation.

"fine-tuned on images of spectrograms paired with text"

How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.

Amazing work! Did you use CLIP or something like that to train genre + mel-spectrogram? What datasets did you use?
Hi Hayk, I see that the inference code and the final model are open source. I am not expecting it, but is the training code and the dataset you used for fine-tuning, and process to generate the dataset open source?
The audio sounds a bit lossy, would it be possible to create high quality spectrograms from music, downsample them, and use that as training data for a spectrogram upscaler?

It might be the last step this AI needs to bring some extra clarity to the output.

Super clever idea of course. But leaving aside how it was produced, I’ll be one of those who is underwhelmed by the musicality of this. I am judging this in terms of classical music. I repeatedly tried to get it to just play pure piano music without any other add-ons (cymbals etc). It kept mixing the piano with other stuff.

Also the key question is - would something like this ever produce something as hauntingly beautiful and unique as classical music pieces?

This is amazing! This is a fantastic concept generator. The verisimilitude with specific composers and techniques is more than a little uncanny. A few thoughts after exploring today…

- My strongest suggestion is finding some strategy for smoothing over the sometimes harsh-sounding edge of the sample window.
- Perhaps it could be filling in/passing over segments of what is sounded to the user as a larger loop? Both giving it a larger window to articulate things but maybe also showcasing the interpolation more clearly…
- Tone control may seem challenging but I do wonder if you couldn’t “tune” the output of the model as a whole somehow (given the spectrogram format it could be a translation/scale knob potentially?)

When you say fine tuned do you mean fine tuned on an existing stable diffusion checkpoint? If so which?

It would be very interesting to see what the stable diffusion community that is using automatic1111 version would do with this if it were made into an extension.

Can you run this on any hardware already capable of running SD 1.5? I am downloading the model right now, might play with this later.

Guessing at the speed with which AI is developing these days someone is going to have the extension up in two hours at most.

I bet the AUTOMATIC1111 web UI music plugin drops within 48 hours.
Yes! Although to have real time playback with our defaults you need to be able to generate 40 steps at 512x512 in under 5 seconds.
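A rough way to check whether your card clears that bar is to time a warm run; the model id is assumed to be the released checkpoint and the prompt is arbitrary:

```python
# Time one 40-step 512x512 generation and compare it to the ~5 second budget.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")

pipe("solo acoustic guitar", num_inference_steps=40, height=512, width=512)  # warm-up
start = time.time()
pipe("solo acoustic guitar", num_inference_steps=40, height=512, width=512)
print(f"{time.time() - start:.1f} s for 40 steps at 512x512")
```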
Good to know. I was just so close with just under 7s using 40 steps and Euler a as sampler.
Hayk! How smart are you! I loved your work on SymForce and Skydio - totally wasn't expecting you to be co-author on this!

On a serious note, I'd really love some advice from you on time management and how you get so much done? I love Skydio and the problems you are solving, especially on the autonomy front, are HARD. You are the VP of Autonomy there and yet also managed to get this done! You are clearly doing something right. Teach us, senpai!

Hello - this is awesome work. Like other commenters, I think the idea that if you are able to transfer a concept into a visual domain (in this case via fft) it becomes viable to model with diffusion is super exciting but maybe an oversimplification. With that in mind, do you think this type of approach might work with panels of time series data?
Did you have a data set for training the relationship between words and the resulting sound?
Obviously this needs a little more polish, but I've wanted this for so long I'm willing to pay for it now if it helps push the tech forward. Can I give you money?
Amazing work! Do you plan on open-sourcing the code to train the model?
What sort of setup do you need to be able to fine tune Stable Diffusion models? Are there good tutorials out there for fine tuning with cloud or non-cloud GPUs?
Super! Makes sense since Skydio is also amazing.

How much data is used for fine tuning? Since spectrograms are (surely?) very out of distribution for the pre-training dataset, how much value does the pre-training really bring?

To be honest, we're not sure how much value image pre training brings. We have not tried to train from scratch, but it would be interesting.

One thing that's very important though is the language pre-training. The model is able to do some amazing stuff with terms that do not appear in our data set at all. It does this by associating with related words that do appear in the dataset.

Hi, I really admire the skill you put at work on this project. At the same time, I think everyone is overlooking how crucial and problematic the training factor is.

Why was stable diffusion able to generate spectrograms? Because it was fed some. Presumably, those original spectrograms were scraped with little concern over creators' permissions, just like it has been for artists' work in order to produce art-looking image generation. Please, research what has been happening in the art community lately. https://www.youtube.com/watch?v=Nn_w3MnCyDY

A protest on ArtStation has been shown to influence Midjourney's results, proving that huge amounts of proprietary work are constantly scraped without the creators' permission. AIs like these work so well just because they steal and remix real artists' work in the first place. There are going to be legal wars about this.

Stable Diffusion doesn't have an official music generation AI precisely because it couldn't train it with the same approach without being sued by music labels right away, while isolated artists don't have the same power.

So, back to my question: have you wondered whose work is Stable Diffusion remixing here? Your endeavour is great technically, but as we progress into the future we have to be more aware of the ethical implications that come with different forms of progress.

You could try to base your project on a collection of free-to-use spectrograms, and see how it performs. If you do, I think it could actually be very interesting and useful to discuss the results here on Hacker News.

Cheers!

s.gif
What I would really like to know is what happens if one trains the model from scratch (or is that not possible because the training requirements are different? Sorry for my ignorance, I have never fine-tuned a diffusion model before).

In my experience (CNN-based imagery segmentation), proven architectures (e.g. U-Net) performed similarly with or without fine-tuning from existing models (mostly trained on ImageNet, Cityscapes, etc.) IF the target domain was rather different.

At least in the field of imagery segmentation there is not much of a point in fine-tuning an off-the-shelf model on, let's say, medical imagery.

So maybe it's the same for the Stable Diffusion model. I don't see how knowledge about the relationship between a prompt and imagery depicting that prompt should help this model map the prompt to a spectrogram of that prompt.

s.gif
You can embed images in spectrograms.. might sound weird though
s.gif
This is groundbreaking! All other attempts at AI-generated music have, IMO, fallen flat... These results are actually listenable, and enjoyable! It is almost frightening how powerful this can be.
s.gif
Reach out to the Beatstars CEO. He was looking for an AI play for his music producers marketplace. Probably solid B2B lead there.
s.gif
Amazing work. Can this be applied to voice?

Example prompt: “deep radio host voice saying ‘hello there’”

Kind of like a more expressive TTS?

s.gif
Author here: It can certainly be applied to voice, but the model would need deeper training to speak intelligibly. If you want to hear more singing, you can try a prompt like "female voice", and increase the denoising parameter in the settings of the app.

That said, our GPUs are still getting slammed today so you might face a delay in getting responses. Working on it!

s.gif
The site isn't working for me? Anything I have to fix on my side to make it work?
s.gif
Crashes repeatedly on iOS in Firefox (my usual browser), is OK on Safari though, so probably not a webkit thing.
s.gif
This is super awesome.

Have you already explored doing the same with voice cloning?

s.gif
How many songs did you use for the training data?
s.gif
Is classical music harder? I noticed you didn't have any classical music tracks. I wonder if it's because it is more structured?
s.gif
Funny that Hayk is an early Skydio guy!

2 amazing AI projects. Huge respect :)

This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.

Absolutely blows my mind.

s.gif
Author here: We were blown away too. This project started with a question in our minds about whether it was even possible for the stable diffusion model architecture to output something with the level of fidelity needed for the resulting audio to sound reasonable.
s.gif
Any chance of spoken voice-work being possible? It would be interesting to see if a model could "speak" like James Earl Jones or Steve Blum.
s.gif
Excellent work! Singing would be amazing - karaoke can finally sound good :p

Have you released a tool for volumetric capture? I'm applying this to LED lighting fixture setup for tv/film/live shows and 3D positioning is the last step to fully automated configuration.

My goal is real-time sync between 3D model and real world.

s.gif
Are there any open source models with good quality?

I had a look around several months ago, and it seems like everything is locked behind SaaS APIs.

s.gif
have a look at UberDuck, they do something like this
s.gif
Wasn't this Fraunhofer's big insight that led to the development of MP3? Human perception actually is pretty forgiving of perturbations in the Fourier domain.
s.gif
You probably mean Karlheinz Brandenburg, the developer of MP3, who worked on psychoacoustics. Not completely off though, as he did the research at a Fraunhofer research institute, which takes its name from Joseph von Fraunhofer, the inventor of the spectroscope.
s.gif
Does the institute not also claim that work?
s.gif
Fair enough. But for me, when talking about `having an insight`, I don't imagine a non-human entity doing that. And to be pedantic (talking about Germans doing research, I hope everyone would expect me to be), the institute is called Fraunhofer IIS. `Fraunhofer` would colloquially refer to the society, which is an organization with 76 institutes total. Although, of course, the society will also claim the work...
s.gif
It's an interesting question, one I hadn't thought of before. But in common language, it sometimes makes sense to credit the institution, others just the individuals. I think may be more based around how much the institution collectively presents itself as the author and speaks on behalf of the project versus the individuals involved. Here is my own general intuition for a few contrasting cases:

Random forests: Ho and Breiman, not really Bell Labs and UC Berkeley

Transistors: Bardeen, Brattain, and Shockley, not really Bell Labs (thank the Nobel Prize for that)

UNIX: Primarily Bell Labs, but also Ken Thompson and Dennis Ritchie (this is a hard one)

GPT-n: OpenAI, not really any individual, and I can't seem to even recall any named individual from memory

s.gif
Bringing the right people together and having the right environment that gives rise to „having an insight“ can be a big part as well.
s.gif
In very limited situations. You can move a frequency around (or drop it entirely) if it's being masked by a nearby loud frequency. Otherwise, you would be amazed at the sensitivity of pitch perception.
s.gif
The easy example of this is playing a slightly out of tune guitar, or a mandolin where the strings in the course aren't matched in pitch perfectly. You can hear it, and it's just a few cents off.
s.gif
You can also add another neural-network to "smooth" the spectrogram, increase the resolution and remove artefacts, just like they do for image generation.
s.gif
It's...not effective though. Am I listening to the wrong thing here? Everything I hear from the web app is jumbled nonsense.
s.gif
I think we're at the point, with these AI generative model thingies, where the practitioners are mesmerized by the mechatronic aspect like a clock maker who wants to recreate the world with gears, so they make a mechanized puppet or diorama and revel in their ingenuity.
s.gif
And that's a bad thing?

How do you think human endeavours progress other than by small steps?

s.gif
Look at GAN art from a few years ago, compared to MidJourney v4.
s.gif
Really? They sound quite clearly like the prompt to me if I “squint my ears” a little
This is a genius idea. Using an already-existing and well-performing image model, and just encoding input/output as a spectrogram... It's elegant, it's obvious in retrospect, it's just pure genius.

I can't wait to hear some serious AI music-making a few years from now.

s.gif
This idea is presented by Jeremy Howard in literally the first Deep Learning for Coders class (most recent edition). A student wanted to classify sounds but only knew how to do vision, so they converted the sounds to spectrograms, fine-tuned the model on the labelled spectrograms, and the classification worked pretty well on test data. That of course does not take any merit away from the Riffusion authors.
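For readers curious how that classroom exercise translates into code, here is a minimal sketch, assuming librosa and matplotlib are installed; the file paths and labels are hypothetical. It renders audio clips as mel-spectrogram PNGs that any off-the-shelf image classifier could then be fine-tuned on.

    # Minimal sketch: render audio clips as mel-spectrogram images so a standard
    # image classifier can be fine-tuned on them. Paths/labels are hypothetical.
    import numpy as np
    import librosa
    import matplotlib.pyplot as plt

    def audio_to_spectrogram_image(wav_path, png_path, sr=22050, n_mels=128):
        y, sr = librosa.load(wav_path, sr=sr)                      # load mono audio
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        S_db = librosa.power_to_db(S, ref=np.max)                  # log-scale magnitudes
        plt.figure(figsize=(4, 4))
        plt.axis("off")
        plt.imshow(S_db, origin="lower", aspect="auto", cmap="magma")
        plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
        plt.close()

    # e.g. audio_to_spectrogram_image("clips/dog_bark_01.wav", "spectra/dog_bark_01.png")
    # The resulting PNGs can then be fed into any image fine-tuning pipeline.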
s.gif
The idea of connecting CV to audio via spectrograms predates Jeremy Howard's course by quite a bit. That's not really the interesting part here, though. The fact that a simple extension of an image generation pipeline produces such impressive results with generative audio is what is interesting. It really emphasizes how useful the idea of Stable Diffusion is.

edit: added a bit more to the thought

s.gif
The idea to apply computer vision algorithms to spectrograms is not new. I don't know who first came up with it, but I first came across it about a decade ago.

I just ran a quick Google Scholar search, and the first result is https://ieeexplore.ieee.org/abstract/document/5672395

This is from 2010. I didn't go looking, but it wouldn't surprise me if the idea is older than that.

s.gif
There were a number of systems for composers in the 90s (continuing through to today) designed for the workflow of converting a sound to a spectrogram, doing visual processing on the image, and then re-synthesizing the sound from the altered spectrogram. Many were inspired by Xenakis' UPIC system, which was designed around the second half of this workflow: you'd draw the spectrogram with a pen and then synthesize it.

https://en.wikipedia.org/wiki/UPIC

Edit: my favorite of all these systems was Chris Penrose's HyperUPIC which provided a lot of freedom in configuring how the analysis and synthesis steps worked.
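That analyze-edit-resynthesize workflow is easy to approximate today. A rough sketch, assuming librosa and soundfile are available and using Griffin-Lim for phase reconstruction (not the historical systems' actual algorithms); the file names and the "edits" are purely illustrative:

    # Rough sketch of the spectrogram edit-and-resynthesize workflow:
    # analyze -> treat the magnitude spectrogram as an image -> rebuild audio.
    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("input.wav", sr=22050)
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))   # magnitude spectrogram

    # "Visual" edit: any image-style operation on the 2D magnitude array,
    # e.g. boost low frequencies and blank out a band (purely illustrative).
    S_edit = S.copy()
    S_edit[:100, :] *= 1.5          # exaggerate the lowest bins
    S_edit[300:400, :] = 0.0        # carve out a horizontal stripe

    # Resynthesize audio from the edited magnitudes (phase via Griffin-Lim).
    y_out = librosa.griffinlim(S_edit, hop_length=512, n_iter=32)
    sf.write("output.wav", y_out, sr)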

s.gif
Makes me wonder if we will see a generalization of this idea. Just like in a CPU, where 90%+ of what you want to do can be modeled with very few instructions (mov, add, jmp, ...), we could see a set of very refined models (Stable Diffusion, GPT, etc.) and all of their abstractions on top (ChatGPT, Riffusion, etc.).
s.gif
And you ask Stable Diffusion to generate Piet code for a slightly better version of Stable Diffusion (or ChatGPT)... which you can then use to generate a better version, and so on. Singularity, here we come!
s.gif
Perhaps GPT could run on top of Stable-diffusion, generating output in the form of written text (glyphs).
s.gif
Indeed, I think this would be a cost-effective way to go forward.
s.gif
For what it's worth, people were trying the same thing with GANs (I also played with doing it with StyleGAN a bit), but the results weren't as good.

The amazing thing is that the current diffusion models are so good that the spectrograms are actually reasonable enough despite the small room for error.

s.gif
As someone who loves making music and loves listening to music made by other humans with intention, it just makes me sad.

Sure, AI can do lots of things well. But would you rather live in a world where humans get to do things they love (and are able to afford a comfortable life while doing so) or a world where machines do the things humans love and humans are relegated to the remaining tasks that machines happened to be poorly suited for?

s.gif
As someone who loves making music and loves listening to music (regardless of its origins, in my case), it doesn't make me that sad. Sure, at first I had an uncomfortable feeling that AI could do this sacred, magic thing that only I and other fellow humans knew how to do... But then I realized the same thing is happening with visual art, so I applied the same counterarguments that have been cooking in my head.

I think that kind of attitude is defeatist - it implies that humans will be stopped from making music if AI learns how to do it too. I don't think that will happen. Humans will continue making music, as they always have. When Kraftwerk started using computers to make music back in the 70s, people were also scared of what that would do to musicians. To be fair, live music has died out a bit (in the sense that there aren't that many god-on-earth-level rockstars), but it's still out there, people are performing, and others who want to listen can go and listen.

Maybe consumers will start consuming more and more AI music instead of human music [0], but the worst thing that can happen is that music will no longer be a profitable activity. But then again, today's music industry already has some elements of automation - washed-out rhythms, sexual themes over and over again, rehashing the same old songs in different packages... So nothing's going to change in the grand scheme of things.

[0] https://www.youtube.com/watch?v=S1jWdeRKvvk

s.gif
> but the worst thing that can happen is that music will no longer be a profitable activity.

For me, the worst that could happen is that people spend so much time listening to AI generated music, that human musicians can no longer find audiences to connect to. It's not just about economics (though that's also huge). It's the psychological cost of all of us spending greater and greater fractions of our lives connected to machines and not other people.

s.gif
Music was always about people. Even today, as most people listen to mass-produced, run-of-the-mill muzak, there is still a significant audience that seeks the "human element" for its own sake.

The black metal community, for example, has always rejected all forms of "automation" and considers them not kvlt - rawness is a sought-after quality, defined as having people performing as close to the recording equipment as possible.

There's also a rapper named Bones (Elmo O'Connor) who has never signed a contract with a label, makes only the music he wants to make, and releases a couple of albums every year. There's something about his approach that makes his music sound very organic and honest. I listen to him more than I listen to any mass-produced rapper.

So in conclusion, music was always about people. Unless AI reaches AGI level, I don't think it will ever impact music enough to kill off its audience.

s.gif
I agree with this so much that I’d take off the comment about AGI: I think that even if/when AGI lands there will always be audiences that seek raw, organic, and honest artistic expression from other humans.
s.gif
The vast majority of music produced is listened to by nobody, or a handful of people, so this is already the case really.
s.gif
I play the piano (badly). There are many other people who can play much better than I. There are simple computer programs which can play better. It doesn't stop me from enjoying it or playing it. Computers have been beating people at Chess for years yet you still see people everywhere enjoying the game. At some point computers will be better than humans at absolutely everything but it shouldn't stop you as a human from enjoying anything.
s.gif
Sure, but a large part of enjoyment in creativity for many is the joy of sharing it with an audience. To the degree that people are spending their available attention on AI-generated content, they have less time and attention available to spend listening to and watching art created by humans.
s.gif
I would rather live in a world where humans get to do things they love because they can (and not because they have to earn their bread), and machines get to do basically everything that needs to be done but no human is willing to do it.

Advancing AI capabilities in no way detracts from this. You talk about humans being "relegated to the remaining tasks" - but that's a consequence of our socioeconomic system, not of our technology.

s.gif
> but that's a consequence of our socioeconomic system, not of our technology.

Those two are profoundly intertwined. Our tech affects our socioeconomic systems and vice versa.

s.gif
Sure, so now that we have new tech, let's update the socioeconomic system to accommodate it.
s.gif
It's not, but at least it's feasible. Trying to suppress technology instead is futile in the long term.
s.gif
Musicians already make much (most?) of their money via gigs and I don't think going to watch an AI play at a gig will be all too common. I think we'll be fine. Might have to adapt though.
s.gif
I'd rather live in the world where humans do things that are actually unique and interesting, and aren't essentially being artificially propped up by limiting competition.

I don't see this as a threat to human ingenuity in the slightest.

s.gif
There are still chess tournaments for humans, even though a smartphone can play chess better than any grandmaster.
s.gif
I'm super excited about the Audio AI space, as it seems permanently a few years behind image stuff - so I think we're going to see a lot more of this.

If you're interested, the idea of applying image processing techniques to spectrograms of audio is explored briefly in the first lesson of one of the most recommended AI courses on HN: Practical Deep Learning for Coders https://youtu.be/8SF_h3xF3cE?t=1632

s.gif
> I can't wait to hear some serious AI music-making a few years from now.

I think this will be particularly useful for musical compositions in movies and film, where the producer can "instruct" the AI about what to play, when, and how to transition so that the music matches the scene progression.

s.gif
Not only that, but sampling. I'd say there's at least one sample from something in most modern music. This can essentially create the "sounds" you're looking for as an artist. I need a sort of high-pitched drone here... Rather than dig through sample libraries, you just generate a few dozen results from a diffusion model with some varying inputs and you'd have a small sample set of the exact thing you're looking for. There's already so much processing of samples after the fact that the actual quality or resolution of the sample is inconsequential. In a lot of music, you're just going after the texture and tonality and timbre of something... This can be seen in some Hans Zimmer videos of how he slows down certain sounds massively to arrive at new sounds... or in granular synthesis... This is going to open up a lot of cool new doors.
s.gif
I was thinking of gaming, where music can and should dynamically shift based on different environmental and player conditions.
s.gif
I suspect that if you had tried this with previous image models the results would have been terrible. This only works since image models are so good now.
s.gif
You already hear a ton of them. Lofi music on these massively popular channels is basically auto-generated "music" plus auto-generated artwork.
s.gif
I dabble in music production and know some of the people in the "Lofi" world, so I know for a fact that this is not true. It's just a formulaic sub-genre where people are trying to make similar instrumentals with the same vibe. It would be jarring to listen to a playlist while studying if each song had wildly different tempos, instruments, etc.

Also, the music doesn't sound "Lofi" because it's generated by algorithms. A lot of hard work and software goes into taking a clean, pitch-perfect digital signal and making it sound like something playing on a record player from the 70s.

s.gif
Do you have any sources for more information about this?
Some of this is really cool! The 20 step interpolations are very special, because they're concepts that are distinct and novel.

It absolutely sucks at cymbals, though. Everything sounds like realaudio :) composition's lacking, too. It's loop-y.

Set this up to make AI dubtechno or trip-hop. It likes bass and indistinctness and hypnotic repetitiveness. Might also be good at weird atonal stuff, because it doesn't inherently have any notion of what a key or mode is?

As a human musician and producer I'm super interested in the kinds of clarity and sonority we used to get out of classic albums (which the industry has kinda drifted away from for decades) so the way for this to take over for ME would involve a hell of a lot more resolution of the FFT imagery, especially in the highs, plus some way to also do another AI-ification of what different parts of the song exist (like a further layer but it controls abrupt switches of prompt)

It could probably do bad modern production fairly well even now :) exaggeration, but not much, when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.

s.gif
> composition's lacking, too. It's loop-y.

Well no wonder, it has absolutely no concept of composition beyond a single 5s loop, if I understand correctly.

> It absolutely sucks at cymbals, though. Everything sounds like realaudio :)

> It could probably do bad modern production fairly well even now :) exaggeration, but not much, when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.

I haven't sat down yet to calculate it, but is the output of SD at 512x512 px and 24-bit enough, in theory, to generate CD-quality audio?

s.gif
No.

And I suspect this will always have phase smearing, because it's not doing any kind of source separation or individual synthesis. It's effectively a form of frequency domain data compression, so it's always going to be lossy.

It's more like a sophisticated timbral morph, done on a complete short loop instead of an individual line.

It would sound better with a much higher data density. CD quality would be 220500 samples for each five second loop. Realtime FFTs with that resolution aren't practical on the current generation of hardware, but they could be done in non-realtime. But there will always be the issue of timbres being distorted because outside of a certain level of familiarity and expectation our brains start hearing gargly disconnected overtones instead of coherent sound objects.
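For a back-of-the-envelope comparison of those budgets, here is a sketch assuming a mono signal (to match the 220,500-sample figure above) and a single 24-bit 512x512 spectrogram image per five-second loop:

    # Rough information-budget comparison: a 24-bit 512x512 spectrogram image
    # vs. 5 seconds of 16-bit, 44.1 kHz mono audio.
    spectrogram_bits = 512 * 512 * 24      # ~6.3 million bits in the image
    cd_bits = 44_100 * 5 * 16              # ~3.5 million bits of raw samples
    print(spectrogram_bits, cd_bits)
    # The raw counts are comparable, but the image stores only ~512 frequency
    # bins of magnitude (no phase), so fine detail is lost regardless of bit depth.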

What this is not doing is extracting or understanding musical semantics and reassembling them in interesting ways. The harmonies in some of these clips are pretty weird and dissonant, and not what you'd get from a human writing accessible music. This matters because outside of TikTok music isn't about 5s loops, and longer structures aren't so amenable to this kind of approach.

This won't be a problem for some applications, but it's a long way short of the musical equivalent of a MidJourney image.

Generally we're a lot more tolerant of visual "bugs" than musical ones.

s.gif
I think an approach like this could generate interesting sounds we as humans would never think of. Or meshing two sounds in ways we could barely imagine or implement.

But of course something like this, which only thinks in 5s clips can not generate a larger structure, like even a simple song. Maybe another algorithm could seed the notes and an algorithm like this generates the sounds via img2img.

s.gif
>and not what you'd get from a human writing accessible music

The timbral qualities of the posted samples remind me of some of the stuff I heard from Aphex Twin, like Alberto Balsalm. Not accessible by a long shot but definitely human

This is huge.

This shows me that Stable Diffusion can create anything that meets the following conditions:

1. Can be represented as a static item in two dimensions (their weaving together notwithstanding, it is still piece-by-piece statically built)

2. Acceptable with a certain amount of lossiness on the encoding/decoding

3. Can be presented through a medium that at some point in creation is digitally encoded somewhere.

This presents a lot of very interesting changes for the near term. ID.me and similar security approaches are basically dead. Chain of custody proof will become more and more important.

Can stable diffusion work across more than two dimensions?

s.gif
Now I'm wondering about feeding Stable Diffusion 2D landscape data with heightmaps and letting it generate maps for RTS video games. I mean, the only wrinkle there is an extra channel or two.
s.gif
Any image generator can do well on any two-dimensional data, including SD, DALL-E, and Midjourney.

One feature of SD not discussed much, in my opinion, is the deterministic key (seed) it provides to the user. This is what enables the smooth transition between each second of music it generates and the next second in time. Moving the cursor through latent space in a minimal way creates the next piece of information while changing it ever so slightly, and it definitely sounds good to human ears.

s.gif
Being able to blend between prompts and attention weightings smoothly from a fixed seed is definitely a fantastic and underexplored avenue; it makes me recall "vector synthesis" common in wavetable synthesizers since the '80s as discussed here[0]. I feel we are just a couple of months from seeing people start using MIDI controllers to explore these kinds of spaces. Something could be hacked together today, but it will be interesting to see once the images can be generated in nearly realtime as the controls are adjusted.

[0] https://www.soundonsound.com/techniques/synth-school-part-7
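A minimal sketch of that fixed-seed blending idea with the diffusers library, assuming a diffusers version whose pipeline accepts prompt_embeds; the model ID, prompts, and step counts are only illustrative, not anyone's actual setup:

    # Sketch: interpolate between two prompt embeddings while keeping the seed
    # fixed, so successive frames drift smoothly through latent space.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    def embed(prompt):
        # Encode a prompt into the text-encoder hidden states the UNet conditions on.
        tokens = pipe.tokenizer(
            prompt, padding="max_length",
            max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
        )
        with torch.no_grad():
            return pipe.text_encoder(tokens.input_ids.to("cuda"))[0]

    emb_a = embed("ambient dub techno")            # illustrative prompts
    emb_b = embed("uplifting trance arpeggios")

    for i in range(8):
        w = i / 7                                   # blend weight from 0.0 to 1.0
        emb = (1 - w) * emb_a + w * emb_b           # linear blend of embeddings
        generator = torch.Generator("cuda").manual_seed(42)  # same noise every frame
        image = pipe(prompt_embeds=emb, generator=generator,
                     num_inference_steps=30).images[0]
        image.save(f"frame_{i:02d}.png")

With a MIDI controller mapped to the blend weight, the same loop could be driven interactively once generation gets close to real time.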

s.gif
>Being able to blend between prompts and attention weightings smoothly from a fixed seed is definitely a fantastic and underexplored avenue;

Agree totally. Before SD was created, I thought it was impossible to replicate a prompt more than once. The deterministic/fixed seed is a big innovation of SD, and how well it works in practice is simply amazing.

From the article: >Those who tried this method, however, soon found that, without analogue filters to run through the harmonic content of waveforms, picking out and exaggerating their differing compositions, most hand‑drawn waveforms sounded rather ordinary and often bland, despite the revolutionary way in which they were created.

Yes, the technique that the people at Riffusion created, demonstrated to everyone, and shared as well is the holy grail of electronic music synthesis. I would imagine it still has some way to go before it is applied to electronic music effectively: integration with some tools, practice by musicians on the new tool, some fine-tuning, etc.

s.gif
I would argue that its high-fidelity representations of 3D space imply that the model's weights are capable of pattern-matching in multiple dimensions, provided the input is embedded into 2D space appropriately.
s.gif
Can you expand on what you mean with the identity/security services?
s.gif
Something unlikely to be affected: OIDC, PGP, etc. as these require signals that have full fidelity to authorize access.

Something likely to be affected: anything using biometrics as a password instead of a name.

I think there has to be a better way to make long songs...

For example, you could take half the previous spectrogram, shift it to the left, and then use the inpainting algorithm to make the next bit... Do that repeatedly, while smoothly adjusting the prompt, and I think you'd get pretty good results.

And you could improve on this even more by having a non-linear time scale in the spectrograms. Have 75% of the image be linear, but the remaining 25% represent an exponentially downsampled version of history. That way, the model has access to what was happening seconds, minutes, and hours ago (although less detail for longer time periods ago).
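A hedged sketch of that sliding-window idea using a stock inpainting pipeline; the model ID, sizes, and prompt are assumptions, the spectrogram-to-audio step is left out, and this is not Riffusion's actual pipeline:

    # Sketch: extend a spectrogram by repeatedly shifting it left and letting an
    # inpainting model fill the newly exposed right half.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    W = H = 512
    # Mask: white (inpaint) on the right half, black (keep) on the left half.
    mask = Image.new("L", (W, H), 0)
    mask.paste(255, (W // 2, 0, W, H))

    canvas = Image.open("seed_spectrogram.png").convert("RGB").resize((W, H))
    segments = [canvas]

    for step in range(8):
        shifted = Image.new("RGB", (W, H))
        shifted.paste(canvas.crop((W // 2, 0, W, H)), (0, 0))   # keep last half as context
        canvas = pipe(prompt="lofi hip hop beat, mellow piano",
                      image=shifted, mask_image=mask).images[0]
        segments.append(canvas)
    # Each new canvas's right half is fresh material; stitching those halves in
    # order (then converting the long spectrogram to audio) yields a longer track.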

s.gif
Perhaps you could do a hierarchical approach somehow, first generating a "zoomed out" structure, then copying parts of it into an otherwise unspecified picture to fill in the details.

But perhaps plain stable diffusion wouldn't work - you might need different neural networks trained on each "zoom level" because the structure would vary: music generally isn't like fractals and doesn't have exact self-similarity.

Authors here: Fun to wake up to this surprise! We are rushing to add GPUs so you can all experience the app in real-time. Will update asap
s.gif
Awesome! There is another project out there that does it on CPU: https://github.com/marcoppasini/musika. Maybe mix the two, i.e. take the initial output of Musika, convert it to a spectrogram, and feed it to Riffusion to get more variation...
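A rough sketch of that hand-off, assuming the publicly shared Riffusion checkpoint on Hugging Face and a diffusers img2img pipeline; the mel-scaling here is simplified, and a real setup would need Riffusion's own spectrogram conventions and spectrogram-to-audio conversion:

    # Sketch: render another generator's audio clip as a spectrogram image and
    # push it toward a text prompt with img2img. Simplified; illustrative only.
    import numpy as np
    import torch
    import librosa
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    y, sr = librosa.load("musika_output.wav", sr=22050)
    S_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=512), ref=np.max)
    scaled = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-9)   # 0..1 range
    img = Image.fromarray(np.uint8(255 * scaled)).convert("RGB").resize((512, 512))

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16).to("cuda")
    out = pipe(prompt="jazzy saxophone solo", image=img, strength=0.5).images[0]
    out.save("variation_spectrogram.png")   # still needs a spectrogram-to-audio step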
s.gif
"fine-tuned on images of spectrograms paired with text"

How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.

s.gif
Fascinating stuff.

One of the samples had vocals. Could the approach be used to create solely vocals?

Could it be used for speech? If so, could the speech be directed or would it be random?

I bet a cool riff on this would be to simply sample an ambient microphone in the workplace and use that to generate and slowly introduce matching background music that fits the current tenor of the environment. Done slowly and subtly enough, I'd bet the listener may not even be entirely aware it's happening.

If we could measure certain kinds of productivity it might even be useful as a way to "extend" certain highly productive ambient environments a la "music for coding".

s.gif
>in the workplace

Or at a house party, club or restaurant... as more people arrive or leave and the energy level rises or declines..or human rhythms speed up or slow down...so does the music...

s.gif
Or perhaps use it in a hospital to play music that matches the state of a patient’s health as they are passing away.
s.gif
I would not want to go to the hospital for a mild ear infection and hear the AI start blasting death metal.
This opens up ideas. One thing people have tried to do with Stable Diffusion is create animations. Of course, they all come out pretty janky and gross; you can't get the animation smooth.

But what if a model was trained not on single images, but on animated sequential frames, in sets, laid out on a single visual plane? So a panel might show a short sequence of a Disney princess expressing a particular emotion as 16 individual frames collected into a single image. One might then be able to generate a clean animated sequence of a previously unimagined Disney princess expressing any emotion the model has been trained on. Of course, with big enough models one could (if they can get it working) produce text-prompted animations across a wide variety of subjects and styles.
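The consumption side of that idea is straightforward. Here is a small sketch, assuming a generated 4x4 grid image already exists (the filename is hypothetical), that slices it into frames and assembles a GIF with Pillow:

    # Sketch: slice a generated 4x4 "sprite sheet" of sequential frames into an
    # animation. The grid image itself would come from a model trained (or
    # prompted) to lay frames out this way; this only handles slicing/assembly.
    from PIL import Image

    sheet = Image.open("princess_grid.png")        # hypothetical 4x4 frame grid
    cols = rows = 4
    fw, fh = sheet.width // cols, sheet.height // rows

    frames = [
        sheet.crop((c * fw, r * fh, (c + 1) * fw, (r + 1) * fh))
        for r in range(rows) for c in range(cols)  # row-major frame order
    ]
    frames[0].save("princess.gif", save_all=True,
                   append_images=frames[1:], duration=80, loop=0)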

s.gif
That's an interesting idea. I wonder if this would work with inpainting - erase the 16th cell and let the AI fill it in. Then upscale each frame. Has anyone experimented with this?
s.gif
Well, look at that. I'm totally not surprised, lol.
Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (LLM for text, then text to speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models.

s.gif
I think you'd have to start with separate spectrograms per instrument, then blend the complete track in "post" at the end.
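A toy sketch of that blend-in-post step, assuming each instrument's magnitude spectrogram has already been generated and saved as a .npy file (hypothetical filenames), with Griffin-Lim used for reconstruction:

    # Sketch: reconstruct audio from one spectrogram per instrument, then sum
    # ("blend in post") into a single mix. Spectrogram files are hypothetical.
    import numpy as np
    import librosa
    import soundfile as sf

    def spectrogram_to_audio(npy_path, hop_length=512):
        S = np.load(npy_path)                                 # magnitude spectrogram
        return librosa.griffinlim(S, hop_length=hop_length)

    stems = [spectrogram_to_audio(p) for p in
             ["drums_spec.npy", "bass_spec.npy", "keys_spec.npy"]]
    length = min(len(s) for s in stems)                       # trim to common length
    mix = sum(s[:length] for s in stems)
    mix = mix / (np.max(np.abs(mix)) + 1e-9)                  # normalize to avoid clipping
    sf.write("mix.wav", mix, 22050)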
The vocals in these tracks are so interesting. They sound like vocals, with the right tone, phonemes, and structure for the different styles and languages, but no meaning.

Reminds me of the soundtrack to Nier Automata which did a similar thing: https://youtu.be/8jpJM6nc6fE

s.gif
That's glossolalia, and it's not that uncommon in human-created art.
s.gif
I think AI would be great at generating similar things. Might be very nice for generating fake languages, too.
Earlier this year it was graphic designers, last month it was software engineers, and now musicians are also feeling the effects.

Who else will AI make looking for a new job?

s.gif
Honestly none of them should. I think the moral panic around these things is way overstated. They are cool but hardly about to replace anyone's job.
s.gif
Have you tried AI asset generators? They work extremely well. Just yesterday a friend of mine showed me the progress they made in their game. It is incredible. Designers are 100% losing their jobs over this.
s.gif
I'm a professional game developer and excited AI enthusiast.

While I've seen a lot of cool stuff which helps generation for hobby projects or smaller indie games it's nowhere near the quality and consistency needed to come close to the work of a skilled human artist at a larger studio.

s.gif
Yes, and I studied Game Development in Germany for 3 years in Düsseldorf, got third place at the national Gameforge newcomer award with my team "Northlight Games", and still have many connections to people in the business (if that somehow matters). The quality of 2D assets is at a very professional level and has already replaced jobs in projects I know of.

To give you an example, join the public Discord of https://www.scenario.gg and check out the results. Come back and tell me those aren't at a professional level.

I am not saying that designers won't be needed anymore, but AI is definitely able to replace jobs and speed up progress in game development.

s.gif
Musicians were made to get a day job long before you were born ;)
s.gif
Although I do wonder how much an earlier technology, audio reproduction, contributed to that. My grandmother worked for a time as a piano player as part of a nightclub orchestra. It was a stable job back then. I have to wonder how many musician jobs were killed off by the jukebox and related technologies.
s.gif
If I was a musician, this post would not make me worry for a second
s.gif
If a hack based on an image generator already has promising results for music generation, then imagine what will happen if something dedicated to music is built from the ground up.
s.gif
Politicians, bureaucracy.

GPT-3, what policy should we apply to increase tax revenue by 5% given these constraints?

GPT-3, please tell me some populist thing to say to win the next election, or how should I deflect these corruption charges.

s.gif
"We should place a tax on all copyright lawyers and use it to fund GPU manufacturing and AI development. At your next stump speech, mention how the entertainment industry is stealing jobs from construction workers. Your corruption charges won't matter because voters only care about corruption when it's not in their favor."
s.gif
The raw outputs of these tools will be best consumed by experts. Until general AI, these are just better tools for the same workers.
s.gif
They were killed off by the ability to record the data. Every city used to have their own music stars :)
s.gif
This was the first AI thing to fill me with a feeling of existential dread.
s.gif
What is with the hyperbole in this thread? This stuff sounds like incoherent noise. It is noticeably worse than AI audio stuff I heard 5 years ago. What is going on with the responses here?
s.gif
I feel exactly the opposite way, but I suppose everyone has a different ear and taste. I think a good 3-4% of what this produces sounds damn amazing and beautiful. I've been vibing to it a lot. Fantastic stuff! There is also the feeling of shock and awe like with ChatGPT where you give it a prompt about a niche thing you think it will definitely not understand and it turns out it understands it shockingly well. As an example I just gave it a prompt "Avril 4th" and the result literally gave me chills.
s.gif
I assume the stuff from 5 years ago was essentially spitting out MIDI output, which would be fed into a traditional tool to play samples. So it's going to sound a lot sharper while being a lot less sophisticated. The real breakthrough here is that this is generating everything from scratch and it still resembles the prompt.

One of the automated prompts was "Eminem anger rap"; I'm confident that if you had shown me the audio without the prompt, I could identify which artist it sounded like.

And this is just a basic first attempt at reusing a tool not even designed for audio. I can only imagine how powerful it could be after some trivial revisions like using GPT-3 to generate coherent lyrics.

s.gif
Usage of an image generator to produce passable music fragments, even if they sound a bit distorted, is very surprising. That type of novelty is why we come here.
s.gif
People did the same with GANs years ago, with similarly odd results. I do think the kinks will eventually be ironed out, but I don't think this is it.