[ home / bans / all ] [ amv / jp ] [ maho ] [ f / ec ] [ qa / b / poll ] [ tv / bann ] [ toggle-new ]

/trans/ - Transparency and Moderation



[Return] [Bottom] [Catalog]

File:Screen Shot 2026-03-02 at ….png (444.29 KB,2003x1640)

 No.31314

So I've been working on a project to re-implement the VOCALOID1 engine.
I'm basing it on the description in Jordi Bonada's PhD thesis "Voice
Processing and Synthesis by Performance Sampling and Spectral Models"
and not the original papers as the former is more detailed, easier to
follow, and also describes the VOCALOID2 engine.

After a lot of trouble with getting TWM f0 estimation to work, I've
finally gotten to implementing MFPA. And amazingly, it seems to have
worked first try.

Compare my results:
https://[link-ommited][link-ommited][link-ommited]

To the results in the study:
https://[link-ommited][link-ommited][link-ommited]

Deleted by Verniy from >>>/jp/ Post No. 112653 (OP)

 No.31315

neat

Moved by followup

 No.31316

File:Screen Shot 2026-03-02 at ….png (282.25 KB,1754x1278)

Also here's the graph of the f0 estimate from the TWM. It's still somewhat flawed (see the jump down at frame 37), and it required using unusual parameters (Kaiser-Bessel beta [link-ommited] instead of the [link-ommited] recommended by the study, and only 6 harmonics instead of 11) to avoid instabilities even in relatively trivial scenarios.

This graph specifically shows the estimated fundamental frequency for each 256-point frame of an E4 /e/ phoneme.

Moved by followup

 No.31317

File:[MoyaiSubs] Mewkledreamy -….jpg (282.54 KB,1920x1080)

I'm too dumb for this as I say every time, but cool!
You might as well be explaining quantum mechanics and brain surgery to me at the same time.

So uhh... what does it mean in dumb people words?

Moved by followup

 No.31318

>>31317
Basically, when you talk or sing, your glottis opens and closes at the rate of the fundamental frequency of the sound you're making. When we want to transform the sound, we break it up into "analysis windows" and apply the short-time Fourier transform to convert each window into a spectrum we can manipulate. If the windows aren't aligned to the glottal pulses, with a pulse in the middle of each analysis window, the result will sound bad. So we use this algorithm to shift the analysis windows so that the pulse onsets are centered.

Moved by followup

 No.31319

>>31318
Hmm... so it sounds better. Something similar to autotune?

Moved by followup

 No.31320

>>31319
>Something similar to autotune?
I guess that's kind of how VOCALOID works. MFPA specifically is about alignment, because if it's not aligned to the glottal pulses, it will result in artifacts in the final sound after transformations are applied.

Moved by followup

 No.31321

Honestly, you might be able to understand even if you don't think you can. If you just skim through a paper, it will probably seem like you don't understand - I know I didn't feel like I would understand when I first skimmed through the paper. But if you read through it from the beginning, slowly and carefully, it will make sense. If you're interested, here's the link to the paper: https://[link-ommited][link-ommited][link-ommited]

Moved by followup

 No.31322

>>31320
Hmm, I see, I see. "Alignment" to me sounds like something that would be made more harmonious and pleasant sounding, since "noise" (unpleasant sounds) is basically a bunch of spiky sound waves, isn't it?
Are you planning on doing stuff with this, or are you satisfied just exploring what it all means? I can't understand it, so I don't know what this entails - like "I am doing X so I can do Y tomorrow", but I don't understand the X, so I don't know what Y you may be working towards.

Moved by followup

 No.31323

>>31322
>Hmm, I see, I see. "Alignment" to me sounds like something that would be made more harmonious and pleasant sounding, since "noise" (unpleasant sounds) is basically a bunch of spiky sound waves, isn't it?
No, it's about phase alignment, not about the frequency envelope.
>Are you planning on doing stuff with this, or are you satisfied just exploring what it all means? I can't understand it, so I don't know what this entails - like "I am doing X so I can do Y tomorrow", but I don't understand the X, so I don't know what Y you may be working towards.
I would like to release it as an open-source library eventually when I'm done. I would like to include support for both the VOCALOID1 and VOCALOID2 engines.

You might think there's an issue with patents, but I think I should be fine after doing quite a bit of research. VOCALOID1 is from 2003 and all the patents have expired. VOCALOID2 still has active patents, but these pertain to additional techniques and uses (cross-synthesis/XSY, growl, chorus, and real-time synthesis) and not the core engine.

Well, except for one patent. As far as I can tell, there is exactly one patent that covers the core synthesis engine of VOCALOID2+ - US patent 8,706,496. At first, I thought this spelled big trouble for the project. But I was able to reach out to none other than Dr. Jordi Bonada himself - the author of the paper and the inventor of VOCALOID - and got a clarification that the patent only pertains to a specific technique for precise estimation of harmonic sinusoidal parameters.

Very recently, I realized an important detail: all three independent claims in the patent contain an element specifying a specific interpolation technique for non-power-of-two FFTs, and there are no dependent claims mentioning the use of other interpolation methods. Meaning, I believe I can just use a different interpolation method and it shouldn't fall under the scope of the patent. In fact, there is a specific interpolation method that I think might be both more efficient to compute and more accurate. I haven't tested any of this yet, but I see no reason why it shouldn't work.

Moved by followup

 No.31324

Also, when I'm done with the VOCALOID1 engine, I'm thinking of making a post here soliciting samples of your own voices, if you would like to hear what they sound like when run through the engine.

Moved by followup

 No.31325

>>31323
>I would like to release it as an open-source library eventually when I'm done. I would like to include support for both the VOCALOID1 and VOCALOID2 engines.
>You might think there's an issue with patents, but

Well, my advice would be to release it and announce it at the same time, just in case. Lots of people spend months announcing things on twitter trying to get recognition, and then before they can release anything they get hit with DMCA takedowns or other legal challenges. But once it's out there it's out there, so announce it at the same time as the download.

Moved by followup

 No.31326

>>31314
This is neat, good luck to you OP. If you need voice samples i'm sure the fellas over at 39chan would probably help, they've made a bunch of collaborative voicebanks in the past.

Moved by followup

 No.31327

>>31326
>If you need voice samples i'm sure the fellas over at 39chan would probably help, they've made a bunch of collaborative voicebanks in the past.
I'm not looking to create a voice bank. I meant it in the sense that you could submit a few samples to hear what your voice would sound like if run through VOCALOID.

Moved by followup

 No.31328

File:fascinating.jpg (20.16 KB,249x238)

>>31314
Interesting. What do you plan on doing with this experiment afterwards?

Moved by followup

 No.31329

>>31328
I plan on releasing it as an open-source library for VOCALOID enjoyers and people interested in vocal synthesis to experiment with. It's written in clean C and doesn't use any external libraries.

Moved by followup

 No.31330

>>31314
What is there to the vocaloid engines? Is this any different from a synth plugin you might add to FL Studio?

Moved by followup

 No.31331

I suppose the vocaloid engine is a modifier for voices allowing you to adjust sounds to have different waveforms? So you're creating the engine behind a dashboard you might find on FL Studio that allows you to adjust sounds to your liking, except this is more complicated since you're attempting to replicate human waveforms and not accentuate frequencies in existing sounds to adjust timbre?

Moved by followup

 No.31332

Moved to >>>/maho/6499.

Moved by followup



