When I started working on voice-based Parkinson's detection, the term MFCC showed up in every paper I read. Mel-Frequency Cepstral Coefficients. At first it sounded like pure jargon. But once I actually understood what it does and why it was designed the way it was, it became one of my favorite things in signal processing.

This post is the explanation I wish I'd had when I started.

The Problem with Raw Audio

When you record someone's voice, what you get is a waveform. Just a long sequence of numbers representing air pressure over time. It contains everything: the words, the pitch, the breathing, the room noise. All of it mixed together.

You could technically feed this directly into a machine learning model, but it would struggle. The raw signal is too high-dimensional, too sensitive to irrelevant variation like background noise or recording volume, and not structured in a way that reflects how we actually perceive sound. You need a smarter representation.

That's the whole point of MFCCs.

What MFCCs Actually Do

MFCCs transform a raw audio signal into a compact set of numbers that describe the texture and quality of the voice at a given moment. Not what words were said, not the pitch exactly, but the overall shape of the sound spectrum in a way that's tuned to human hearing.

The key insight is the Mel scale. Human ears are not linear. We're very sensitive to differences between 200Hz and 400Hz, but barely notice the difference between 8000Hz and 8200Hz. The Mel scale compresses high frequencies and expands low ones to match how we actually perceive sound. By building our features around this scale, we get representations that are much more meaningful for speech-related tasks.
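To make this concrete, one common formula for the Hz-to-Mel conversion (the HTK convention, which librosa also supports) is mel = 2595 * log10(1 + f/700). A quick sketch shows why the two frequency gaps mentioned above feel so different:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale (HTK formula)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 200 Hz gap is perceptually much larger at low frequencies:
low_gap = hz_to_mel(400) - hz_to_mel(200)      # roughly 226 mel
high_gap = hz_to_mel(8200) - hz_to_mel(8000)   # roughly 26 mel
print(low_gap, high_gap)
```

The low-frequency gap comes out almost ten times wider on the Mel scale than the high-frequency one, even though both are 200 Hz apart.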

Think of it as a fingerprint for a voice. A small set of numbers that captures its texture and character in a way that's both human-relevant and machine-readable.

The Extraction Pipeline

Here's what happens step by step when you extract MFCCs from a voice recording:

1. Pre-emphasis: Boost the high frequencies slightly to balance the spectrum, since low frequencies naturally dominate audio.
2. Framing: Cut the signal into short overlapping windows, typically 25ms each. Voice changes over time so you need to analyze it locally.
3. Windowing: Apply a Hamming window to each frame to smooth out the edges and reduce spectral artifacts.
4. FFT: Convert each frame from time domain to frequency domain using the Fast Fourier Transform.
5. Mel Filter Bank: Pass the spectrum through a set of triangular filters spaced on the Mel scale to mimic auditory perception.
6. Log + DCT: Take the log of the energies, then apply a Discrete Cosine Transform to decorrelate the features and compress them down to typically 13 coefficients.
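The whole pipeline can be sketched with plain NumPy and SciPy. This is a simplified illustration run on a synthetic tone, not a production implementation; real toolkits add padding, filter normalization, and other details, and the frame/filter sizes here are just common defaults:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

sr = 16000
t = np.arange(0, 1.0, 1 / sr)
signal = np.sin(2 * np.pi * 220 * t)          # synthetic 220 Hz tone

# 1. Pre-emphasis: boost high frequencies
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# 2. Framing: 25 ms frames with a 10 ms hop
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
n_frames = 1 + (len(emphasized) - frame_len) // hop
frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                   for i in range(n_frames)])

# 3. Windowing: Hamming window on each frame
frames = frames * np.hamming(frame_len)

# 4. FFT: power spectrum of each frame
n_fft = 512
power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

# 5. Mel filter bank: 26 triangular filters evenly spaced on the Mel scale
n_filters = 26
mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
fbank = np.zeros((n_filters, n_fft // 2 + 1))
for i in range(n_filters):
    l, c, r = bins[i], bins[i + 1], bins[i + 2]
    fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
    fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
energies = power @ fbank.T

# 6. Log + DCT: decorrelate, keep the first 13 coefficients
mfccs = dct(np.log(energies + 1e-10), type=2, axis=1, norm="ortho")[:, :13]
print(mfccs.shape)   # (n_frames, 13)
```

In practice you would just call librosa, but walking through the steps once makes it obvious what each of the 13 numbers is summarizing.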

The result is a vector of 13 numbers per frame. That's it. Those 13 numbers capture the essential spectral shape of the voice at that moment, stripped of pitch and loudness variation that we don't need for most tasks.

Why This Matters for Parkinson's Detection

Parkinson's affects the motor system, including the muscles involved in producing speech. Patients often develop what's called dysphonia: subtle changes in voice quality. Breathiness, slight tremor, reduced pitch variation, changes in how smoothly they sustain sounds. These changes can appear before the tremors in the hands become obvious.

MFCCs capture exactly these kinds of spectral changes. In my research, I combine them with features like Jitter (how much the pitch fluctuates cycle to cycle), Shimmer (how much the amplitude fluctuates), and HNR (the ratio of clean harmonic sound to noise). Together they give a rich picture of voice health.

In practice we extract 13 MFCCs per frame plus their delta and delta-delta coefficients, giving 39 features that capture not just the static voice quality but how it changes over time. This combination consistently ranks among the most informative in feature importance analyses for PD classification.
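Librosa has librosa.feature.delta for this; as a rough illustration of what deltas are, here is a simple NumPy approximation using frame-to-frame gradients (librosa itself uses a smoother local-regression filter, so the exact values differ):

```python
import numpy as np

def add_deltas(mfccs):
    """Stack MFCCs with first- and second-order differences across frames."""
    delta = np.gradient(mfccs, axis=1)        # how each coefficient changes
    delta2 = np.gradient(delta, axis=1)       # acceleration of that change
    return np.vstack([mfccs, delta, delta2])  # shape: (39, n_frames)

# Hypothetical example: 13 coefficients over 100 frames
mfccs = np.random.randn(13, 100)
features = add_deltas(mfccs)
print(features.shape)  # (39, 100)
```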

A Quick Code Example

With Librosa in Python, extracting MFCCs takes about four lines:

import librosa
import numpy as np

y, sr = librosa.load("voice_sample.wav", sr=22050)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
mfcc_mean = np.mean(mfccs, axis=1)  # shape: (13,)

# Feed mfcc_mean into any classifier you want
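To close the loop, here is a hedged sketch of what "feed it into a classifier" might look like with scikit-learn. The data here is purely synthetic (random numbers standing in for per-recording MFCC means, with made-up labels), so the score is meaningless; the point is only the shape of the workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 40 recordings x 13 mean MFCCs, binary labels
X = rng.normal(size=(40, 13))
y = rng.integers(0, 2, size=40)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean())
```

With a real dataset you would replace X with one row of averaged (or delta-stacked) MFCCs per recording and y with the clinical labels.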

What MFCCs Can't Do

They're not perfect. MFCCs throw away phase information and don't capture prosody or timing well. They're also sensitive to background noise, which is a real problem in clinical settings where recordings aren't always clean.

Modern deep learning models like wav2vec 2.0 learn their own representations directly from data and can outperform handcrafted features when you have enough training data. But in medical voice analysis, datasets are almost always small. In that context MFCCs remain competitive, easier to interpret, and much faster to work with.

For anyone starting out with voice-based medical AI, learning MFCCs properly is still absolutely worth it. Understanding why they work gives you the intuition to design better experiments and interpret your results more carefully.
