The Short-Time Fourier Transform and Speech Denoising
This module explains one form of the short-time Fourier transform (STFT) and how to use it to filter a noisy speech signal. Specifically, the module describes the STFT with 50% overlap and the perfect reconstruction conditions the window should satisfy.

Introduction

The short-time Fourier transform (STFT) of a signal consists of the Fourier transform of overlapping windowed blocks of the signal. In this note, we assume the overlapping is by 50% and we derive the perfect reconstruction condition for the window function, denoted w(n)w(n).

The window w(n)w(n) is assumed to be supported (non-zero) for n=0,,N-1n=0,,N-1. For example, [link] shows a window of length N=10N=10. In this example, the window is a symmetric half-cycle sine window. The N-point half-cycle sine window is given by

w ( n ) = sin π N ( n + 0 . 5 ) , n = 0 , , N - 1 . w ( n ) = sin π N ( n + 0 . 5 ) , n = 0 , , N - 1 .

The figure shows several shifted windows. The shifted windows are shifted by half the length of the window. (In practice, the window is much longer than 10 samples. A short window is used here for illustration.)

A window w(n)w(n) of length N=10N=10 and several shifted windows w(n-m·N/2)w(n-m·N/2).

The m-th windowed block of the signal x(n)x(n) is given by x(n)·w(n-m·N/2)x(n)·w(n-m·N/2). For example, [link] shows several windowed blocks. The m-th windowed block is denoted as s(m,n)s(m,n):

s ( m , n ) : = x ( n ) · w ( n - m · N / 2 ) s ( m , n ) : = x ( n ) · w ( n - m · N / 2 )

For example, [link] shows the m-th windowed block for m=0,,4m=0,,4.

Windowed blocks x(n)·w(n-m·N/2)x(n)·w(n-m·N/2). The shifted windows are shown in [link].

The short-time Fourier transform is obtained by taking the DTFT of each windowed block:

S ( m , ω ) : = DTFT { x ( n ) · w ( n - m · N / 2 ) } S ( m , ω ) : = DTFT { x ( n ) · w ( n - m · N / 2 ) }

The short-time Fourier transform of a discrete-time signal x(n)x(n) is denoted by

S ( m , ω ) = STFT { x ( n ) } . S ( m , ω ) = STFT { x ( n ) } .

In practice, the DTFT is computed using the DFT or a zero-padded DFT.

Inverse STFT

The inverse STFT begins with the inverse DTFT of S(m,ω)S(m,ω) to recover s(m,n)s(m,n).

s ( m , n ) = DTFT - 1 { S ( m , ω ) } s ( m , n ) = DTFT - 1 { S ( m , ω ) }

Now, from s(m,n)s(m,n) we wish to recover x(n)x(n) by multiplying each s(m,n)s(m,n) by the shifted window w(n-m·N/2)w(n-m·N/2) and adding the results. We will use the same window used in the forward STFT. Multiplying the m-th windowed block by the shifted window gives:

s ( m , n ) · w ( n - m · N / 2 ) s ( m , n ) · w ( n - m · N / 2 )

which are illustrated in [link]. [link] shows the windowed s(m,n)s(m,n) for m=0,,4m=0,,4.

The next step of the inverse STFT adds these overlapping blocks to obtain the final signal y(n)y(n):

y ( n ) = m s ( m , n ) · w ( n - m · N / 2 ) y ( n ) = m s ( m , n ) · w ( n - m · N / 2 )

We have called this the `inverse' STFT, however, it is only an inverse if y(n)=x(n)y(n)=x(n), which in turn depends on the window w(n)w(n). If the window is not chosen correctly, then the reconstructed signal y(n)y(n) will not be equal to the original signal x(n)x(n).

For the inverse STFT the overlapping signals s(m,n)·w(n-m·N/2)s(m,n)·w(n-m·N/2) are added.

Perfect Reconstruction Condition

How should the window w(n)w(n) be chosen so as to ensure that the `inverse' STFT really is an inverse? Combining [link] and [link] gives

y ( n ) = m x ( n ) · w 2 ( n - m · N / 2 ) . y ( n ) = m x ( n ) · w 2 ( n - m · N / 2 ) .

If we define the squared window function

p ( n ) : = w 2 ( n ) p ( n ) : = w 2 ( n )

then [link] can be written as

y ( n ) = x ( n ) m p ( n - m · N / 2 ) . y ( n ) = x ( n ) m p ( n - m · N / 2 ) .

For perfect reconstruction (y(n)=x(n)y(n)=x(n)) it is necessary that

m p ( n - m N / 2 ) = 1 . m p ( n - m N / 2 ) = 1 .

The half-cycle sine window satisfies this condition as illustrated in [link]. Note that the first N/2N/2 samples and the last N/2N/2 samples are exceptions. The beginning and end of the signal can be inverted using a different procedure or, if the signal is long then these relatively few points at the beginning and end may not matter.

Illustration of the perfect reconstruction condition [link] for the 10-point half-cycle sine window.

Note that the left-hand-side of [link] is periodic with period N/2N/2. Therefore it is necessary to check the condition only for n=0,,N/2-1n=0,,N/2-1 (or for any other N/2N/2 range of n). Moreover, over this interval, only two terms contribute; so the perfect reconstruction simplifies to:

p ( n ) + p ( n + N / 2 ) = 1 , n = 0 , , N 2 - 1 . p ( n ) + p ( n + N / 2 ) = 1 , n = 0 , , N 2 - 1 .

Therefore, the perfect reconstruction condition is

w 2 ( n ) + w 2 ( n + N / 2 ) = 1 , n = 0 , , N 2 - 1 . w 2 ( n ) + w 2 ( n + N / 2 ) = 1 , n = 0 , , N 2 - 1 .

The half-cycle sine window [link] satisfies [link] as illustrated in [link]. Basically it satisfies [link] because of the trigonometric identity cos2(θ)+sin2(θ)=1cos2(θ)+sin2(θ)=1. Many other windows can also be designed that satisfy [link]. Perhaps the simplest N-point window satisfying [link] is the rectangular window,

w ( n ) = 1 2 , n = 0 , , N - 1 , w ( n ) = 1 2 , n = 0 , , N - 1 ,

however, this window is not a good window because it is not tapered (smooth) at its ends. It can therefore cause discontinuities at block boundaries when the STFT is used for noise reduction, signal enhancement, or other applications. This discontinuity is sometimes audible as a low noise.

Illustration of the perfect reconstruction condition [link] for the 10-point half-cycle sine window.

Speech Noise Reduction

The STFT can be used to reduce noise in a speech signal (or other highly oscillatory signal). A simple method consists of three steps:

  1. Compute the STFT of the noisy signal.
    S(m,ω)=STFT{x(n)}S(m,ω)=STFT{x(n)}
  2. Threshold the STFT.
    S2(m,ω)=g(S(m,ω))S2(m,ω)=g(S(m,ω))
    where g(a)g(a) is the thresholding function:
    g(a)=0,|a|Ta,|a|>Tg(a)=0,|a|Ta,|a|>T
    All values less than the threshold T in absolute value get set to zero. center
  3. Compute the inverse STFT.
    y(n)=STFT-1{S2(m,ω)}.y(n)=STFT-1{S2(m,ω)}.

This is illustrated by the block diagram:

center

Example: Speech denoising by STFT-thresholding is illustrated in Figures [link] and [link]. [link] shows a noisy speech signal and its spectrogram computed using 50% overlapping. The noise is visible in both the signal and in the spectrogram. Setting the small spectrogram values to zero (thresholding) results in the spectrogram shown in [link]. Clearly, much of the noise in the spectrogram is eliminated. After this thresholding, the inverse STFT is computed to obtain the denoised signal also shown in [link]. [link] shows a closer view of the noisy and denoised speech signal for closer comparison.

STFT-thresholding is a non-linear noise reduction technique because the thresholding operation is non-linear. It can yield noise reduction results that are not possible with any linear time-invariant (LTI) filter.

This method is sensitive to the choice of threshold T. If the threshold T is too large, then the signal will be highly distorted. If T is too small, then the result will still be noisy.

This basic algorithm described here can be improved in several different ways. Instead of using a fixed threshold T, one can vary the threshold as a function of time and/or frequency. In addition, one can use different nonlinearities instead of the threshold function [link]. Also, one can develop methods to automatically estimate an appropriate threshold value which is needed in practice.

Noisy signal and its STFT (spectrogram). The noise visible in the spectrogram is spread out across time and frequency.
Denoised signal and its STFT (spectrogram). Thresholding the spectrogram eliminates most of the noise.
Noisy and denoised speech signal (magnified view).

Exercises

  1. Write a MATLAB program to implement the STFT with 50% overlapping and a second program to implement its inverse. Verify numerically that the inverse program reconstructs a test signal.
  2. Find another window satisfying the perfect reconstruction condition for 50% overlapping, use it with your STFT program, and verify that it reconstructs a test signal.
  3. Find the perfect reconstruction (PR) condition for an STFT with 2/32/3 overlapping. Find a window that satisfies your PR condition. Write a MATLAB program to implement the STFT and its inverse with 2/3 overlapping and verify the reconstruction of a test signal.
  4. Use the STFT and thresholding to reduce the noise level in a noisy speech signal. Try to use your own speech signal recorded using a computer (most audio recordings with consumer equipment contain some hiss; try to reduce the sound of the hiss). In case your STFT program does not work, you may use the provided function my_stft provided as a .p file. A .p file is a type of MATLAB file that can be executed in MATLAB but whose contents are not viewable (see the pcode function).