The input to the network consists of mel-spectrograms, with 599 frames and 128 frequency bins. A mel-spectrograms is a kind of time-frequency representation. It is obtained from an audio signal by computing the Fourier transforms of short, overlapping windows. Each of these Fourier transforms constitutes a frame. These successive frames are then concatenated into a matrix to form the spectrogram. Finally, the frequency axis is changed from a linear scale to a mel scale to reduce the dimensionality, and the magnitudes are scaled logarithmically.