$\begingroup$

I'm struggling to find a reliable way to estimate the quality of harmonic vocalizations in African penguins. Specifically, I'm interested in the 'b' syllable, the longest syllable in their Ecstatic Display Song (EDS), which is composed of multiple elements.

[Four images attached: example recordings]

I’m working with recordings like the ones attached and want to assess the relative quality of different calls. I initially tried using Signal-to-Noise Ratio (SNR), but I'm encountering issues because the strong harmonic structure of the calls fills the spectrogram, leaving little to no background for a clean noise estimate.

I’ve attempted to extract a background segment (e.g., 0.2 seconds before the vocalization), but faced two main problems:

  • Overlapping signals: other syllables often occur just before or after the target call, contaminating the background estimate and leading to unreliable SNR values.
  • Fixed-window CNN detections: I also need to assess the quality of calls detected by a CNN model, but these detections are based on fixed-length windows that are not always well-centered on the syllable, making it harder to define signal and noise regions precisely.
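For reference, the background-window SNR described above can be sketched as follows. This is a minimal illustration, not the exact procedure used on the recordings: the function names (`rms`, `snr_db`), the 0.2 s default and the synthetic test signal are all assumptions, and it uses only the standard library (NumPy would be the natural choice for real audio). The synthetic example shows how a neighbouring syllable inside the background window pulls the estimate down.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(recording, sr, call_start, call_end, noise_dur=0.2):
    """SNR in dB of the call segment relative to the noise_dur seconds
    immediately before it (times in seconds, sr in Hz)."""
    i0 = int(round(call_start * sr))
    i1 = int(round(call_end * sr))
    n0 = max(0, i0 - int(round(noise_dur * sr)))
    noise, call = recording[n0:i0], recording[i0:i1]
    if not noise or not call:
        raise ValueError("call or background segment is empty")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("background segment is silent; SNR is undefined")
    return 20.0 * math.log10(rms(call) / noise_rms)

# Synthetic data (assumption): 0.5 s of weak noise, then a 0.5 s 400 Hz "call".
rng = random.Random(0)
sr = 8000
background = [rng.gauss(0.0, 0.01) for _ in range(sr // 2)]
call = [math.sin(2 * math.pi * 400 * n / sr) + rng.gauss(0.0, 0.01)
        for n in range(sr // 2)]
clean = background + call

# Same recording, but with a neighbouring "syllable" (300 Hz, half amplitude)
# occupying the 0.2 s window that is used as the background estimate.
contaminated = list(clean)
for n in range(3 * sr // 10, sr // 2):
    contaminated[n] += 0.5 * math.sin(2 * math.pi * 300 * n / sr)

clean_snr = snr_db(clean, sr, 0.5, 1.0)
contaminated_snr = snr_db(contaminated, sr, 0.5, 1.0)
print(f"clean background:        {clean_snr:.1f} dB")
print(f"contaminated background: {contaminated_snr:.1f} dB")
```

The call itself is identical in both cases; only the background window differs, which is exactly the failure mode listed in the first bullet.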

Has anyone dealt with a similar issue when working with harmonic-rich signals? I’d really appreciate suggestions for alternative metrics or approaches. I’ve also considered harmonicity-based measures, such as the Harmonic-to-Noise Ratio (HNR), but I’m not convinced it's the right metric in this context. Even faint (low-amplitude) vocalizations can still produce high HNR values if they are relatively clean, while stronger calls that are slightly masked by background noise might score lower. As a result, the metric doesn't seem to reliably reflect perceived call quality or prominence.
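The amplitude-independence of HNR can be illustrated with a simplified autocorrelation-based estimate. The sketch below is an assumption-laden toy, not the algorithm of any particular package: `hnr_db`, the 100 to 500 Hz search range and the synthetic 200 Hz harmonic signal are all hypothetical choices. It takes the highest normalized autocorrelation `r` within the allowed pitch lags and reports `10 * log10(r / (1 - r))`.

```python
import math
import random

def hnr_db(frame, sr, f_min=100.0, f_max=500.0):
    """Simplified HNR in dB from the peak normalized autocorrelation
    found at lags corresponding to f_max..f_min (Hz)."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    best = 0.0
    for lag in range(max(1, int(sr / f_max)), int(sr / f_min) + 1):
        a, b = x[:-lag], x[lag:]
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        if den > 0.0:
            best = max(best, sum(p * q for p, q in zip(a, b)) / den)
    if best <= 0.0:
        return float("-inf")          # no periodicity found
    best = min(best, 1.0 - 1e-9)      # cap to avoid dividing by zero
    return 10.0 * math.log10(best / (1.0 - best))

def harmonic(n, sr, f0=200.0):
    """Three-harmonic test tone (assumption, not a penguin call)."""
    t = n / sr
    return (math.sin(2 * math.pi * f0 * t)
            + 0.5 * math.sin(2 * math.pi * 2 * f0 * t)
            + 0.25 * math.sin(2 * math.pi * 3 * f0 * t))

rng = random.Random(1)
sr, n_samples = 8000, 1024

# Loud call partly masked by noise vs. faint call with very little noise.
loud_masked = [harmonic(n, sr) + rng.gauss(0.0, 0.3) for n in range(n_samples)]
faint_clean = [0.01 * harmonic(n, sr) + rng.gauss(0.0, 0.0005)
               for n in range(n_samples)]

loud_hnr = hnr_db(loud_masked, sr)
faint_hnr = hnr_db(faint_clean, sr)
scaled_hnr = hnr_db([0.01 * s for s in loud_masked], sr)  # same frame, 40 dB quieter

print(f"loud, masked: {loud_hnr:.1f} dB")
print(f"faint, clean: {faint_hnr:.1f} dB")
print(f"loud frame scaled by 0.01: {scaled_hnr:.1f} dB")
```

Because the autocorrelation is normalized, rescaling a frame leaves the value unchanged, so the faint but clean frame scores higher than the loud, masked one. That matches the concern above: HNR measures cleanliness rather than prominence.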

Any ideas or recommendations on how to better capture the signal quality or salience would be greatly appreciated!

Thanks in advance!

$\endgroup$

1 Answer

$\begingroup$

There may indeed be an alternative approach to these harmonic-rich signals.

Very often, spectrograms of this type are a product (an artefact) of the spectrogram processing. Whenever a pulsed signal has a large number of pulses within the spectrogram window, you can get effects like this, because a pulse train repeating at a fixed rate shows up as harmonic bands spaced at that rate.

So I would take these 'b' syllables and zoom into the time series. If they are indeed composed of a sequence of pulses, I would then characterize them by pulse rate (pulses per second) and amplitude.
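A minimal sketch of that time-domain characterization, under stated assumptions: the helper names (`envelope`, `find_pulses`, `pulse_stats`), the 2 ms smoothing, the 50 % threshold and the 5 ms minimum gap are illustrative choices that would need tuning on real recordings, and the test signal is a synthetic pulse train, not a penguin call. It rectifies and smooths the waveform, picks envelope peaks, and reports pulses per second and a relative amplitude.

```python
import math
import statistics

def envelope(samples, sr, smooth_s=0.002):
    """Amplitude envelope: rectify, then moving average of about smooth_s seconds."""
    half = max(1, int(round(smooth_s * sr)) // 2)
    width = 2 * half + 1
    csum = [0.0]
    for s in samples:
        csum.append(csum[-1] + abs(s))
    n = len(samples)
    # Dividing by the full width treats samples beyond the edges as zeros.
    return [(csum[min(n, i + half + 1)] - csum[max(0, i - half)]) / width
            for i in range(n)]

def find_pulses(env, sr, rel_threshold=0.5, min_gap_s=0.005):
    """Indices of envelope peaks above rel_threshold * max(env); peaks closer
    than min_gap_s are merged, keeping the higher one."""
    thr = rel_threshold * max(env)
    min_gap = int(round(min_gap_s * sr))
    peaks = []
    for i in range(1, len(env) - 1):
        if env[i] >= thr and env[i] > env[i - 1] and env[i] >= env[i + 1]:
            if peaks and i - peaks[-1] < min_gap:
                if env[i] > env[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks

def pulse_stats(samples, sr):
    """Return (peak indices, pulses per second, mean envelope height at the peaks)."""
    env = envelope(samples, sr)
    peaks = find_pulses(env, sr)
    if len(peaks) < 2:
        raise ValueError("need at least two pulses to estimate a rate")
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]
    rate = sr / statistics.median(intervals)
    amplitude = statistics.mean(env[p] for p in peaks)
    return peaks, rate, amplitude

# Synthetic pulse train (assumption): a decaying 1.5 kHz burst every 10 ms.
sr = 8000
signal = [0.0] * (sr // 2)
for onset in range(40, len(signal), 80):
    for k in range(80):
        if onset + k < len(signal):
            signal[onset + k] += (0.8 * math.sin(2 * math.pi * 1500 * k / sr)
                                  * math.exp(-k / 8.0))

peaks, rate, amplitude = pulse_stats(signal, sr)
print(f"{len(peaks)} pulses, {rate:.1f} pulses/s, relative amplitude {amplitude:.3f}")
```

The median inter-pulse interval is used so that a few missed or spurious peaks do not skew the rate. The amplitude here is in smoothed-envelope units, so it is only meaningful for comparing calls processed with the same settings.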

$\endgroup$
