Is it possible to detect completely blank audio within Unity?

Hi,

Total newbie here. I am recording audio within Unity to send it to a speech-to-text (STT) model and use it in game. Unfortunately, the STT doesn’t process audio that is completely blank, for example when the user presses the record button but doesn’t say anything.

In this case I want to be able to differentiate between such blank audio and audio that actually has speech in it, so that I can handle it myself before sending it to the STT.

So, is there any way to do it? Also, is there a way to trim out the silent parts of a recording and retain only the parts where the user actually spoke?

I tried googling, looked at docs… But it’s hard for me to find anything, cos as I already mentioned I’m a newbie and don’t even understand many terms that are being used.

So, I would greatly appreciate it if someone could point me to a resource that can help, or help me directly.

Thanks a lot

Hi,

We have a scripting example outlining adding a DSP to the master ChannelGroup to create a visualizer: Unity Integration | Scripting Examples - DSP Capture. We can then track the value of mDataBuffer, with a non-zero value indicating audio. Please note that this will track all audio being output, as the DSP is on the master channel group. If you only want to track the mic input, the DSP will have to be added to its own channel group.

In terms of only processing an audible signal, an option could be only passing in the sample data when mDataBuffer has a non-zero value.

Hope this helps!

I already have the float sample data, but it seems to have non-zero values for empty audio too. Any idea what could be causing it? Is it just the background noise the mic is picking up, and if so, is there a way to handle it?

The recording is done with RuntimeManager.CoreSystem.recordStart, so I think it’s using the Master Channel Group. Could that be the cause of the issue? If so, how can I use a separate channel group for this?

An update: I tried playing the audio that was being recorded. I noticed the audio volume is too low. It is barely audible. I think that’s causing the values to be very similar to empty noise. Any idea why this could be the case and how it can be fixed?

Yes, you can choose a threshold that a sample’s value has to exceed before you track it, e.g. ignore anything between -0.1 and 0.1. This should filter out background sounds.
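A minimal sketch of that threshold idea, applied to a float sample buffer that has already been extracted. The class name SilenceDetector, the 0.1 threshold, and the 1% minimum fraction are all assumptions for illustration and would need tuning per microphone; nothing here comes from the FMOD API.

```csharp
using System;

public static class SilenceDetector
{
    // Root-mean-square level of the buffer; returns 0 for a null or empty buffer.
    public static float Rms(float[] samples)
    {
        if (samples == null || samples.Length == 0)
            return 0f;

        double sumOfSquares = 0.0;
        foreach (var s in samples)
            sumOfSquares += (double)s * s;

        return (float)Math.Sqrt(sumOfSquares / samples.Length);
    }

    // Fraction (0..1) of samples whose absolute value exceeds the threshold.
    public static float FractionAbove(float[] samples, float threshold)
    {
        if (samples == null || samples.Length == 0)
            return 0f;

        int count = 0;
        foreach (var s in samples)
        {
            if (Math.Abs(s) > threshold)
                count++;
        }

        return (float)count / samples.Length;
    }

    // Hypothetical decision rule: treat the buffer as silent when too few
    // samples rise above the threshold. Both defaults are assumptions.
    public static bool LooksSilent(float[] samples, float threshold = 0.1f, float minFraction = 0.01f)
    {
        return FractionAbove(samples, threshold) < minFraction;
    }
}
```

Counting the fraction of samples above the threshold, rather than checking for any single one, keeps an isolated click from being treated as speech. Rms is included because an average level over the whole buffer is often a steadier number to compare than individual sample values.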

Depending on how you are playing the sound you pass to RuntimeManager.CoreSystem.recordStart, you may be able to just add the DSP to its channel: FMODUnity.RuntimeManager.CoreSystem.playSound(recSound, mCG, false, out channel /* add the DSP to this channel */);

Could I please get the full code that you are testing?

This is how we record the sound:

public bool StartRecording()
{
    if (IsRecording)
        return true;

    _activeMicrophone = _microphones.FirstOrDefault(obj => obj.IsConnected);
    if (_activeMicrophone.IsValid == false)
        return false;

    const uint maxLength = 30; // seconds

    _createSoundInfo.cbsize           = Marshal.SizeOf(typeof(CREATESOUNDEXINFO));
    _createSoundInfo.numchannels      = _activeMicrophone.ChannelsCount;
    _createSoundInfo.format           = SOUND_FORMAT.PCMFLOAT;
    _createSoundInfo.defaultfrequency = _activeMicrophone.SampleRate;
    _createSoundInfo.length           = maxLength * (uint)_activeMicrophone.SampleRate * sizeof(float) * (uint)_activeMicrophone.ChannelsCount;

    var result = RuntimeManager.CoreSystem.createSound(
        _createSoundInfo.userdata, MODE.LOOP_NORMAL | MODE.OPENUSER, ref _createSoundInfo, out _sound
    );
    if (result != RESULT.OK)
    {
        StopRecording();
        return false;
    }

    result = RuntimeManager.CoreSystem.recordStart(_activeMicrophone.Index, _sound, true);
    if (result != RESULT.OK)
    {
        StopRecording();
        return false;
    }

    return true;
}

And, this is how we extract the sound data:

private RecordedAudio ExtractAudioAndRelease()
{
    _ = _sound.getLength(out var byteCount, TIMEUNIT.PCMBYTES);
    _ = _sound.@lock(0, byteCount, out var readData, out var ptr2, out var readBytes, out var len2);
    Assert.IsTrue(readBytes <= byteCount);

    _ = _sound.getFormat(out _, out _, out var channels, out _);
    _ = _sound.getDefaults(out var frequency, out _);

    // TODO: If the following allocation starts creating problems with GC performance consider implementing a global pool allocator
    var bytes = new byte[readBytes];
    Marshal.Copy(readData, bytes, 0, (int)readBytes);

    _ = _sound.unlock(readData, ptr2, readBytes, len2);
    _ = _sound.release();

    var length = bytes.FindLastIndex(value => value != 0);
    if (length == -1)
        return default;

    return new RecordedAudio {
        Data      = BytesToFloats(bytes, length),
        Frequency = (int)frequency,
        Channels  = channels
    };
}
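The BytesToFloats helper is not shown in this thread, so the following is only a guess at what such a conversion could look like for PCMFLOAT data (32-bit floats in native byte order). PcmFloatConverter and ToFloats are hypothetical names. One thing the sketch handles explicitly: FindLastIndex returns an index rather than a length, and that index can land in the middle of a 4-byte sample, so the byte count is rounded up to a whole sample before copying.

```csharp
using System;

public static class PcmFloatConverter
{
    // Converts raw PCMFLOAT bytes to floats, keeping everything up to and
    // including the sample that contains the last non-zero byte.
    public static float[] ToFloats(byte[] bytes, int lastNonZeroIndex)
    {
        if (bytes == null || lastNonZeroIndex < 0)
            return Array.Empty<float>();

        const int bytesPerSample = sizeof(float);

        // Ignore any incomplete sample at the very end of the buffer.
        int wholeBytes = bytes.Length - bytes.Length % bytesPerSample;

        // Index -> byte count, then round up to a full sample boundary.
        int wanted = lastNonZeroIndex + 1;
        int rounded = (wanted + bytesPerSample - 1) / bytesPerSample * bytesPerSample;
        int usable = Math.Min(rounded, wholeBytes);

        var floats = new float[usable / bytesPerSample];
        Buffer.BlockCopy(bytes, 0, floats, 0, usable); // count is in bytes
        return floats;
    }
}
```

If the real helper treats its second argument as a byte length and does not round, the last sample of the recording could be truncated or dropped, which would be worth checking.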

Hi,

Thank you for the code. Unfortunately, I cannot see how you are starting the _sound. I believe you could apply the filter to the bytes. Could you try iterating through readData and only adding the values that fall outside your filter range?
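A rough sketch of that kind of filtering, written against the converted float array rather than the raw readData pointer. It only trims leading and trailing quiet samples and leaves pauses in the middle alone. SilenceTrimmer, its parameters, and the assumption of mono (non-interleaved) data are all illustrative and not part of FMOD.

```csharp
using System;

public static class SilenceTrimmer
{
    // Returns the span from the first to the last sample whose absolute value
    // exceeds the threshold, optionally keeping `padding` extra samples on
    // each side. Returns an empty array when nothing exceeds the threshold.
    public static float[] Trim(float[] samples, float threshold, int padding = 0)
    {
        if (samples == null)
            return Array.Empty<float>();

        int first = Array.FindIndex(samples, s => Math.Abs(s) > threshold);
        if (first < 0)
            return Array.Empty<float>();

        int last = Array.FindLastIndex(samples, s => Math.Abs(s) > threshold);

        // Clamp the padded range to the bounds of the buffer.
        int start = Math.Max(0, first - padding);
        int end = Math.Min(samples.Length - 1, last + padding);

        var trimmed = new float[end - start + 1];
        Array.Copy(samples, start, trimmed, 0, trimmed.Length);
        return trimmed;
    }
}
```

An empty result doubles as a "nothing but background" signal, so the same call can be used to decide whether to send the audio on at all. For multi-channel recordings the start and end positions would also need to be aligned to whole frames.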

Hope this helps.

So, I don’t want to play the sound back to the user. Instead, I’m sending the audio data to an STT model, which then transcribes the audio to text. I tried to play the sound back only for debugging purposes, and this is how I did it.

RuntimeManager.CoreSystem.getMasterChannelGroup(out var channelGroup);
RuntimeManager.CoreSystem.playSound(_sound, channelGroup, false, out _);

And, the thing here is I’m not trying to remove silences from the recorded audio… I’m only trying to differentiate between audio which has some speech in it and audio which just has background noise.

The problem seems to be that the values for both the empty audio and speech audio are very similar. Sometimes the empty audio has higher values than the speech audio too. So, I am not able to define a specific threshold to filter out the empty audio.

I suspect this is because of the low volume of the recorded audio. It is barely audible and it is very close to the empty sound. So, if we can find a way to increase the volume of the speech, we can probably differentiate it from the empty audio.

I see, thanks for the info.

Could you try increasing the record level of the mic in the Sound settings of your operating system:
[Image: microphone record level in the operating system’s Sound settings]
Let me know if that helps differentiate between the background noise and actual voice.

It is already at maximum, unfortunately.
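With the OS input level already at maximum, gain can still be applied in software to the float samples before playback or before sending them to the STT. The sketch below scales a buffer so its loudest sample reaches a target peak. GainHelper and the 0.9 target are assumptions for illustration. Note that this raises background noise by the same factor as speech, so on its own it does not make the two easier to tell apart.

```csharp
using System;

public static class GainHelper
{
    // Largest absolute sample value in the buffer; 0 for null or empty input.
    public static float Peak(float[] samples)
    {
        float peak = 0f;
        if (samples == null)
            return peak;

        foreach (var s in samples)
            peak = Math.Max(peak, Math.Abs(s));

        return peak;
    }

    // Returns a scaled copy whose peak equals targetPeak.
    // An all-zero buffer is returned as an unchanged copy.
    public static float[] NormalizePeak(float[] samples, float targetPeak = 0.9f)
    {
        if (samples == null)
            return Array.Empty<float>();

        float peak = Peak(samples);
        if (peak <= 0f)
            return (float[])samples.Clone();

        float gain = targetPeak / peak;
        var result = new float[samples.Length];
        for (int i = 0; i < samples.Length; i++)
            result[i] = samples[i] * gain;

        return result;
    }
}
```

Comparing Peak for a silent take and a spoken take before normalizing would also show how far apart the two really are at the source.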