Sandy noise while streaming audio/voice

I have had this problem for a while. I am streaming voice audio and I constantly hear sandy-sounding noise glitches.
My setup: I receive updates from voice chat roughly every 10 ms and write that data to a ring buffer. FMOD then reads from that buffer through ReadPCMDataCallback roughly every 20 ms (the default callback rate). Everything works, but I keep hearing those “sandy” artefacts. Maybe you know why they are happening?
Below is my stripped-down test case, with fast-forward and other logic removed. I call ReceivedAudioCallback once per second with one second's worth of samples, which is just constant noise, and I still hear the sandy artefacts. My guess is that some thread is stalling between PCM callbacks?
I am attaching audio as well; maybe it will indicate something too.
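
For anyone who wants to reproduce the test feed described above, the one-second noise block could be generated roughly like this. This is only a sketch: the class name NoiseTestSignal, the fixed seed and the quarter-scale amplitude are assumptions for illustration and are not taken from the original project.

```csharp
using System;

// Hypothetical helper (not part of the original project): builds a block of
// interleaved 16-bit stereo white noise at 48 kHz for feeding the test case.
public static class NoiseTestSignal
{
    public const int SampleRate = 48000;
    public const int Channels = 2;

    // Returns `seconds` worth of interleaved PCM16 samples (L, R, L, R, ...).
    public static short[] Generate(int seconds, int seed = 1234)
    {
        var rng = new Random(seed);
        var samples = new short[SampleRate * Channels * seconds];
        for (int i = 0; i < samples.Length; i++)
        {
            // Roughly a quarter of full scale, so the noise is clearly audible
            // without clipping.
            samples[i] = (short)rng.Next(-8192, 8192);
        }
        return samples;
    }
}
```

One second at these settings is 96000 samples, or 192000 bytes. The array can be pinned with `fixed` and its pointer passed to ReceivedAudioCallback with samplesPerChannel set to 48000.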

/// <summary>
/// Bridges Discord voice data to FMOD audio playback.
/// </summary>
public class DiscordToFMOD
{
    // channel/sample rate defaults
    private const int DEFAULT_CHANNEL_COUNT = 2;
    private const int DEFAULT_SAMPLE_RATE = 48000;
    private const int PCM16_SIZE = sizeof(short);

    // Ring buffer and FMOD buffer configuration
    // This should give enough for both delay and jitter and update rate differences
    private const int SOUND_BUFFER_MS = 10000;//It was 2000 before
    private const int MAX_VOICE_DELAY_MS = 100;
    private const int DECODE_BUFFER_MS = 20;

    // This is the size of both fmod ring buffer (probably) and our own ring buffer used for writing from discord
    // We will create a buffer which will loop around with playing sound. We will write at X and read it with a bit of delay.
    private const int SOUND_BUFFER_SIZE = DEFAULT_SAMPLE_RATE * DEFAULT_CHANNEL_COUNT * PCM16_SIZE / 1000 * SOUND_BUFFER_MS;
    // Instead of looping with % SOUND_BUFFER_SIZE we will be doing cheaper & soundBufferMask operation, but for that we need power of two buffer
    private static int SoundBufferSizeRounded = SOUND_BUFFER_SIZE;
    private static int SoundBufferMask = SOUND_BUFFER_SIZE - 1;
    private const int MAX_VOICE_DELAY = DEFAULT_SAMPLE_RATE * DEFAULT_CHANNEL_COUNT * PCM16_SIZE / 1000 * MAX_VOICE_DELAY_MS;
    private const int DECODE_BUFFER_SIZE = DEFAULT_SAMPLE_RATE * DECODE_BUFFER_MS / 1000;

    private ConcurrentDictionary<ulong, AudioUser> _audioUsers = new ConcurrentDictionary<ulong, AudioUser>();
    private readonly NBG.Core.Logger Logger;

    public DiscordToFMOD(NBG.Core.Logger Logger)
    {
        //Main thread
        //NBG.Core.Log.LogInfo("DiscordToFMOD thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

        this.Logger = Logger;
        //This is ring buffer size in FMOD.
        //Read everything about it before changing.
        //https://www.fmod.com/docs/2.00/api/core-api-system.html#system_setdspbuffersize
        //RuntimeManager.CoreSystem.setDSPBufferSize(CALLBACK_RATE, 4);

        //We are looping the ring buffer with & instead of % as it is cheaper, but that requires a power-of-two buffer size
        int frameSize = DEFAULT_CHANNEL_COUNT * PCM16_SIZE;
        SoundBufferSizeRounded = math.ceilpow2(SOUND_BUFFER_SIZE);
        SoundBufferSizeRounded = (SoundBufferSizeRounded / frameSize) * frameSize;
        SoundBufferMask = SoundBufferSizeRounded - 1;
    }

    public void Destroy()
    {
        //Main thread
        //NBG.Core.Log.LogInfo("Destroy thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

        Dispose();
    }

    public void Dispose()
    {
        //Can be main thread; Can be some random discord thread
        //NBG.Core.Log.LogInfo("Dispose thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

        foreach (var audioUser in _audioUsers.Values)
        {
            if (audioUser != null)
            {
                audioUser.Dispose();
            }
        }
        _audioUsers.Clear();
    }

    /// <summary>
    /// Called when a remote participant is added or removed.
    /// Triggered once per remote participant:
    /// when you join an existing room it fires for each member already present,
    /// and it fires once for each client that joins later.
    /// It is not triggered for yourself.
    /// </summary>
    /// <param name="userId"></param>
    /// <param name="added"></param>
    public void OnRemoteParticipantStateChanged(ulong userId, bool added)
    {
        //Main thread
        //NBG.Core.Log.LogInfo("OnParticipantChanged thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

        if (added)
        {
            if (!_audioUsers.ContainsKey(userId))
            {
                var audioUser = new AudioUser();
                var success = _audioUsers.TryAdd(userId, audioUser);
            }
        }
        else
        {
            if (_audioUsers.TryRemove(userId, out var audioUser))
            {
                if (audioUser != null)
                {
                    audioUser.Dispose();
                    audioUser = null;
                }
            }
        }
    }

    /// <summary>
    /// Receives PCM audio from Discord and writes it into the user's buffer.
    /// This is not a constant stream: data is received from a user only while that user's voice is active
    /// </summary>
    /// <param name="userId"></param>
    /// <param name="data"></param>
    /// <param name="samplesPerChannel"></param>
    /// <param name="sampleRate"></param>
    /// <param name="channels"></param>
    public void ReceivedAudioCallback(
        ulong userId,
        IntPtr data,
        ulong samplesPerChannel, // for example: 480/960, etc..
        int sampleRate, // default: 48000
        ulong channels) // default: 2
    {
        //Some random discord thread
        //NBG.Core.Log.LogInfo("ReceivedAudioCallback thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);
        if (data == IntPtr.Zero)
        {
            NBG.Core.Asserts.IsTrue(false, "Discord returned null buffer.");
            return;
        }

        // We are assuming for now that discord will always return same sampleRate and channelCount
        // In reality sometimes we are receiving different, but for now just ignoring as we are not setting
        if (sampleRate != DEFAULT_SAMPLE_RATE || channels != DEFAULT_CHANNEL_COUNT)
        {
            NBG.Core.Log.LogWarningFormat("Discord non default sampleRate:{0} channelCount:{1}", sampleRate, channels);
            return;
        }

        // Don't do anything if audioUser is not present
        if (!_audioUsers.TryGetValue(userId, out var audioUser) || audioUser == null || audioUser.AudioBuffer == null)
        {
            return;
        }

        //We will write everything in two batches. End of buffer and start of buffer if looping is happening
        int frameSize = DEFAULT_CHANNEL_COUNT * PCM16_SIZE;
        //Discord should always return aligned bytes so no need to correct
        int byteCount = (int)(samplesPerChannel * channels * PCM16_SIZE);

        unsafe
        {
            int writePos = audioUser.WritePosition;
            NBG.Core.Asserts.IsTrue(writePos % frameSize == 0);
            int firstPart = Math.Min(byteCount, SoundBufferSizeRounded - writePos);
            int secondPart = byteCount - firstPart;

            fixed (byte* bufferPtr = audioUser.AudioBuffer)
            {
                byte* dstPtr = bufferPtr + writePos;
                byte* srcPtr = (byte*)data;

                Buffer.MemoryCopy(srcPtr, dstPtr, firstPart, firstPart);

                if (secondPart > 0)
                {
                    Buffer.MemoryCopy(srcPtr + firstPart, bufferPtr, secondPart, secondPart);
                }
            }
        }

        // How many bytes have been written but not yet read
        // Interlocked.Add keeps this update atomic, as the FMOD thread decrements the same counter
        Interlocked.Add(ref audioUser._availableBytes, byteCount);
        // Pointer where buffer is at for writing
        audioUser.WritePosition = (audioUser.WritePosition + byteCount) & SoundBufferMask;
    }


    /// <summary>
    /// Per-user FMOD audio bridge.
    /// </summary>
    private class AudioUser
    {
        private static readonly ConcurrentDictionary<IntPtr, AudioUser> _eventToUserMap = new();
        private static readonly ConcurrentDictionary<IntPtr, AudioUser> _soundToUserMap = new();

        // This is looping audio buffer for each user discord/fmod link
        // We are writing from the discord thread, reading from the fmod thread, and copying what we read into the fmod buffer
        public byte[] AudioBuffer;
        // Discord writes at this buffer byte position
        // Only used by discord thread
        public int WritePosition;
        // FMOD reads at this buffer position
        // Only used by FMOD thread
        public int ReadPosition;

        //How many bytes have been written by discord but not yet read by FMOD
        internal int _availableBytes;
        public int AvailableBytes
        {
            get => Interlocked.CompareExchange(ref _availableBytes, 0, 0);
            set => Interlocked.Exchange(ref _availableBytes, value);
        }

        private EventInstance FMODEventInstance;
        private Sound FMODCustomSound;
        private EVENT_CALLBACK FMODEventCallback;
        private GCHandle FMODEventHandle;
        private SOUND_PCMREAD_CALLBACK PCMReadCallback;
        private GCHandle PCMReadHandle;
        private byte[] CustomSoundDataBuffer;
        private GCHandle CustomSoundDataBufferHandle;

        public AudioUser()
        {
            //Main thread
            //NBG.Core.Log.LogInfo("AudioUser thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

            AudioBuffer = ArrayPool<byte>.Shared.Rent(SoundBufferSizeRounded);
            Array.Clear(AudioBuffer, 0, SoundBufferSizeRounded);

            // Setup FMOD event callback (keep handle alive for GC)
            FMODEventCallback = new EVENT_CALLBACK(VoiceFMODEventCallback);
            FMODEventHandle = GCHandle.Alloc(FMODEventCallback);

            // Settings callbacks on FMOD side for playing sound
            var discordConfig = DiscordConfig.GetOrCreate();
            FMODEventInstance = RuntimeManager.CreateInstance(discordConfig.AudioForVoiP);
            if (!FMODEventInstance.isValid())
            {
                throw new Exception("Discord FMOD event is not valid!");
            }

            _eventToUserMap[FMODEventInstance.handle] = this;
            FMODEventInstance.setCallback(FMODEventCallback, EVENT_CALLBACK_TYPE.CREATE_PROGRAMMER_SOUND);
            //Start sound. it will loop infinitely
            FMODEventInstance.start();
        }

        public void Dispose()
        {
            //Main thread
            //NBG.Core.Log.LogInfo("Dispose thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

            if (AudioBuffer != null)
            {
                ArrayPool<byte>.Shared.Return(AudioBuffer);
                AudioBuffer = null;
            }
            if (FMODEventHandle.IsAllocated)
            {
                FMODEventHandle.Free();
            }
            if (FMODEventInstance.isValid())
            {
                _eventToUserMap.TryRemove(FMODEventInstance.handle, out _);
                FMODEventInstance.stop(FMOD.Studio.STOP_MODE.IMMEDIATE);
                FMODEventInstance.release();
            }
            if (FMODCustomSound.hasHandle())
            {
                FMODCustomSound.release();
            }
            if (PCMReadHandle.IsAllocated)
            {
                PCMReadHandle.Free();
            }
            if (CustomSoundDataBufferHandle.IsAllocated)
            {
                CustomSoundDataBufferHandle.Free();
                _soundToUserMap.TryRemove(FMODCustomSound.handle, out _);
            }
            if (CustomSoundDataBuffer != null)
            {
                ArrayPool<byte>.Shared.Return(CustomSoundDataBuffer);
                CustomSoundDataBuffer = null;
            }
        }

        [AOT.MonoPInvokeCallback(typeof(EVENT_CALLBACK))]
        private static FMOD.RESULT VoiceFMODEventCallback(EVENT_CALLBACK_TYPE type, IntPtr _event, IntPtr parameterPtr)
        {
            //Some random FMOD thread
            //NBG.Core.Log.LogInfo("VoiceFMODEventCallback thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

            // Recover the instance
            if (!_eventToUserMap.TryGetValue(_event, out var self))
            {
                return RESULT.OK;
            }
            return self?.HandleEventCallback(type, _event, parameterPtr) ?? FMOD.RESULT.ERR_INVALID_PARAM;
        }

        private FMOD.RESULT HandleEventCallback(EVENT_CALLBACK_TYPE type, IntPtr _event, IntPtr parameterPtr)
        {
            if (type == EVENT_CALLBACK_TYPE.CREATE_PROGRAMMER_SOUND)
            {
                var parameters = (PROGRAMMER_SOUND_PROPERTIES)Marshal.PtrToStructure(parameterPtr, typeof(PROGRAMMER_SOUND_PROPERTIES));

                //Create those callbacks only once
                if (!FMODCustomSound.hasHandle())
                {
                    CREATESOUNDEXINFO exInfo = new()
                    {
                        cbsize = Marshal.SizeOf(typeof(CREATESOUNDEXINFO)),
                        numchannels = DEFAULT_CHANNEL_COUNT,
                        defaultfrequency = DEFAULT_SAMPLE_RATE,
                        format = SOUND_FORMAT.PCM16,
                        length = (uint)SoundBufferSizeRounded,
                        decodebuffersize = DECODE_BUFFER_SIZE,//Will fire each X size (960 == 20ms at 48kHz)
                    };

                    //Create event and assign handle to prevent GC
                    PCMReadCallback = new SOUND_PCMREAD_CALLBACK(ReadPCMDataCallback);
                    PCMReadHandle = GCHandle.Alloc(PCMReadCallback);
                    exInfo.pcmreadcallback = PCMReadCallback;

                    //Custom data buffer provided for actual sound inside FMOD.
                    CustomSoundDataBuffer = ArrayPool<byte>.Shared.Rent(SoundBufferSizeRounded);
                    CustomSoundDataBufferHandle = GCHandle.Alloc(CustomSoundDataBuffer, GCHandleType.Pinned);

                    var result = RuntimeManager.CoreSystem.createSound(
                        CustomSoundDataBufferHandle.AddrOfPinnedObject(),
                        MODE.OPENUSER | MODE.LOOP_NORMAL | MODE.CREATESTREAM,
                        ref exInfo,
                        out FMODCustomSound
                    );
                    _soundToUserMap[FMODCustomSound.handle] = this;
                    if (result != RESULT.OK)
                    {
                        NBG.Core.Log.LogError("FMOD sound creation failed: " + result);
                    }
                }

                parameters.sound = FMODCustomSound.handle;
                Marshal.StructureToPtr(parameters, parameterPtr, false);
            }

            return FMOD.RESULT.OK;
        }

        [AOT.MonoPInvokeCallback(typeof(SOUND_PCMREAD_CALLBACK))]
        private static RESULT ReadPCMDataCallback(IntPtr soundPtr, IntPtr data, uint dataLength)
        {
            //Some random FMOD thread
            //NBG.Core.Log.LogInfo("ReadPCMDataCallback thread: " + System.Threading.Thread.CurrentThread.ManagedThreadId);

            var sound = new FMOD.Sound(soundPtr);
            if (!_soundToUserMap.TryGetValue(sound.handle, out var self) || self == null)
            {
                return RESULT.OK;
            }

            return self.ReadPCMDataInternal(data, dataLength);
        }

        private RESULT ReadPCMDataInternal(IntPtr data, uint dataLength)
        {
            if (dataLength == 0 || data == IntPtr.Zero)
            {
                NBG.Core.Log.LogWarningFormat("ReadPCMDataInternal, but couldnt proceed. AvailableBytes:{0}. dataLength:{1} dataZero:{2}", AvailableBytes, dataLength, (data == IntPtr.Zero));
                return RESULT.OK;
            }

            // Check how many bytes were written by discord
            int available = AvailableBytes;
            int frameSize = DEFAULT_CHANNEL_COUNT * PCM16_SIZE;

            // If too much delay, skip ahead
            /*int skip = available - MAX_VOICE_DELAY;
            skip = (skip / frameSize) * frameSize; // round down to nearest frame
            if (skip > 0)
            {
                NBG.Core.Log.LogWarning("Too big delay, forcefully jumping to correct delays, but might cause some artifacts.");
                ReadPosition = (ReadPosition + skip) & SoundBufferMask;
                NBG.Core.Asserts.IsTrue(ReadPosition % frameSize == 0);
                Interlocked.Add(ref _availableBytes, -skip);
                available -= skip;
            }*/

            int readNow = (int)Math.Min(available, dataLength);
            readNow = (readNow / frameSize) * frameSize;
            NBG.Core.Asserts.IsTrue(readNow % frameSize == 0);

            unsafe
            {
                fixed (byte* bufferPtr = AudioBuffer)
                {
                    byte* srcPtr = bufferPtr + ReadPosition;
                    byte* dstPtr = (byte*)data;

                    int firstPart = Math.Min(readNow, SoundBufferSizeRounded - ReadPosition);
                    int secondPart = readNow - firstPart;

                    Buffer.MemoryCopy(srcPtr, dstPtr, firstPart, firstPart);

                    if (secondPart > 0)
                    {
                        Buffer.MemoryCopy(bufferPtr, dstPtr + firstPart, secondPart, secondPart);
                    }

                    // Fill remaining buffer with zeros (silence)
                    /*if (readNow < dataLength)
                    {
                        for (int i = readNow; i < dataLength; i++)
                        {
                            dstPtr[i] = 0;
                        }
                    }*/
                }
            }

            // Update read position of discord buffer
            ReadPosition = (ReadPosition + readNow) & SoundBufferMask;
            NBG.Core.Asserts.IsTrue(ReadPosition % frameSize == 0);
            // Reduce available bytes by the amount we are reading this time
            Interlocked.Add(ref _availableBytes, -readNow);

            return RESULT.OK;
        }
    }
}
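
One thing worth noting in the stripped-down code above: the silence fill in ReadPCMDataInternal is commented out, so whenever fewer bytes are available than FMOD asked for, the tail of the callback buffer is left untouched. The sketch below shows a read that always fills the whole destination. It assumes (this is not confirmed anywhere in this thread) that FMOD does not clear the buffer before calling the PCM read callback. The helper name ReadOrSilence and its parameters are hypothetical and are not part of FMOD or of the class above.

```csharp
using System;

// Hypothetical helper: wrap-around read from a ring buffer that pads with
// silence on underrun, so the caller's buffer never keeps stale bytes.
public static class RingRead
{
    // ring:      backing array of the ring buffer
    // ringSize:  logical size of the ring in bytes (may be less than ring.Length)
    // readPos:   current read offset, 0 <= readPos < ringSize
    // available: unread bytes in the ring (non-negative)
    // dst:       destination buffer that must be filled completely
    // frameSize: bytes per frame (channels * bytes per sample)
    // Returns the number of bytes actually consumed from the ring.
    public static int ReadOrSilence(byte[] ring, int ringSize, int readPos,
                                    int available, Span<byte> dst, int frameSize)
    {
        int readNow = Math.Min(available, dst.Length);
        readNow -= readNow % frameSize;               // whole frames only

        int first = Math.Min(readNow, ringSize - readPos);
        ring.AsSpan(readPos, first).CopyTo(dst);                  // up to the end of the ring
        ring.AsSpan(0, readNow - first).CopyTo(dst.Slice(first)); // wrapped remainder

        dst.Slice(readNow).Clear();                   // zero-fill whatever is missing
        return readNow;
    }
}
```

In the callback, the destination span could be created over FMOD's pointer inside an unsafe block with `new Span<byte>((void*)data, (int)dataLength)`. The returned count is what ReadPosition and _availableBytes would then be advanced by.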

I did some experimenting with the values. Without changing much else, the best result so far came from increasing decodebuffersize. From experimentation I picked 4x the Discord update interval, which is 2x the FMOD callback interval. Increasing it further adds unwanted delay, while decreasing it brings back the choppy/sandy artefacts. Honestly, this is more trial and error than a proper fix, which I don't like. I am still keen to hear other opinions.
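
To make the numbers concrete, here is a small sketch of the arithmetic behind that choice. It assumes decodebuffersize is expressed in PCM samples per channel, which matches the "960 == 20ms at 48kHz" comment in the code above. The class and method names are made up for illustration.

```csharp
// Hypothetical helper for converting a duration into buffer sizes.
public static class DecodeBufferMath
{
    // Number of PCM samples per channel covering `ms` milliseconds.
    public static uint SamplesForMs(int sampleRate, int ms)
    {
        return (uint)((long)sampleRate * ms / 1000);
    }

    // Number of bytes of interleaved audio covering `ms` milliseconds.
    public static int BytesForMs(int sampleRate, int channels, int bytesPerSample, int ms)
    {
        return (int)((long)sampleRate * channels * bytesPerSample * ms / 1000);
    }
}
```

With a 10 ms Discord update and a 20 ms callback, 4 x 10 ms = 2 x 20 ms = 40 ms. SamplesForMs(48000, 40) gives 1920 samples, and BytesForMs(48000, 2, 2, 40) gives 7680 bytes of stereo PCM16 per callback, compared with 960 samples and 3840 bytes for the original 20 ms setting.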

Hi,

Thank you for the audio and the code. Could I please grab your FMOD and Unity versions?

While I investigate, here are some forum threads with useful information:
Solving some VOIP issues:

Play back audio with minimal delay:

We also have our Unity Integration | Examples Video Playback page, which outlines how to deal with provider/receiver audio.

Thanks for the support, we uncovered the issue. It was on the Studio side. I had not set the programmer instrument to async loop and so the regular “pop” we heard was the instrument restarting. In hindsight this was extremely obvious!

Thank you for sharing the solution!