unity – How to synchronize reads and writes to a StructuredBuffer across warps in HLSL?

I’m currently trying to implement smooth voxel terrain (using dual contouring) with mesh generation happening on the GPU, and I’m struggling with creating chunked levels of detail. My approach is as follows:

  1. Always create each chunk’s vertices at the highest level of detail.
  2. Merge the vertices of the current level of detail (by averaging) to create the next lower level of detail (similar to merging leaves in an octree).
  3. Repeat until the desired level of detail is reached.

For this purpose I wrote the following kernel, which should do exactly that:

uint cellStride;

RWStructuredBuffer<Vertex> generatedVertices;
RWStructuredBuffer<uint> generatedVertexIndicesLookupTable;

[numthreads(4, 4, 4)]
void DecrementLevelOfDetail(uint3 id : SV_DispatchThreadID)
{
    uint3 cellID = id * cellStride + 1;

    uint numberOfChildVertices = 0;
    uint parentVertexIndex = NULL_VERTEX_INDEX;
    Vertex parentVertex = Vertex::Create();

    for (uint cornerIndex = 0; cornerIndex < numberOfCellCorners; cornerIndex++)
    {
        uint3 coordinate = cellID + cellStride / 2 * cellCorners[cornerIndex];
        uint childVertexIndex = generatedVertexIndicesLookupTable[CalculateGeneratedVertexIndicesLookupTableIndex(coordinate, cellStride / 2)];

        if (childVertexIndex != NULL_VERTEX_INDEX)
        {
            Vertex childVertex = generatedVertices[childVertexIndex];
            parentVertex.position += childVertex.position;
            parentVertex.normal += childVertex.normal;
            numberOfChildVertices++;
        }
    }

    // I need a sync point here.
    
    if (any(cellID > numberOfVoxels - 3))
    {
        return;
    }

    if (numberOfChildVertices > 0)
    {
        parentVertexIndex = generatedVertices.IncrementCounter();
        parentVertex.position /= numberOfChildVertices;
        parentVertex.normal = normalize(parentVertex.normal);
        generatedVertices[parentVertexIndex] = parentVertex;
    }
    generatedVertexIndicesLookupTable[CalculateGeneratedVertexIndicesLookupTableIndex(cellID, cellStride)] = parentVertexIndex;
}

The generatedVertices buffer contains the vertices to be merged and the generatedVertexIndicesLookupTable buffer acts as a lookup table that – given a cellID – returns the associated index of the vertex in the generatedVertices buffer. Both these buffers were populated by another kernel previously.
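To make the relationship between the two buffers concrete, here is a minimal sketch of such a lookup. The names `FlattenCellID`, `TryLoadVertex` and `numberOfCellsPerAxis`, the `Vertex` layout, the value of `NULL_VERTEX_INDEX`, and the row-major table layout are all assumptions for illustration; the actual index helpers are not shown in this post.

```hlsl
// Hypothetical helpers: names, constants and table layout are assumptions.
#define NULL_VERTEX_INDEX 0xFFFFFFFFu

struct Vertex
{
    float3 position;
    float3 normal;
};

uint3 numberOfCellsPerAxis; // assumed to be set from the host side

RWStructuredBuffer<Vertex> generatedVertices;
RWStructuredBuffer<uint> generatedVertexIndicesLookupTable;

// Maps a 3D cell coordinate to a slot in the flat lookup table (x fastest).
uint FlattenCellID(uint3 cellID)
{
    return cellID.x + numberOfCellsPerAxis.x * (cellID.y + numberOfCellsPerAxis.y * cellID.z);
}

// Returns true and fills 'vertex' if the cell has an associated vertex.
bool TryLoadVertex(uint3 cellID, out Vertex vertex)
{
    vertex = (Vertex)0;

    uint vertexIndex = generatedVertexIndicesLookupTable[FlattenCellID(cellID)];
    if (vertexIndex == NULL_VERTEX_INDEX)
    {
        return false; // no vertex was generated for this cell
    }

    vertex = generatedVertices[vertexIndex];
    return true;
}
```

Under this layout, every read of a vertex goes through two buffers, so a write to either one while other threads are still reading can produce a mismatched pair.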

To populate these buffers, the kernel goes over all cells of a chunk’s voxel volume and generates a vertex where necessary. If a vertex is created, I do something along these lines to store it:

[numthreads(4, 4, 4)]
void GenerateVertices(uint3 cellID: SV_DispatchThreadID)
{
    Vertex vertex;

    // vertex is being created...

    uint vertexIndex = generatedVertices.IncrementCounter();
    generatedVertexIndicesLookupTable[CalculateIndex(cellID)] = vertexIndex;
    generatedVertices[vertexIndex] = vertex;
}

I dispatch the DecrementLevelOfDetail kernel in a loop, once per level of detail, as follows:

for (int cellStride = 2; cellStride <= (1 << lod); cellStride <<= 1)
{
    m_generatedVerticesBuffer.SetCounterValue(0);
    m_parent.m_voxelConfigs.DualContouringConfig.Compute.SetInt(ComputeShaderProperties.CellStride, cellStride);
    m_parent.m_voxelConfigs.DualContouringConfig.Compute.Dispatch(1, m_parent.m_voxelConfigs.VoxelVolumeConfig.NumberOfCells / cellStride);
}

In order for this to work I need all threads of all warps associated with a specific dispatch call to the DecrementLevelOfDetail kernel to read from the generatedVertices buffer before writing to it at the end of the kernel. Is there a way to achieve that?
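For context, the HLSL barrier intrinsics (for example `GroupMemoryBarrierWithGroupSync` and `DeviceMemoryBarrierWithGroupSync`) only synchronize the threads of a single thread group, so they cannot express "all groups of this dispatch have finished reading". One way to sidestep the requirement, sketched below purely as an assumption, is to read from one pair of buffers and write to another, so that no thread ever reads what another thread of the same dispatch writes. The buffer names `sourceVertices`, `sourceLookupTable`, `destinationVertices` and `destinationLookupTable`, the constant `numberOfCellsPerAxis`, the helper `FlattenCellID`, the corner order and the value of `NULL_VERTEX_INDEX` are hypothetical and do not come from the code above.

```hlsl
// Sketch only: buffer names, constants and helpers are hypothetical.
#define NULL_VERTEX_INDEX 0xFFFFFFFFu
#define NUMBER_OF_CELL_CORNERS 8

struct Vertex
{
    float3 position;
    float3 normal;
};

uint cellStride;
uint3 numberOfCellsPerAxis; // assumed to be set from the host side

// Output of the previous pass: only read in this pass.
StructuredBuffer<Vertex> sourceVertices;
StructuredBuffer<uint> sourceLookupTable;

// Output of this pass: only written in this pass.
RWStructuredBuffer<Vertex> destinationVertices;
RWStructuredBuffer<uint> destinationLookupTable;

static const uint3 cellCorners[NUMBER_OF_CELL_CORNERS] =
{
    uint3(0, 0, 0), uint3(1, 0, 0), uint3(0, 1, 0), uint3(1, 1, 0),
    uint3(0, 0, 1), uint3(1, 0, 1), uint3(0, 1, 1), uint3(1, 1, 1)
};

// Assumed row-major table layout (x fastest).
uint FlattenCellID(uint3 cellID)
{
    return cellID.x + numberOfCellsPerAxis.x * (cellID.y + numberOfCellsPerAxis.y * cellID.z);
}

[numthreads(4, 4, 4)]
void DecrementLevelOfDetailPingPong(uint3 id : SV_DispatchThreadID)
{
    uint3 cellID = id * cellStride + 1;
    if (any(cellID >= numberOfCellsPerAxis))
    {
        return;
    }

    float3 positionSum = float3(0, 0, 0);
    float3 normalSum = float3(0, 0, 0);
    uint numberOfChildVertices = 0;

    // Gather the child vertices from the previous level (reads only).
    for (uint cornerIndex = 0; cornerIndex < NUMBER_OF_CELL_CORNERS; cornerIndex++)
    {
        uint3 coordinate = cellID + (cellStride / 2) * cellCorners[cornerIndex];
        if (any(coordinate >= numberOfCellsPerAxis))
        {
            continue;
        }

        uint childVertexIndex = sourceLookupTable[FlattenCellID(coordinate)];
        if (childVertexIndex != NULL_VERTEX_INDEX)
        {
            Vertex childVertex = sourceVertices[childVertexIndex];
            positionSum += childVertex.position;
            normalSum += childVertex.normal;
            numberOfChildVertices++;
        }
    }

    // Emit the merged parent vertex into the other buffer pair (writes only).
    uint parentVertexIndex = NULL_VERTEX_INDEX;
    if (numberOfChildVertices > 0)
    {
        Vertex parentVertex;
        parentVertex.position = positionSum / numberOfChildVertices;
        parentVertex.normal = normalize(normalSum);

        parentVertexIndex = destinationVertices.IncrementCounter();
        destinationVertices[parentVertexIndex] = parentVertex;
    }
    destinationLookupTable[FlattenCellID(cellID)] = parentVertexIndex;
}
```

With this variant the host side would have to swap the source and destination buffers between iterations of the dispatch loop and reset the counter of the new destination buffer, which is not shown here. It trades extra memory for not needing any synchronization inside a dispatch, and it leaves lookup-table entries that a pass does not write in whatever state they were in before.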