architecture – Azure Event Grid API layer beneath an HTTP API layer

I am part of a team that is building our first web app in Azure, having previously built and maintained a traditional three-tier ASP.NET web app over a number of years.

We have an external architect/consultant helping with the transition to Azure, and they are proposing an architecture that is proving somewhat controversial. In very simple terms, the server architecture is:

   HTTP/REST API (HTTPTriggers) -> Event Grid -> Back-end Microservices (EventTriggers)

That is, there is an Event Grid abstraction layer between the externally facing API and the back-end ‘domain’ microservices.

Take the example of a simple HTTP GET of a data record: the HTTPTrigger C# function sends a ‘command’ event to Event Grid and waits for an acknowledgement (ACK) event before sending the HTTP response back to the caller.
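
For the waiting function to recognise *its* ACK among all the ACKs flowing through the grid, the two events have to be correlated somehow. A minimal sketch in Python (for illustration only; the real functions are C#), assuming the ACK simply carries the command's `id` as a correlation ID. The field names loosely follow the Event Grid event schema (`id`, `eventType`, `subject`, `eventTime`, `dataVersion`, `data`); the event type strings, the `correlationId` field and all helper names are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict

Event = Dict[str, Any]


def _now() -> str:
    # ISO 8601 UTC timestamp for the eventTime field
    return datetime.now(timezone.utc).isoformat()


def make_command_event(record_id: str) -> Event:
    """Hypothetical 'get record' command published by the HTTP-triggered function."""
    return {
        "id": str(uuid.uuid4()),  # doubles as the correlation ID
        "eventType": "Records.GetRecord.Command",  # hypothetical type name
        "subject": f"records/{record_id}",
        "eventTime": _now(),
        "dataVersion": "1.0",
        "data": {"recordId": record_id},
    }


def make_ack_event(command: Event, payload: Dict[str, Any]) -> Event:
    """Hypothetical ACK published by the back-end microservice once it has the data."""
    return {
        "id": str(uuid.uuid4()),
        "eventType": "Records.GetRecord.Ack",  # hypothetical type name
        "subject": command["subject"],
        "eventTime": _now(),
        "dataVersion": "1.0",
        "data": {"correlationId": command["id"], "payload": payload},
    }


def is_ack_for(ack: Event, command: Event) -> bool:
    """True if `ack` answers `command`; the waiting caller must make this check."""
    return ack["data"]["correlationId"] == command["id"]
```

Note that the response payload rides inside the ACK's `data`, so for a GET the "acknowledgement" is really the whole response.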

The abstraction layer isn’t hugely controversial per se, although some have questioned the need for it. I think it has some benefits, such as not having to manage lots of microservice URL endpoints: if we add a new back-end microservice, or split or merge existing ones, the REST API layer can (in principle) remain oblivious to these changes. There may also be benefits for redundancy and scaling, although one could argue that Azure Functions and Cosmos DB already cover those aspects without the extra abstraction layer.
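
The decoupling argument can be illustrated with a toy in-process publish/subscribe model (Python, for illustration only). `FakeTopic` is a hypothetical stand-in for an Event Grid topic plus its event subscriptions, not the Azure SDK. The point is only that the API code names an event type and never a microservice URL, so re-pointing a subscription changes which back end answers without touching the API code.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class FakeTopic:
    """Toy stand-in for an Event Grid topic and its subscriptions (hypothetical)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subs[event_type].append(handler)

    def unsubscribe_all(self, event_type: str) -> None:
        self._subs.pop(event_type, None)

    def publish(self, event: Event) -> int:
        """Deliver to every subscriber of the event's type; return the delivery count."""
        handlers = list(self._subs.get(event["eventType"], []))
        for handler in handlers:
            handler(event)
        return len(handlers)


def api_get_record(topic: FakeTopic, record_id: str) -> int:
    # The API layer knows only the event type, not which service will handle it.
    return topic.publish(
        {"eventType": "Records.GetRecord.Command", "data": {"recordId": record_id}}
    )


if __name__ == "__main__":
    seen: List[str] = []
    topic = FakeTopic()
    topic.subscribe("Records.GetRecord.Command", lambda e: seen.append("records-service"))
    api_get_record(topic, "42")
    # Back end reorganised: the records service is merged into another service.
    topic.unsubscribe_all("Records.GetRecord.Command")
    topic.subscribe("Records.GetRecord.Command", lambda e: seen.append("merged-service"))
    api_get_record(topic, "42")
    print(seen)  # ['records-service', 'merged-service']
```

`api_get_record` is identical before and after the reorganisation; only the subscription changed. The same decoupling could arguably be achieved with a routing table or API gateway, which is part of what is being questioned.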

The real concern is that, as I understand it, the HTTPTrigger function has no way of subscribing to Event Grid events during its short lifetime (which could, and should, be sub-second for many API calls) in order to receive the ACK event. Instead, the function sits in a polling loop, calling ‘await Task.Delay()’ on each iteration so that it neither blocks the executing thread nor uses excessive CPU. We also discussed backing off the polling frequency over time, to balance low latency for fast ACKs against the number of polls needed for slower ACKs.
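
A minimal sketch of such a backed-off polling loop, written with Python's asyncio for illustration: `await asyncio.sleep(...)` plays the role of `await Task.Delay(...)`, yielding the worker rather than blocking it. The `read_ack` callable (which would query Redis or the table) and the default delay, cap and timeout values are assumptions, not recommendations.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def poll_for_ack(
    read_ack: Callable[[], Awaitable[Optional[T]]],
    initial_delay: float = 0.01,
    max_delay: float = 0.5,
    backoff: float = 2.0,
    timeout: float = 10.0,
) -> Optional[T]:
    """Call `read_ack` until it returns a value or `timeout` seconds elapse.

    The wait between polls starts at `initial_delay` and is multiplied by
    `backoff` after each miss, capped at `max_delay`. Returns None on timeout.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    delay = initial_delay
    while True:
        result = await read_ack()
        if result is not None:
            return result
        remaining = deadline - loop.time()
        if remaining <= 0:
            return None
        # Yield rather than block, like `await Task.Delay(...)` in C#.
        await asyncio.sleep(min(delay, remaining))
        delay = min(delay * backoff, max_delay)


async def demo() -> None:
    attempts = 0

    async def read_ack() -> Optional[str]:
        nonlocal attempts
        attempts += 1
        return "ACK" if attempts == 3 else None  # pretend the ACK lands on poll 3

    print(await poll_for_ack(read_ack), attempts)  # ACK 3


if __name__ == "__main__":
    asyncio.run(demo())
```

Doubling the delay keeps the first few checks quick for fast ACKs while capping the number of polls for slow ones. The overall timeout matters because an ACK that never arrives (a lost event or a failed microservice) would otherwise hold the HTTP request open indefinitely.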

On each iteration, the polling loop checks an appropriate data store, such as a Redis cache entry or a row in Azure Table storage. Separately, an Azure Function with an EventTrigger has the sole purpose of handling ACK events and writing them to that data store. The response data from the back-end microservice is therefore conveyed via storage, which seems odd for a simple GET request. This use of storage, combined with polling, adds cost and latency, and I think the controversy is largely because we don’t see a clear benefit to offset those costs.
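
The storage hand-off, and where its extra cost comes from, can be simulated in a single Python process. All names here are hypothetical: `InMemoryAckStore` stands in for the Redis cache or Azure Table, `handle_ack_event` plays the ACK-only EventTrigger function, and `simulated_backend` replaces Event Grid plus the microservice with a plain delay. The operation counters make the overhead visible: each GET costs one store write plus one read per poll, on top of the two Event Grid events.

```python
import asyncio
import uuid
from typing import Any, Dict, Optional

Event = Dict[str, Any]


class InMemoryAckStore:
    """Hypothetical stand-in for the Redis entry / table row holding ACK payloads."""

    def __init__(self) -> None:
        self._rows: Dict[str, Any] = {}
        self.writes = 0
        self.reads = 0

    async def put(self, correlation_id: str, payload: Any) -> None:
        self.writes += 1
        self._rows[correlation_id] = payload

    async def take(self, correlation_id: str) -> Optional[Any]:
        # Read-and-delete so completed requests leave no rows behind.
        self.reads += 1
        return self._rows.pop(correlation_id, None)


async def handle_ack_event(store: InMemoryAckStore, event: Event) -> None:
    """Plays the EventTrigger function whose only job is to persist ACKs."""
    data = event["data"]
    await store.put(data["correlationId"], data["payload"])


async def simulated_backend(
    store: InMemoryAckStore, correlation_id: str, record_id: str, latency: float
) -> None:
    """Stands in for Event Grid + microservice: after `latency` s an ACK is handled."""
    await asyncio.sleep(latency)
    ack = {"eventType": "Records.GetRecord.Ack",
           "data": {"correlationId": correlation_id,
                    "payload": {"id": record_id, "name": "example"}}}
    await handle_ack_event(store, ack)


async def http_get_record(
    store: InMemoryAckStore, record_id: str, latency: float = 0.05,
    poll_delay: float = 0.01, max_polls: int = 100,
) -> Optional[Any]:
    """Plays the HTTPTrigger function: publish a command, then poll the store."""
    correlation_id = str(uuid.uuid4())
    backend = asyncio.create_task(
        simulated_backend(store, correlation_id, record_id, latency))
    try:
        for _ in range(max_polls):
            payload = await store.take(correlation_id)
            if payload is not None:
                return payload
            await asyncio.sleep(poll_delay)
        return None  # timed out; a real API would return an error status
    finally:
        await backend


if __name__ == "__main__":
    store = InMemoryAckStore()
    print(asyncio.run(http_get_record(store, "42")))
    print(f"store writes={store.writes}, reads={store.reads}")
```

With a 50 ms back end and 10 ms polls, the single GET here typically needs around half a dozen reads; against a real remote store, each of those is a billable network round trip.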

One thing I was wondering about is how this pattern would work if the Azure Functions were geo-distributed (e.g. if a customer wanted the back end distributed over two or more data centers). Could we configure Azure so that the storage used for the event responses is always local to where the functions are running? I don’t think so, because there are two Azure Functions, the HTTPTrigger and the EventTrigger, and I’m not aware of any way to ensure that the EventTrigger function runs in the same region as the HTTPTrigger function that issued the command; they are two completely independent functions. The state store would therefore need to be geo-distributed in that scenario, and transmitting an HTTP response via the data replication/synchronisation mechanism of Azure Table storage or Redis sounds a little crazy to me.


Thanks for reading!