Pieter Wuille answered this on Twitter.
The most important complication of cross input aggregation is explained in this Bitcoin dev mailing list post by AJ Towns.
TL;DR: if softforks change which signatures are checked, they mustn’t change what is aggregated together. This is especially complicated when they interact with BIP341’s OP_SUCCESSx upgrade mechanism, which could easily let future softforks change script semantics entirely. There is nothing fundamentally hard here – it’s just engineering complexity to make sure everything works well together.
Pieter added at a London BitDevs Socratic Seminar on BIP-Taproot:
Graftroot and cross input aggregation are such deeply conceptual changes. You can’t permit building them later. It is such a structural change to how scripts work. These things are not something that can be just added later on top of Taproot. You need a successor. Cross input aggregation, the concept of script verification is no longer a per input thing but it is a per transaction thing. You can’t do it with optimal efficiency, I guess you can invent things. The type of extensibility that is built in is new opcodes, new types of public keys, new sighash types, all these things are made fairly easy and come with almost no downsides compared to not doing them immediately. Real structural changes to script execution, they need something else.
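To make the TL;DR above more concrete, here is a minimal toy sketch in Python. It is not real consensus code: the opcode names `OP_SUCCESS_X` and `CHECKSIG_AGG`, the `KEY_*` placeholders, and the "redefine as a no-op" softfork are all hypothetical, chosen only to show how an old node and an upgraded node could end up expecting different key sets behind one transaction-wide aggregate signature.

```python
# Toy model only: opcode names and rules here are hypothetical,
# not actual Bitcoin consensus behaviour.
from typing import List, Tuple

Script = List[str]  # e.g. ["OP_SUCCESS_X", "KEY_B", "CHECKSIG_AGG"]


def keys_to_aggregate(script: Script, knows_new_rule: bool) -> List[str]:
    """Return the public keys this input adds to the transaction-wide aggregate."""
    # Old rule (assumed): an OP_SUCCESS-style opcode anywhere in the script
    # makes it succeed unconditionally, so no signature checks are performed.
    if "OP_SUCCESS_X" in script and not knows_new_rule:
        return []
    keys: List[str] = []
    stack: List[str] = []
    for op in script:
        if op == "OP_SUCCESS_X":
            continue  # hypothetical softfork: opcode redefined as a no-op
        elif op == "CHECKSIG_AGG":
            keys.append(stack.pop())  # defer this check to the aggregate
        else:
            stack.append(op)  # everything else is treated as a pushed key
    return keys


def aggregate_set(tx_inputs: List[Script], knows_new_rule: bool) -> Tuple[str, ...]:
    """Per-transaction view: every key one aggregate signature must cover."""
    keys: List[str] = []
    for script in tx_inputs:
        keys.extend(keys_to_aggregate(script, knows_new_rule))
    return tuple(keys)


tx = [
    ["KEY_A", "CHECKSIG_AGG"],
    ["OP_SUCCESS_X", "KEY_B", "CHECKSIG_AGG"],
]

old_view = aggregate_set(tx, knows_new_rule=False)
new_view = aggregate_set(tx, knows_new_rule=True)
print(old_view)              # ('KEY_A',)
print(new_view)              # ('KEY_A', 'KEY_B')
print(old_view == new_view)  # False
```

In this toy model an aggregate signature covering `KEY_A` and `KEY_B` would satisfy the upgraded node but not the old one, which expects an aggregate over `KEY_A` alone. The upgrade would then no longer be a softfork, which is why a rule change that alters which signatures are checked must leave the aggregated set unchanged.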