I have a reporting table (about 1bn rows), and a tiny dimension table:
CREATE TABLE dbo.Sales_unpartitioned ( BusinessUnit int NOT NULL, (Date) date NOT NULL, SKU varchar(8) NOT NULL, Quantity numeric(10, 2) NOT NULL, Amount numeric(10, 2) NOT NULL, CONSTRAINT PK_Sales_unpartitioned PRIMARY KEY CLUSTERED (BusinessUnit, (Date), SKU) ); --- Demo data: INSERT INTO dbo.Sales_unpartitioned SELECT severity AS BusinessUnit, DATEADD(day, message_id, '2000-01-01') AS (Date), LEFT((text), 3) AS SKU, 1000.*RAND(CHECKSUM(NEWID())) AS Quantity, 10000.*RAND(CHECKSUM(NEWID())) AS Amount FROM sys.messages WHERE (language_id)=1033; --- Artificially inflate statistics of demo data: UPDATE STATISTICS dbo.Sales_unpartitioned WITH ROWCOUNT=1000000000; --- Dimension table: CREATE TABLE dbo.BusinessUnits ( BusinessUnit int NOT NULL, SalesManager nvarchar(250) NULL, PRIMARY KEY CLUSTERED (BusinessUnit) ); INSERT INTO dbo.BusinessUnits (BusinessUnit) SELECT DISTINCT BusinessUnit FROM dbo.Sales;
… to which I’ve added a reporting view used by an application for OLTP-style reporting.
CREATE OR ALTER VIEW dbo.SalesReport_unpartitioned AS SELECT bu.BusinessUnit, s.(Date), s.SKU, s.Quantity, s.Amount FROM dbo.BusinessUnits AS bu CROSS APPLY ( --- Regular sales SELECT t.BusinessUnit, t.(Date), t.SKU, t.Quantity, t.Amount FROM dbo.Sales_unpartitioned AS t WHERE t.BusinessUnit=bu.BusinessUnit AND t.SKU LIKE 'T%' UNION ALL --- This is a special reporting entry. We only --- want to see today's row. In case of duplicates, --- get the row with the first "SKU". SELECT TOP (1) s.BusinessUnit, s.(Date), s.SKU, s.Quantity, s.Amount FROM dbo.Sales_unpartitioned AS s WHERE s.BusinessUnit=bu.BusinessUnit AND s.(Date)=CAST(SYSDATETIME() AS date) AND s.SKU LIKE 'S%' ORDER BY s.BusinessUnit, s.(Date), s.SKU ) AS s
The idea is that the user application will query this view with a SELECT query that filters on a range of dates and one or more BusinessUnits. For this purpose, I’ve chosen a
CROSS APPLY pattern, so that the query can “loop” over each BusinessUnit, seek to a range of Date, and apply a residual filter on SKU.
Example app query:
DECLARE @from date='2021-01-01', @to date='2021-12-31'; SELECT * FROM dbo.SalesReport_unpartitioned WHERE BusinessUnit=16 AND (Date) BETWEEN @from AND @to ORDER BY BusinessUnit, (Date), SKU;
I would expect a query plan that looks like this:
However, the plan turns out like this:
I expected SQL Server to do a “predicate pushdown” on the Date column, allowing the Clustered Index Seek to look for a single BusinessUnit and a range of Date, then apply a residual predicate on SKU. This works on the Seek in the “s” branch (the one with
TOP) – probably because it has a hard-coded Date predicate in the query – but not on the “t” branch.
However, on the “t” branch SQL Server only seeks to the specific BusinessUnit with a residual predicate on SKU, effectively retrieving all dates. Only at the end of the plan does it applies a Filter operator that filters on the Date column.
In a large table, this has a very significant performance penalty – you could end up reading 20 years of data from disk when all you’re looking for is a week.
Things I’ve tried
- Converting the view to an inline table valued function with @fromDate and @toDate parameters that filter the “s” and “t” queries will enable a Seek on (BusinessUnit, Date) as desired, but requires rewriting the app code.
- Moving the
UNION ALLout of the
CROSS APPLY (UNION)to
CROSS APPLY() UNION CROSS APPLY()) will enable predicate pushdown. It makes one more seek on the BusinessUnit table, which is perfectly acceptable.
Fixes the Seek, but changes the results:
- Surprisingly, removing the
ORDER BYfor the “s” query makes predicate pushdown work on “t”, but can give return too many rows from “s”.
UNION ALLby either removing the “s” or “t” query will enable predicate pushdown, but generate incorrect results.
No change or not feasible:
TOP (1)with a
ROW_NUMBER()pattern does not change the Seek.
- Changing the
CROSS APPLYto a forced
INNER LOOP JOINfixes the Seek on “t”, but actually changes “s” to a Scan instead, which is even worse.
- Adding trace flag 8780 to allow the optimizer to work on a plan for longer does not change anything. The plan is already optimized FULL with no early termination.
A common thread seems to be that changing/simplifying the “s” query (removing
ORDER BY) fixes the problem on the “t” query, which feels counter-intuitive to me.
What I’m looking for
I’m trying to understand if this is a shortcoming of the optimizer, if it’s the result of a deliberate costing/optimization mechanism, or if I’ve simply overlooked something.