Cannot fetch a row from OLE DB provider “SQLNCLI11” for linked server

I have two SQL Servers on two different machines somewhere in the cloud (maybe Azure).

One is Microsoft SQL Server 2012 (SP3-CU10) (KB4025925) – 11.0.6607.3 (X64)
Jul 8 2017 16:43:40
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)

On this server there is a link to the second server.

The second server (aae-sqldw-02) is Microsoft SQL Server 2016 (SP1-CU15-GDR) (KB4505221) – 13.0.4604.0 (X64)
Jun 15 2019 07:56:34
Copyright (c) Microsoft Corporation
Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2016 Datacenter 10.0 (Build 14393: ) (Hypervisor)

On the first server we are running a “simple” query:

TRUNCATE TABLE (dbo).(LocalTable)

INSERT INTO (dbo).(LocalTable)
    ((DatabaseName)
    ,(SalesContractNumber)
    ,... 60 columns
    )
SELECT
    convert(varchar(128), DatabaseName) collate Latin1_General_CI_AS
    ,convert(varchar(60), SalesContractNumber) collate Latin1_General_CI_AS
    ,... 60 columns
FROM (aae-sqldw-02).(Fin_DWH).(dbo).(RemoteView)
WHERE DatabaseName = 'somename'

This query sometimes fails with an error:

Cannot fetch a row from OLE DB provider "SQLNCLI11" for linked server "aae-sqldw-02".

or with this error:

Cannot fetch the rowset from OLE DB provider "SQLNCLI11" for linked server "aae-sqldw-02". .

I know that the second server is under a very heavy load most of the day. It literally maxes out its disk IO (255MB/sec). The brute-force solution is to simply move it to a more expensive plan with more IO. This change needs a lot of bureaucracy and will take a long time. Besides, there is no guarantee that the next tier will be enough.

server load

Is there anything I can do with the given resources right now?

When a query completes successfully, it can take between 1-3 hours. The query returns about 3M rows, about 4GB of data, so not too much.

When a query fails with Cannot fetch a row, last few times it failed after 9294 seconds (2.5 hours), 12326 seconds (3.5 hours).

When a query fails with Cannot fetch the rowset, it failed after 606 seconds, 611 seconds.

So, 600 seconds suggest some default 10min timeout (for connection?)
In those cases when connection succeeded, it started fetching the data, but failed in the process. Maybe the linked server could not send the next row fast enough and some other timeout kicked in.

When a query succeeded, it took 3841 seconds last time.

Here are the settings for the linked server:

EXEC master.dbo.sp_addlinkedserver @server = N'aae-sqldw-02', @srvproduct=N'SQL Server'
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'aae-sqldw-02',@useself=N'True',@locallogin=NULL,@rmtuser=NULL,@rmtpassword=NULL
GO

EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'collation compatible', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'data access', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'dist', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'pub', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'rpc', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'rpc out', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'sub', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'connect timeout', @optvalue=N'0'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'collation name', @optvalue=null
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'lazy schema validation', @optvalue=N'false'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'query timeout', @optvalue=N'0'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'use remote collation', @optvalue=N'true'
EXEC master.dbo.sp_serveroption @server=N'aae-sqldw-02', @optname=N'remote proc transaction promotion', @optvalue=N'true'

What do you think, would it make any difference if I explicitly set the query timeout option to something like 5 hours? Can it make things worse?

Obviously, the proper fix would be to look at what is going on at the server and optimize the queries to reduce the overall load, but is there anything I can do at a higher server / database level, so that the query completes, even if it takes a really long time?

We need to run this query once a week and right now we have to retry it several times until it completes successfully.