May 02 2017
 

For the past few months, one of my customers had been trying to diagnose an issue with SQL Server paging out the entire buffer pool each morning around 8AM. See the error below.

Not only was this causing some serious performance issues, but it was also affecting every SQL Server in the environment.

I began my investigation with the standard troubleshooting techniques for buffer pool paging. One of the main workarounds that’s recommended is to enable the “locked pages in memory” permission for the SQL Server service account; however, this was not possible for this customer. Enabling that permission would violate one of their compliance rules, so I needed to find another solution.
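Even though locked pages in memory was off the table here, it’s worth knowing how to confirm whether it is actually in effect on an instance. On newer builds (SQL Server 2016 SP1 and later, plus some later service packs of older versions), sys.dm_os_sys_info exposes the memory model in use; a quick check:

-- CONVENTIONAL means locked pages in memory is not being used;
-- LOCK_PAGES means the buffer pool is using locked memory allocations.
SELECT sql_memory_model_desc
FROM sys.dm_os_sys_info;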

The next stop on the investigation trail is the sys.dm_os_ring_buffers DMV. This is a great tool to help determine whether the memory pressure came from within SQL Server or from the Windows OS.

-- Pull the resource monitor notifications from the ring buffers and convert
-- the timestamp (milliseconds since server start) into a wall-clock date/time.
WITH RingBuffer AS (
  SELECT
     CAST(dorb.record AS XML) AS xRecord
    ,dorb.timestamp
  FROM sys.dm_os_ring_buffers AS dorb
  WHERE dorb.ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
)
SELECT
   DATEADD(ms, rb.timestamp - dosi.ms_ticks, GETDATE()) AS [DateTime]
  ,xr.value('(ResourceMonitor/Notification)[1]', 'varchar(75)') AS RmNotification
  ,xr.value('(ResourceMonitor/IndicatorsProcess)[1]', 'tinyint') AS IndicatorsProcess
  ,xr.value('(ResourceMonitor/IndicatorsSystem)[1]', 'tinyint') AS IndicatorsSystem
FROM RingBuffer AS rb
CROSS APPLY rb.xRecord.nodes('Record') record (xr)
CROSS JOIN sys.dm_os_sys_info AS dosi
ORDER BY [DateTime] DESC;

Matching the time of our Errorlog entry, we can see the resource monitor reported RESOURCE_MEMPHYSICAL_LOW with an IndicatorsProcess of 0 and IndicatorsSystem of 2. This tells us that SQL Server was not the source of the memory pressure and that the issue was caused by another process running on the server. Since these servers are dedicated to SQL Server, I knew there were very few applications installed that should be competing with SQL Server for memory.
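To line the ring buffer output up against the errorlog entry, you can also search the errorlog directly from T-SQL. Here’s a minimal sketch using xp_readerrorlog (parameters: log number, 1 = SQL Server errorlog, search string) to find the page-out message in the current log:

-- Search the current errorlog (log 0) for the working set page-out message.
EXEC sys.xp_readerrorlog 0, 1, N'paged out';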

The next place to check was the Windows event logs. In the Application Log, I found the buffer pool entry from SQL Server (Event ID 17890) and then searched backwards in time. A few minutes before SQL Server’s error, I noticed this entry.

What is the McShield service? A quick Bing search revealed that it’s one of the services for McAfee VirusScan Enterprise. Could this be the cause? To get a quick look at the history, I filtered the application log for event IDs 17890 and 5000. Each time McAfee got an updated virus DAT file, SQL Server soon followed by paging out the entire buffer pool. I checked the application log on several other SQL Servers for the same event IDs, and sure enough the same events occurred in tandem each morning. I also got confirmation from the security administration team that McAfee is scheduled to check for new DAT files each morning around 8AM. Eureka!

This seems like it could be the cause of our paging, but a little more research is needed. Searching the McAfee knowledge base led me to this article about the “Processes on enable” option.

Enabling this option causes the memory pages of running processes to get paged to disk. The example given is “Oracle, SQL, or other critical applications that need to be memory-resident continually, will have their process address space paged to disk when scan Processes On Enable kicks in”. OUCH! So whenever the McAfee service starts up or receives a new DAT file, it will page out all running processes.

To verify this was really the cause, I needed to set up a test environment. I installed McAfee VirusScan Enterprise (VSE 8.8 Patch 7) on a Windows 2012 R2 server running SQL Server 2016, and then set up HammerDB to generate some activity against the server.

I configured Performance Monitor to capture the “Process:Working Set” and “Process:Page Faults/sec” for the SQL Server process (sqlservr.exe). Once HammerDB had a little bit of time to get running, I manually issued a DAT file update through the McAfee GUI.
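Alongside Performance Monitor, you can keep an eye on the same thing from inside SQL Server. A quick sketch against sys.dm_os_process_memory, which exposes the physical memory in use and the cumulative page fault count for the SQL Server process:

-- Physical memory currently in use by the process, cumulative page faults,
-- and whether SQL Server is signaling low memory on its own.
SELECT
   physical_memory_in_use_kb / 1024 AS physical_memory_in_use_mb
  ,page_fault_count
  ,process_physical_memory_low
  ,process_virtual_memory_low
FROM sys.dm_os_process_memory;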

Once it completed, I checked the Windows Application Log and could see event ID 5000, which said the McShield service had restarted with the newer DAT version 8506.

About 90 seconds later, SQL Server’s buffer pool was paged out, as evidenced by event ID 17890.

The McAfee knowledge base article said this would happen when a new DAT file is received or when the McShield service starts up, so I decided to manually stop and restart the McShield service. As expected, I saw the same two event IDs in the application log. Below is the data collected by Performance Monitor for both of those tests.

The red line is the Working Set of SQL Server and the blue line is the Page Faults/sec for SQL Server. The first dip in the red line was from the DAT file update, and the second dip was when I manually restarted the McShield service. Since SQL Server’s buffer pool had just been paged out, it needed to read those pages back from disk, which results in page faults.

The next step was to verify McAfee had all the correct exclusions for SQL Server.

Now that we know the exclusions were set properly, we can assume the issue is caused by the “Processes on enable” setting mentioned earlier. Let’s turn off that setting and rerun our tests.

After giving HammerDB a few minutes to generate some activity, I issued a manual DAT file update through the McAfee GUI. The application log showed Event ID 5000 with a new DAT version of 8507, but there was no record of Event ID 17890 and nothing in the SQL Errorlog about memory paging.

Next, we’ll test a manual restart of the McShield service. And we get the same result: Event ID 5000 but no indication of memory paging. Looking at Perfmon, the counters show SQL Server’s Working Set and Page Faults were unchanged during the tests.

Once we completed these tests, we took that data back to the security team and asked them to turn off that setting within McAfee. They obliged, and so we waited and monitored our SQL Servers for the next few weeks. During that time, none of the SQL Servers paged out memory while McAfee was picking up new DAT updates.

After doing a little more research on McAfee VSE, I discovered this setting “Processes on enable” is enabled if you install McAfee with the Maximum Protection level. The Standard Protection level does not have this option enabled.

The key takeaway from this whole experience is that just because a program is a standard install on every server doesn’t mean it won’t impact SQL Server. And even though it may have been fully vetted in the past, a change to its settings can still impact SQL Server in the future.

The lab tests were done using McAfee VirusScan Enterprise 8.8 Patch 7. I know this behavior is still the same in VSE Patch 8, so I would assume it’s the same in the other patch levels as well.

Additional resources:
https://kc.mcafee.com/corporate/index?page=content&id=KB76157
https://blogs.msdn.microsoft.com/karthick_pk/2012/06/22/a-significant-part-of-sql-server-process-memory-has-been-paged-out/
https://support.microsoft.com/en-us/help/2028324/how-to-troubleshoot-sql-server-error-message-17890

Jul 07 2015
 

Recently, I ran into a critical error while I was helping a customer troubleshoot an issue in SQL Server. That may not sound like a big deal, but we were installing Cumulative Update 6 for SQL Server 2012 SP2 to fix our initial problem when we encountered the following error.

What gives, SQL Server? We just upgraded several other instances on this same server. Why does this one have to fail right now? Well, a quick peek in the errorlog sure pointed me in the right direction.

Starting execution of SSIS_HOTFIX_INSTALL.SQL
:
:
Error: 9002, Severity: 17, State: 4.
The transaction log for database ‘SSISDB’ is full due to ‘ACTIVE_TRANSACTION’.

Error: 912, Severity: 21, State: 2.
Script level upgrade for database ‘master’ failed because upgrade step ‘SSIS_hotfix_install.sql’ encountered error 3602, state 251, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.

Error: 3417, Severity: 21, State: 3.
Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

SQL Server shutdown has been initiated.

This is definitely not good. Here we are trying to apply CU6 to fix one problem, and we get another one in return. Needless to say, the CU6 upgrade failed, and SQL Server was dead in the water. Restarting the SQL Server service would just rerun the internal upgrade script and return the same error message.

It’s easy to see the root cause of the problem; the transaction log for the SSISDB database is full. But if we can’t start the service, how will we ever get the error fixed? Lucky for us, Microsoft has a trace flag for that. Trace flag 902 will bypass the internal upgrade script on startup. Just open SQL Server Configuration Manager, add -T902 to the Startup Parameters, save the changes, and then start the service.
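Once the service comes up under the trace flag, it’s worth a quick sanity check that it actually took effect; for example:

-- Verify trace flag 902 is enabled globally (Status = 1, Global = 1).
DBCC TRACESTATUS (902, -1);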

SQL Server started up just fine and fully recovered all databases, including SSISDB. However, there was a warning about 1 transaction being rolled back for SSISDB.

1 transactions rolled back in database ‘SSISDB’ (5:0). This is an informational message only. No user action is required.

This is most likely from the SSIS_HOTFIX_INSTALL.SQL script that failed during the upgrade. Next, we need to check the configuration and the file sizes of the SSISDB database. Sure enough, the log file is still at its original size of 1MB with autogrowth disabled.
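A quick way to confirm this (and to stage the autogrowth change described next) is plain T-SQL against the instance running under T902. The logical file name in the ALTER statement below is an assumption; use whatever name the first query returns for the log file:

-- Check the logical names, current sizes, and growth settings of SSISDB's files.
SELECT name, type_desc, size * 8 / 1024 AS size_mb, growth, is_percent_growth
FROM sys.master_files
WHERE database_id = DB_ID('SSISDB');

-- Enable autogrowth on the log file (logical name assumed; substitute the name returned above).
ALTER DATABASE SSISDB
MODIFY FILE (NAME = N'log', FILEGROWTH = 64MB);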

For now, we can enable autogrowth for the log file and then save the changes. Next, we’ll need to remove the -T902 trace flag from the Startup Parameters, and then restart the SQL Server service. Finally, SQL Server was able to start up without any errors. In the errorlog we can see the internal upgrade script ran successfully.

Starting execution of SSIS_HOTFIX_INSTALL.SQL
Start applying SSIS_HOTFIX_INSTALL changes.
:
:
Schema build in SSISDB has been updated to 11.0.5592.0
Execution of SSIS_HOTFIX_INSTALL.SQL completed

At this point, we need to rerun the CU6 setup so it can verify everything was correctly installed.

It shows the current patch level at 11.2.5592.0, which is correct, but the status is “Incompletely installed”. Just click Next to continue with the upgrade and clear out the previous install failures.
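Once setup finishes, you can also confirm the patch level from T-SQL. Note that SERVERPROPERTY reports the build in the 11.0.5592.0 form rather than the 11.2 form shown in the setup dialog:

-- Confirm the build number and service pack level after the patch.
SELECT
   SERVERPROPERTY('ProductVersion') AS ProductVersion  -- e.g. 11.0.5592.0
  ,SERVERPROPERTY('ProductLevel')   AS ProductLevel;   -- e.g. SP2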

So what is the lesson learned from this error? Should we leave autogrowth enabled on all databases before running a patch? Not necessarily, but it is an option. I think a better option is to keep a mental note of Trace Flag 902. With that trace flag, we were able to get SQL Server online long enough to correct the root problem, and we didn’t have to worry about changing any settings on a database.

References: