Tuesday, May 2, 2017

Dear McAfee, Why are you destroying my buffer pool?

For the past few months, one my customers had been trying to diagnose an issue with SQL Server paging out the entire buffer pool each morning around 8AM. See the error below.


Not only was this causing some serious performance issues, but it was also affecting every SQL Server in the environment.

I began my investigation with the standard troubleshooting techniques for buffer pool paging. One of the main workarounds that’s recommended is to enable the “locked pages in memory” permission for the SQL Server service account; however, this was not possible for this customer. Enabling that permission would violate one of their compliance rules, so I needed to find another solution.

The next stop on the investigation trail is the sys.dm_os_ring_buffers DMV. This is a great tool to help determine if the memory pressure came from within SQL Server or from the Windows OS.

WITH RingBuffer AS (
  SELECT
     CAST(dorb.record AS XML) AS xRecord
    ,dorb.timestamp
  FROM sys.dm_os_ring_buffers AS dorb
  WHERE dorb.ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
)
SELECT
   DATEADD(ms, rb.timestamp-dosi.ms_ticks, GETDATE()) AS 'DateTime'
  ,xr.value('(ResourceMonitor/Notification)[1]', 'varchar(75)') AS RmNotification
  ,xr.value('(ResourceMonitor/IndicatorsProcess)[1]','tinyint') AS IndicatorsProcess
  ,xr.value('(ResourceMonitor/IndicatorsSystem)[1]','tinyint') AS IndicatorsSystem
FROM RingBuffer AS rb
CROSS APPLY rb.xRecord.nodes('Record') record (xr)
CROSS JOIN sys.dm_os_sys_info AS dosi
ORDER BY 'DateTime' DESC;
GO


Matching the time of our Errorlog entry, we can see the resource monitor reported RESOURCE_MEMPHYSICAL_LOW with an IndicatorsProcess of 0 and IndicatorsSystem of 2. This helps us determine that SQL Server was not the cause of memory pressure and that the issue was caused by another process running on the server. Since these servers are dedicated to SQL Server, I know there are very few applications installed that should be competing with SQL Server for memory.

The next place to check was the Windows event logs. In the Application Log, I found the buffer pool entry by SQL Server (EventID 17890), and then proceeded to search backwards in time. A few minutes before SQL’s error, I noticed this entry.


What is the McShield service? A quick Bing search revealed that it’s one of the services for McAfee VirusScan Enterprise. Could this be the cause? To get a quick look at all the history, I filtered the application log for event IDs: 17890 and 5000. Each time McAfee got an updated virus DAT file, SQL Server soon followed that by paging out the entire buffer pool. I checked the application log on several other SQL Servers for the same event IDs, and sure enough the same events occurred in tandem each morning. I also got confirmation from the security administration team that McAfee is scheduled to check for new DAT files each morning around 8AM. Eureka!

This seems like it could be the cause of our paging, but a little more research is needed. Searching the McAfee knowledge base, lead me to this article about the “Processes on enable” option.

Enabling this option causes memory pages of running processes to get paged to disk. And the example given is “Oracle, SQL, or other critical applications that need to be memory-resident continually, will have their process address space paged to disk when scan Processes On Enable kicks in”. OUCH! So when the McAfee service starts up or it gets a new DAT file, it will page out all processes.

To verify this was really the cause, I needed to setup a test environment. I installed McAfee VirusScan Enterprise (VSE 8.8 Patch 7) on a Windows 2012 R2 server running SQL Server 2016, and then setup HammerDB to generate some activity against the server.

I configured Performance Monitor to capture the “Process:Working Set” and “Process:Page Faults/sec” for the SQL Server process (sqlservr.exe). Once HammerDB had a little bit of time to get running, I manually issued a DAT file update through the McAfee GUI.


Once completed, I checked the Windows Application Log and I could see event ID 5000 that said McShield service had restarted with the newer DAT version 8506.


About 90 seconds later, SQL Server’s buffer pool was paged out, as evident by event ID 17890.


The McAfee knowledge base article said this would happen when a new DAT file is received or when the McShield server starts up, so I decided to manually stop and restart the McShield service. As expected, I saw the same two event IDs in the application log. Below is the data collected by Performance Monitor for both of those tests.


The red line is the Working Set of SQL Server and the blue line is the Page Faults/sec for SQL Server. The first dip in the red line was from the DAT file update, and the second dip was when I manually restarted the McShield service. Since SQL Server’s buffer pool had just been paged out, it needed to read those pages from disk which results in a page fault.

The next step was to verify McAfee had all the correct exclusions for SQL Server.


Now that we know the exclusions were set properly, we can assume the issue is caused by the "Processes on enable" setting as previously mentioned. Let's turn off that setting and rerun our tests


After giving HammerDB a few minutes to generate some activity, I issue a manual DAT file update through the McAfee GUI. The application log shows the Event ID 5000 with a new DAT version of 8507, but no record of Event ID 17890 and there was nothing in the SQL Errorlog about memory paging.


Next, we'll test the manual restart of the McShield service. And we get the same result; Event ID 5000 but no indication of memory paging. Looking at perfmon, the counters show SQL Server's Working Set and Page Faults were unchanged during the tests.


Once we completed these tests, we took that data back to the security team and asked them to turn off that setting within McAfee. They obliged and so we waited and monitored our SQL Servers for the next few weeks. During that time, we did not have any SQL Server page out memory during the times that McAfee was getting new DAT updates.

After doing a little more research on McAfee VSE, I discovered this setting "Processes on enable" is enabled if you install McAfee with the Maximum Protection level. The Standard Protection level does not have this option enabled.


The key takeaway from this whole experience is that even though a program may be a standard install for all servers doesn't mean it won't impact SQL Server. And even though it may have been fully vetted in the past, a change to its settings could impact SQL Server in the future.

The lab tests were done using McAfee VirusScan Enterprise 8.8 Patch 7. I know this behavior is still the same in VSE Patch 8, so I would assume it’s the same in the other patch levels as well.

Additional resources:
https://kc.mcafee.com/corporate/index?page=content&id=KB76157
https://blogs.msdn.microsoft.com/karthick_pk/2012/06/22/a-significant-part-of-sql-server-process-memory-has-been-paged-out/
https://support.microsoft.com/en-us/help/2028324/how-to-troubleshoot-sql-server-error-message-17890