Nuts and Bolts Server 2016 Optimization Lead image: © Kirsty Pargeter, 123RF.com

Optimizing Windows Server 2016 performance

Torque Booster

With Windows Server 2016, tweaking the settings and taking advantage of performance monitoring tools can help boost your system's performance. By Thomas Joos

Windows Server 2016's default settings might not always meet your network requirements. Depending on its purpose, the server requires different tweaks to unleash its true performance. In this article, I will be optimizing the RAM, CPU, cache, and storage media.

The performance of Windows Server 2016 depends on the underlying hardware. Ideally, you should use the most up-to-date hardware possible for Server 2016 [1] (e.g., 64-bit processors). The cores should run at the highest possible frequency, because a processor with fewer cores but twice the clock speed can be significantly faster than a processor with more cores and a normal clock speed. More cores, therefore, do not automatically mean more performance.

Above all, Hyper-V performance benefits from clock speed, because the hypervisor distributes a server's resources to the virtual machines (VMs), which can share a core. In this case, a higher clock speed is clearly preferable to additional cores that the VM in question does not use at all.

The RAM performance and size, as well as the storage media's I/O performance, must match the processor. Even if obsolete server hardware can be upgraded with new processors, bottlenecks can still occur quickly if the rest of the server's hardware does not match the new processor's performance.

If you are using Hyper-V, the processor must be able to handle Second Level Address Translation (SLAT). The function is integrated into Intel processors in the form of Extended Page Tables (EPT) and into AMD processors as Nested Page Tables (NPT). You can check for support with systeminfo.exe, which lists the function as Second Level Address Translation. SLAT allows the hypervisor to accelerate memory access.
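
The systeminfo.exe check can also be scripted. The following Python sketch is only an illustration: it assumes English-language output containing a line such as "Second Level Address Translation: Yes", and both the function names and the sample excerpt are hypothetical.

```python
import subprocess


def slat_status(systeminfo_text: str) -> str:
    """Return 'yes', 'no', or 'unknown' for SLAT support.

    Assumes English-language systeminfo output with a line such as
    'Second Level Address Translation: Yes'.
    """
    for line in systeminfo_text.splitlines():
        if "second level address translation" in line.lower():
            # The value follows the last colon on the line.
            value = line.rsplit(":", 1)[-1].strip().lower()
            if value in ("yes", "no"):
                return value
    return "unknown"


def read_systeminfo() -> str:
    """Run systeminfo.exe (Windows only) and return its text output."""
    result = subprocess.run(
        ["systeminfo"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical excerpt for illustration; real output is much longer.
SAMPLE = """\
Hyper-V Requirements:      VM Monitor Mode Extensions: Yes
                           Virtualization Enabled In Firmware: Yes
                           Second Level Address Translation: Yes
                           Data Execution Prevention Available: Yes
"""
```

On a Windows machine, slat_status(read_systeminfo()) would evaluate the live output. A result of "unknown" simply means the line was not found, which can happen, for example, with localized output or if the Hyper-V role is already active.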

Measuring in the Right Place

In general, before optimizing, you should use performance monitoring to measure exactly where the server's performance bottlenecks originate and whether they are attributable to the processor or memory (Figure 1). The processor can also become a bottleneck if there is not enough main memory, because swapping out pages is bound to add to the processor load.

Figure 1: Performance monitoring provides very accurate insights into the use of resources in Windows Server 2016.

CPU usage does not pose a problem if it is above 90 percent for a short period of time; however, if it stays at this level for an extended period of time, it does become a problem. In multiprocessor systems, the focus is on the System object performance indicators in performance monitoring. Information from several system components is summarized there.

The Processor Time performance indicator of the Processor object is also of interest. If many different processes are running, fairly even load distribution is important. In a single process, it is important to divide the load into balanced threads. A thread is a process execution unit. If a process uses several threads, they can be executed on different processors. The distribution is based on the utilization of the individual CPUs by the system. A long processor queue means that several threads are ready to run, but the system has not yet assigned them any computing time. As a rule of thumb, the queue length should not exceed 2 too frequently. However, if the CPU usage is relatively low on average, this value plays only a minor role.

A constantly high CPU utilization rate clearly shows that the processor in a server is overloaded. The Windows Server 2016 Performance Monitor shows you the performance indicator Processor: %Processor Time, which is the percentage of time the CPU spends executing threads that are not idle. A constant value of 80 to 90 percent is too high. For multiprocessor systems, you need to monitor a separate instance of this performance counter for each processor. Each instance represents the sum total of processor time for that specific processor.

Additionally, you can monitor the processor via Processor: %Privileged Time, which gives you the percentage of the total time the processor takes to execute Windows kernel commands, such as processing I/O requests. Another important indicator is Processor: %User Time, which returns the percentage of the total time required by the processor to run user processes. System: Processor Queue Length shows the number of threads waiting for CPU time. A processor bottleneck occurs if a process's threads require more processor cycles than are available. If many processes are constantly competing for processor time, you need to install a faster processor.
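
The rules of thumb for the processor counters can be applied to exported samples in a few lines. The following Python sketch is not part of any Microsoft tool: the function name is hypothetical, and the queue_share parameter is an assumption, because the rule of thumb above does not put a number on "too frequently."

```python
def cpu_flags(cpu_percent, queue_lengths,
              cpu_limit=90.0, queue_limit=2, queue_share=0.5):
    """Evaluate counter samples against the rules of thumb in the text.

    cpu_percent   -- samples of Processor: %Processor Time (0-100)
    queue_lengths -- samples of System: Processor Queue Length
    queue_share   -- ASSUMPTION: fraction of samples above queue_limit
                     that counts as "too frequently"
    """
    if not cpu_percent or not queue_lengths:
        raise ValueError("need at least one sample of each counter")

    # Short spikes are harmless; only a load that never drops below the
    # limit during the whole measurement window counts as sustained.
    sustained_high_cpu = min(cpu_percent) >= cpu_limit

    over_limit = sum(1 for q in queue_lengths if q > queue_limit)
    frequent_long_queue = over_limit / len(queue_lengths) > queue_share

    return {
        "sustained_high_cpu": sustained_high_cpu,
        "frequent_long_queue": frequent_long_queue,
        # A long queue matters little while average CPU usage is low,
        # so both conditions must hold.
        "bottleneck_likely": sustained_high_cpu and frequent_long_queue,
    }
```

The sample lists would typically come from a Performance Monitor log exported to CSV. The length of the measurement window decides what "sustained" means, so choose it to match your workload.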

Optimizing RAM and Processor Cache

Microsoft recommends processors with the largest possible L2 and L3 caches. Some CPUs also offer an additional L4 cache as the last-level cache (LLC). Depending on the workload, a larger cache can increase a server's processor performance more than a higher clock speed can. Even with Windows Server 2016, you should install as much RAM as possible in the server. If the amount of memory is not sufficient to run a server application, Windows Server 2016 transfers data from memory to the hard disk. Even if the machine is equipped with an SSD or a flash drive, the amount of memory significantly affects performance.

The best way to monitor memory on servers is to monitor performance during operation. First, pay attention to the value of Memory: Available Bytes, which shows how many bytes of RAM are currently available for use by processes. Low values can indicate that the total amount of memory available on the server is too low or that an application does not free up memory. Second, use Memory: Pages/sec to determine the rate at which pages are read from or written to the disk to resolve hard page faults. A high value can indicate excessive swapping. Also monitor Memory: Page Faults/sec to make sure that the disk activity is not caused by swapping.
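
The two memory indicators are easiest to interpret together. The Python sketch below combines them; because the text gives no concrete limits, the function (a hypothetical name) requires you to pass your own thresholds, and any values you choose are assumptions to be tuned per server.

```python
def memory_pressure(available_bytes, pages_per_sec, *,
                    min_available_bytes, max_pages_per_sec):
    """Classify one sample of Memory: Available Bytes and Memory: Pages/sec.

    Both thresholds are caller-supplied ASSUMPTIONS; the article does not
    specify concrete limits.
    """
    low_memory = available_bytes < min_available_bytes
    heavy_paging = pages_per_sec > max_pages_per_sec

    if low_memory and heavy_paging:
        return "memory shortage likely"
    if heavy_paging:
        return "heavy paging"
    if low_memory:
        return "low available memory"
    return "ok"
```

A single sample says little, so in practice you would run the check across a longer series and look at how often each state occurs.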

Fast Disks and PCIe

Of course, Windows Server 2016 should be able to access the fastest possible storage systems. Microsoft recommends PCI Express (PCIe) interfaces for connecting the server's primary storage, as well as the network adapters. You should use at least PCIe x8 slots and network adapters with 10Gbps or more.

Hard disks in servers should have the highest possible revolutions per minute: The higher the rotational speed, the lower the access times. In general, drives with 15,000rpm are recommended. Here, 2.5-inch enterprise disks often offer shorter access times than their 3.5-inch counterparts. Overall, however, Microsoft recommends the use of SSD or flash storage. NVMe SSDs in particular offer very high performance.

In Windows Server 2016, three storage tiers can be used in Storage Spaces: NVMe, SSD, and HDD. NVMe storage is used for caching data, whereas SSDs and HDDs are used for traditional data storage and archiving. However, you can create different combinations of storage tiers with these three media types. On servers, it makes sense to combine different types of storage to get the best possible performance.

File Server Performance Optimization

To optimize performance, Microsoft recommends some registry adjustments on clients that communicate with a Windows Server 2016 file server via Server Message Block (SMB). The DWORD value ConnectionCountPerNetworkInterface in the HKLM\System\CurrentControlSet\Services\LanmanWorkstation\Parameters path accepts values from 1 to 16. It determines how many SMB connections a client can establish over an adapter that does not support Receive Side Scaling (RSS). RSS [2] distributes the processing of incoming network traffic across multiple processors. Microsoft recommends that you set the value to 1. The setting works on all computers running Windows 8/8.1/10 and Windows Server 2012, 2012 R2, and 2016.

If a client-side adapter supports RSS, the number of SMB connections is controlled by the ConnectionCountPerRssNetworkInterface DWORD value in the same path. Here, Microsoft recommends a value of 4, but again a range of 1 to 16 is available [3]. However, before setting the entry on clients and servers, you should check carefully whether the respective values actually play a role in your environment. You should also use performance measurement and other tools to see whether performance has improved after adjusting the registry entries.
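
If you want to roll out both values in a controlled way, a .reg file that you can review before importing is one option. The following Python sketch generates such a file as text; the function name is hypothetical, the defaults are the recommended values of 1 and 4, and nothing is written to the registry.

```python
KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
       r"\Services\LanmanWorkstation\Parameters")


def smb_connection_reg(non_rss: int = 1, rss: int = 4) -> str:
    """Build .reg file text for the two SMB client connection counts.

    Both values must lie in the documented range of 1 to 16.
    """
    for name, value in (("non_rss", non_rss), ("rss", rss)):
        if not 1 <= value <= 16:
            raise ValueError(f"{name} must be between 1 and 16, got {value}")

    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        # DWORD values are written as eight hexadecimal digits.
        f'"ConnectionCountPerNetworkInterface"=dword:{non_rss:08x}',
        f'"ConnectionCountPerRssNetworkInterface"=dword:{rss:08x}',
        "",
    ]
    return "\r\n".join(lines)
```

Saving the returned text to a file and inspecting it first keeps the change transparent; as noted above, measure before and after to see whether the adjustment actually helps.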

Accelerating SMB Traffic

Windows Server 2016 uses SMB 3.1.1. However, if Windows Server 2016 communicates with older Windows versions or Linux, both sides fall back to the highest SMB version that they have in common, which is usually the version of the older system. With current Intel processors, the new version offers significantly more performance. You can check the currently used SMB version in PowerShell with Get-SmbConnection on the clients and Get-SmbSession | FL on servers. Various registry values can also be adjusted to improve SMB performance [4].
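
The fallback to an older SMB version can be pictured as choosing the highest dialect that both sides support. The following Python sketch is a simplified model of that idea, not the actual wire protocol; the function names are hypothetical and the dialect lists are illustrative assumptions.

```python
def dialect_key(dialect: str) -> tuple:
    """Turn '3.1.1' into (3, 1, 1) so dialects compare numerically."""
    return tuple(int(part) for part in dialect.split("."))


def negotiate(client_dialects, server_dialects):
    """Return the highest dialect both sides support, or None."""
    common = set(client_dialects) & set(server_dialects)
    if not common:
        return None
    return max(common, key=dialect_key)


# Illustrative lists; the dialects a real system offers depend on its
# version and configuration.
SERVER_2016 = ["2.0.2", "2.1", "3.0", "3.0.2", "3.1.1"]
OLDER_CLIENT = ["2.0.2", "2.1"]
```

With these lists, the connection would use the older client's highest dialect, which is why a single outdated system is enough to lose the benefits of the newer version on that connection.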

SMB Direct is enabled by default between servers running Windows Server 2016. To use this function, the installed network adapters must support Remote Direct Memory Access (RDMA). Overloaded servers can thus be accelerated by transferring data to underutilized servers. For this to work, the network must be extremely fast, and the adapters must be of the iWARP, InfiniBand, or RDMA over Converged Ethernet (RoCE) type. Hyper-V and MS SQL Server benefit most from this technology.

Optimizing the Web Server

The performance of Windows Server 2016 web servers can be improved significantly just with the use of on-board resources. For example, compression shortens a server's response time and saves bandwidth when transferring web pages. You can control compression using the Compression feature in Internet Information Services (IIS) Manager (Figure 2). You can specify the size of files to compress and how much storage space is available to each application pool, including the web pages and applications it contains. The cache directory space is also defined here.

Figure 2: File compression in IIS Manager relieves the network, but burdens the CPU.

Some settings are only available at the server level; others are made at the website and application levels so that each application uses its own compression settings. Enabling compression burdens the server hardware, but network performance increases. Whether or not these measures will increase performance depends on whether the server or the wire is the bottleneck. Because the wire is most likely responsible for slow transmission, static content compression is already activated by default in IIS 10.

If you have already compressed static content, such as a page or a file, this does not burden the server again, because the file simply comes out of the compression cache the next time it is requested. If you enable compression for dynamic content, each response has to be compressed again for every transmission, which saves bandwidth but costs CPU power. To compress dynamic content, you first need to install the corresponding role service.
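
The difference in CPU cost between the two compression types is easy to see in a small model. The following Python sketch is not how IIS is implemented; it uses a hypothetical class with an in-memory dictionary in place of the compression cache directory and simply counts how often the compressor runs.

```python
import gzip


class CompressionSketch:
    """Toy model: static content is compressed once, dynamic every time."""

    def __init__(self):
        self._cache = {}          # path -> compressed bytes
        self.compress_calls = 0   # how often the CPU had to compress

    def _compress(self, body: bytes) -> bytes:
        self.compress_calls += 1
        return gzip.compress(body)

    def static(self, path: str, body: bytes) -> bytes:
        # The first request compresses; later requests are cache hits.
        if path not in self._cache:
            self._cache[path] = self._compress(body)
        return self._cache[path]

    def dynamic(self, body: bytes) -> bytes:
        # Dynamic responses can differ per request, so nothing is reused.
        return self._compress(body)
```

Requesting the same static page twice triggers one compression run, whereas two dynamic responses trigger two. This is the trade-off to weigh when the CPU rather than the network is the limiting factor.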

Parts of web pages can be made available in the web server's cache, so that retrieving these parts does not burden the server. You can manage this function with the output caching feature in IIS Manager. The basic settings are accessible with the Edit Feature Settings option from the context menu or in the Actions pane. In the settings, activate the function and set a limit. However, the cache is only useful if you have defined rules that determine which data the server should cache (Figure 3).

Figure 3: Temporary output buffering retains parts of a web page to speed up the page build-up.

You can also manage kernel caching here. Requests to the cache are not executed in the server's user mode, but in the kernel itself. The applications are therefore not burdened. IIS decides how much memory it provides. Only increase the limit if you discover that your server is not yet fully utilized, but be very careful, because it can quickly have the opposite effect.

From the context menu, you can create new rules for the cache. First, determine which files are to be cached and then how long the data should remain in the cache. Define either temporary storage until the file is changed or a time interval. You can also generally prevent caching of some file types at this point.
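
How such rules behave can be sketched with a small time-based cache. The Python model below is an illustration only, not the IIS implementation: the class name is hypothetical, rules are keyed by file extension, and only the time interval variant is covered (not invalidation when a file changes).

```python
import time


class OutputCacheSketch:
    """Toy output cache with per-extension rules.

    rules maps a file extension to the number of seconds a response is
    kept; None (or a missing extension) means "do not cache".
    """

    def __init__(self, rules, clock=time.monotonic):
        self.rules = rules
        self.clock = clock        # injectable so the sketch can be tested
        self._entries = {}        # path -> (stored_at, body)

    def get(self, path, render):
        ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
        ttl = self.rules.get(ext)
        if ttl is None:
            return render()       # no rule or caching prevented

        now = self.clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < ttl:
            return entry[1]       # still fresh: serve from the cache

        body = render()           # expired or missing: rebuild and store
        self._entries[path] = (now, body)
        return body
```

With a rule such as {".aspx": 30, ".php": None}, an .aspx page would be rendered at most once every 30 seconds, whereas .php requests would always reach the application.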

Swap File Optimization on Terminal Servers

To improve the performance of remote desktop session hosts, first move the swap file to another physical hard disk on the server so that write access to the swap file is not slowed down by other access to the system disk. If a second physical hard disk is not available, moving the file doesn't make any sense, because relocating it to a partition on the same disk will not have a positive effect.

Additionally, you should set the swap file's size to approximately 2.5 times the actual memory size to minimize file fragmentation. The swap file settings can be found under Control Panel | System and Security | System | Advanced system settings | Performance | Settings | Advanced | Virtual memory | Change. Clear the Automatically manage paging file size for all drives checkbox, and check the Custom size option.
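
The Custom size option expects values in megabytes, so the factor of 2.5 has to be converted first. The following Python sketch does the arithmetic; the function name is hypothetical.

```python
def paging_file_mb(ram_gb: float, factor: float = 2.5) -> int:
    """Return the swap file size in MB for a given amount of RAM in GB."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    # 1 GB = 1024 MB; the factor of 2.5 comes from the recommendation above.
    return round(ram_gb * 1024 * factor)
```

For a server with 16GB of RAM, this yields 40,960MB. Entering the same number as both the initial and the maximum size is a common way to keep the file from growing in fragments, although that step is an assumption on my part and not part of the recommendation itself.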

AD Performance Optimization

If you identify performance problems, such as slow mailbox access or messaging, in Exchange or other server-based services that depend on Active Directory (AD), the cause often lies in AD or DNS. Therefore, parallel to performance monitoring, you should also diagnose name resolution and the domain controller (e.g., with DCDiag). Exchange and many services that use Active Directory access it through the wldap32.dll system file.

A very large part of server performance depends on the network speed between Exchange Server and the global catalog or domain controller. For this reason, you should always measure your network's speed if you encounter performance problems with the Exchange infrastructure.

The speed to the DNS server and fast, stable, and correct name resolution are also very important for AD. The response time to the DNS server must not exceed 50ms if you want to optimize performance. If the request takes longer, you already have the first bottleneck in Exchange performance. Pinging the server is sufficient; you don't even need performance monitoring.

The MSExchange ADAccess Processes indicator group in performance monitoring is important for Exchange and AD connectivity and is added to the server by the Exchange installation wizard. The two indicators LDAP Read Time and LDAP Search Time are of interest in this group. LDAP Read Time measures the time it takes for the data from an LDAP query to be transferred, whereas LDAP Search Time indicates the time it takes for the server to perform an LDAP search against AD. The average value for these indicators should be less than 50ms; the maximum duration should not exceed 100ms.
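
The two limits can be checked automatically against a series of exported counter samples. The following Python sketch assumes the samples are given in milliseconds; the function name is hypothetical.

```python
from statistics import mean


def ldap_latency_ok(samples_ms, avg_limit_ms=50.0, max_limit_ms=100.0):
    """Check LDAP Read Time or LDAP Search Time samples (in ms).

    The average must stay below avg_limit_ms, and no single sample may
    exceed max_limit_ms.
    """
    if not samples_ms:
        raise ValueError("need at least one sample")
    return mean(samples_ms) < avg_limit_ms and max(samples_ms) <= max_limit_ms
```

Run the check separately for each of the two indicators. A series that fails only because of a single outlier points to a different problem than one with a consistently high average.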

For AD-dependent services to retrieve data quickly from AD, the global catalog must respond quickly and must not be overloaded. To check its utilization, you can also use the Performance Monitor (Figure 4). Click on Data Collector Sets | System | Active Directory Diagnostics and then on the green triangle in the toolbar to start the collection set. After some time, you can end the measurement using the context menu of the collection set or the toolbar. Afterward, you can access the data of the last performed measurement from Reports | System | Active Directory Diagnostics.

Figure 4: Because AD performance can have a massive impact on Exchange performance, monitoring is important.

Conclusions

To get the most out of Windows Server 2016, your network's purpose is key to determining which tuning measures to use. Depending on how you use the operating system, you can tweak a variety of settings to solve performance problems. In most cases, however, these decisions should be based on detailed measurements with performance monitoring tools. Once the weak points have been identified, the measures presented here should ensure a performance boost.