Nuts and Bolts Samba Troubleshooting

Samba pitfalls in daily operation

Skillful Guide

Once you install Samba, you can still encounter errors, problems, and inconsistencies that plague your file server. We address some of these issues. By Stefan Kania

Samba has been the link in heterogeneous networks with Windows and Unix for many years. Once you have mastered the installation, though, everyday operation can present some potential pitfalls, to which you can only really respond correctly if you have prepared for the system errors. After discussing how to use a Samba file server securely in heterogeneous environments in a previous article [1], I will now look at problems that can occur during daily operations.

Replication Errors

At some point, you might notice that users you create on domain controller A are not listed on domain controller B. However, a user that you create on domain controller B is displayed on domain controller A. You then try to perform the replication manually and get the error message in Listing 1.

Listing 1: Error During Replication

root@addc-01:/tmp# samba-tool drs replicate addc-02 addc-01 dc=example,dc=net
ERROR(<class 'samba.drs_utils.drsException'>): DsReplicaSync failed - drsException: DsReplicaSync failed (58, 'WERR_BAD_NET_RESP')
   File "/usr/lib/python2.7/dist-packages/samba/netcmd/drs.py", line 389, in run
      drs_utils.sendDsReplicaSync(server_bind, server_bind_handle, source_dsa_guid, NC, req_options)
   File "/usr/lib/python2.7/dist-packages/samba/drs_utils.py", line 87, in sendDsReplicaSync
      raise drsException("DsReplicaSync failed %s" % estr)

No matter from which of the two domain controllers (DCs) you try to replicate, the same error message is always thrown. However, if you reverse the directions, the replication suddenly works. It looks as if the database can no longer grow, because users are no longer fully displayed. You can see that both DCs are responding and communicating correctly during replication because the samba-tool drs showrepl command does not output an error.

In this case, you need to check whether you have enough free space on the hard drive. If a DC runs out of space, the database cannot grow; therefore, no new objects can be created. Make sure you have enough free disk space, restart the Samba service, and test replication again. Replication should now work fully once again, and all of your users should be present on all of your DCs.

To get a better grip on the problem, it might be useful to create a separate partition for the directory where your databases are located. You should also always monitor your system's "fill level," preferably with one of the many free monitoring tools.
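
If no monitoring tool is in place yet, a small script run from cron can serve as a stopgap. The following is only a minimal sketch: the function names and the 90 percent limit are illustrative choices, and /var/lib/samba is assumed to be the directory that holds the databases on your distribution.

```shell
#!/bin/sh
# Print the fill level (percent used, without "%") of the filesystem holding $1.
fill_level() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 if the fill level of $1 is below the limit $2 (percent),
# 1 if it has reached the limit, 2 if it could not be determined.
check_fill_level() {
    used=$(fill_level "$1")
    case "$used" in
        ''|*[!0-9]*) echo "UNKNOWN: cannot read fill level of $1"; return 2 ;;
    esac
    if [ "$used" -ge "$2" ]; then
        echo "WARNING: $1 is ${used}% full (limit ${2}%)"
        return 1
    fi
    echo "OK: $1 is ${used}% full"
}

# Example call (path and limit are assumptions):
#   check_fill_level /var/lib/samba 90
```

The check looks at the whole filesystem, which is one more reason to give the database directory its own partition.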

Authorization Problems with ACLs

After creating a new group policy object (GPO), you need to test the permissions for the entries in the SYSVOL share, for which you might then see the output in Listing 2. However, this is not an error, because you need to authenticate to query the permissions of the GPOs. Without authentication, you won't see a correct result. A new attempt with authentication gives you the output from Listing 3.

Listing 2: Missing ACL Query Rights

root@addc-01:~# samba-tool gpo aclcheck
ERROR(runtime): uncaught exception - (3221225506, '{Access Denied} A process has requested access to an object but has not been granted those access rights.')
   File "/usr/lib/python2.7/dist-packages/samba/netcmd/__init__.py", line 176, in _run
      return self.run(*args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/samba/netcmd/gpo.py", line 1148, in run
      fs_sd = conn.get_acl(sharepath, security.SECINFO_OWNER | security.SECINFO_GROUP | security.SECINFO_DACL, security.SEC_FLAG_MAXIMUM_ALLOWED)

Listing 3: ACL Error Message

root@addc-01:~# samba-tool gpo aclcheck -U administrator
Password for [EXAMPLE\administrator]:
ERROR: Invalid GPO ACL O:DAG:DAD:PAI(A;OICI;0x001f01ff;;;DA)(A;OICI;0x001f01ff;;;EA)(A;OICIIO;0x001f01ff;;;CO)(A;OICI;0x001f01ff;;;DA)(A;OICI;0x001f01ff;;;SY)(A;OICI;0x001200a9;;;ED)(A;OICI;0x001200a9;;;S-1-5-21-1129951053-411964844-750776748-1105)(A;OICI;0x001200a9;;;DC) on path (example.net\Policies\{B30A27B8-8221-42B7-BA9F-BC6D2E9D7227}), should be O:DAG:DAD:PAR(A;OICI;0x001f01ff;;;DA)(A;OICI;0x001f01ff;;;EA)(A;OICIIO;0x001f01ff;;;CO)(A;OICI;0x001f01ff;;;DA)(A;OICI;0x001f01ff;;;SY)(A;OICI;0x001200a9;;;ED)(A;OICI;0x001200a9;;;S-1-5-21-1129951053-411964844-750776748-1105)(A;OICI;0x001200a9;;;DC)

A test with the command

samba-tool ntacl sysvolcheck

leads to the same error message, which refers to an error in the authorization of the GPO listed here. Also, identifying the GPO by its ID and not its name does not make things any easier. The underlying problem is described here in detail. A specific ACL, O:DAG:DAD:PAR, is expected, but the setting is O:DAG:DAD:PAI. The only apparent difference is that PAI is set, but PAR is expected.

To understand what this means, you need to break down the message a bit further. O:DA names the owner, here the Domain Admins group. G:DA specifies the owning group, again the Domain Admins. The discretionary access control list (DACL) then follows, introduced by its flags: D:PAI, in this case. The translation is: protect the ACL against inheritance from the parent (P) and mark it as automatically inherited (AI). The AR flag, which requests automatic inheritance and for this purpose essentially means the same as AI, is expected; the check here determines whether the filesystem also supports automatic propagation of ACLs, which all current filesystems do under Linux.
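
When the two ACL strings are as long as in Listing 3, spotting the difference by eye is tedious. The following sketch pulls the DACL flags out of such a string and checks whether the access control entries are otherwise identical. The function names are made up for this example, and the pattern assumes the O:...G:...D:... layout shown in Listing 3.

```shell
#!/bin/sh
# Print the DACL flags of an SDDL string: the letters between "D:" and the
# first "(" (for example PAI or PAR).
dacl_flags() {
    printf '%s\n' "$1" | sed -n 's/^O:[^:]*G:[^:]*D:\([A-Z]*\)(.*$/\1/p'
}

# Print only the ACE list, that is, everything from the first "(" onward.
dacl_aces() {
    printf '%s\n' "$1" | sed 's/^[^(]*//'
}

# Return 0 if two SDDL strings carry identical ACE lists.
same_aces() {
    [ "$(dacl_aces "$1")" = "$(dacl_aces "$2")" ]
}

# Example with shortened strings:
#   dacl_flags 'O:DAG:DAD:PAI(A;OICI;0x001f01ff;;;DA)'
#   same_aces "$current_acl" "$expected_acl" && echo "only the flags differ"
```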

The warning is not a problem. The bug exists up to version 4.8. Although the message is ugly, you do not need to worry. If you want to fix the warning, and thus the permissions, you can do so with the

samba-tool ntacl sysvolreset

command, but whenever you edit a GPO, the warning will simply reappear.

Time Differences on the DCs

If you determine that users are not replicated to one of the DCs when they are created, you need to test replication on the DC that did not receive the new objects:

# samba-tool drs replicate addc-02 addc-01 dc=example,dc=net
Replicate from addc-01 to addc-02 was successful.

You might then notice that the replication seems to work, but the object is still not listed. In the next step, you need to test replication on the DC on which you created the object that failed to replicate. If you see the error message from Listing 4, you have already taken one big step toward fixing the problem.

Listing 4: Time Difference

root@addc-01:~# samba-tool drs replicate addc-02 addc-01 dc=example,dc=net
Failed to bind - LDAP client internal error: NT_STATUS_TIME_DIFFERENCE_AT_DC
Failed to connect to 'ldap://addc-02' with backend 'ldap': LDAP client internal error: NT_STATUS_TIME_DIFFERENCE_AT_DC
ERROR(ldb): LDAP connection to addc-02 failed - LDAP client internal error: NT_STATUS_TIME_DIFFERENCE_AT_DC

As you can see from the message, the time on the DC that did not receive the object no longer seems to be correct. Check the time, reset the time, and find out why the DC no longer displays the correct time. The reasons could be:

The time server running on your network no longer works or is not accessible.
You are synchronizing the time directly with a time server on the Internet, but your firewall has been reconfigured and the port for the NTP service has been blocked.
The systemd-timesyncd.service is no longer running on the server. In this case, restart the service and check whether it failed because of an error on your DC.

After correcting the error, replication should work as usual again.

Synchronized time is so important because all Active Directory services are secured by Kerberos. When two Kerberos-protected systems communicate, the time difference must not exceed five minutes (the default tolerance); otherwise, communication will fail.
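
To see how far apart two DCs really are, you can compare their clocks in seconds. The function below is a minimal sketch of the five-minute rule; its name is made up for this example, and the call shown in the comment assumes that you can reach the second DC over SSH.

```shell
#!/bin/sh
# Return 0 if two Unix timestamps ($1 and $2, in seconds) differ by at most
# $3 seconds; return 1 otherwise.
skew_ok() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    [ "$diff" -le "$3" ]
}

# Example call on addc-01 (SSH access to addc-02 is an assumption);
# 300 seconds corresponds to the five minutes mentioned above:
#   skew_ok "$(date +%s)" "$(ssh addc-02 date +%s)" 300 || echo "clock skew too large"
```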

Only the local time is required when creating a new object. The time between the DCs only becomes relevant during replication. Replication will not work here, but why was the replication on the problematic DC successful? Easy: It has no changes, so it doesn't want to transfer anything; therefore, it doesn't get to the point where time plays a role.

File Server Problems

Even when setting up a file server, you'll find pitfalls that keep a file server from starting, or at least providing its services. At this point, I'll look at what can happen when setting up a file server. I will not go into the installation of the packages; this explanation is only about the service configuration. The distribution you are using to set up the file server does not matter.

To begin, you should always complete the basic configuration of the file server first; that is, you configure only the global section and execute a domain join. Only after you have done this should you configure the shares. After you have prepared smb.conf, start the first attempt to integrate the server into the domain. You might see the message in Listing 5.

Listing 5: Join Error

root@fs-01:~# net ads join -U administrator
Enter administrator's password:
Failed to join domain: failed to lookup DC info for domain 'EXAMPLE' over rpc: {Operation Failed} The requested operation was unsuccessful.

Checking the Name Server

The error message in Listing 5 indicates that the DC could not be found. First, test whether you can resolve the DC name and whether you can ping the DC. In this case, the name cannot be resolved. A look at the /etc/resolv.conf file shows that no DC is registered as the name server. Make sure at least one DC is set up as the name server in the configuration.

It is better to enter two DCs as name servers; in this case, the second DC can take over the role of the name server if the first DC fails or has to be taken off the network for a short time. File servers in particular should always have at least two name servers configured, so they still work even if one name server fails.
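
A quick way to verify this on every file server is to count the nameserver entries. The following is a minimal sketch with made-up function names; it only counts the entries and does not test whether the listed servers are actually DCs.

```shell
#!/bin/sh
# Count the "nameserver" entries in a resolv.conf-style file.
count_nameservers() {
    awk '$1 == "nameserver" { n++ } END { print n + 0 }' "$1"
}

# Warn if fewer than two name servers are configured in the file $1.
check_nameservers() {
    n=$(count_nameservers "$1")
    if [ "$n" -lt 2 ]; then
        echo "WARNING: only $n name server(s) in $1"
        return 1
    fi
    echo "OK: $n name servers in $1"
}

# Example call:
#   check_nameservers /etc/resolv.conf
```

If the count is correct but the join still fails, a lookup such as host -t SRV _ldap._tcp.example.net shows whether the configured name servers really answer for the domain.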

If you already have entered the IP address of at least one DC as the name server but still receive the error message from Listing 6 on trying to join, you need to check the /etc/hosts file to see whether it contains the correct hostname with the correct IP address. The FQDN is correct if the command hostname -f returns the expected value. After modifying the entry in /etc/hosts, you can again try to join the file server to the domain. If you get the error message from Listing 7 now, the error is not caused by the file server; rather, the DNS server has problems with dynamic updates.
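
The /etc/hosts check can be sketched as a small lookup that prints the IP address assigned to a name. The function name is made up for this example, and the file is assumed to follow the usual layout of an IP address followed by the names.

```shell
#!/bin/sh
# Print the IP address that a hosts-format file ($1) assigns to the name $2.
# Prints nothing if the name does not occur in the file.
hosts_ip_for() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments before looking at the fields
        { for (i = 2; i <= NF; i++)
              if ($i == name) { print $1; exit } }
    ' "$1"
}

# Example call: the output should be the real address of the file server,
# not a loopback address:
#   hosts_ip_for /etc/hosts "$(hostname -f)"
```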

Listing 6: Wrong Hostname

root@fs-01:~# net ads join -U administrator
Enter administrator's password:
Using short domain name -- EXAMPLE
Joined 'FS-01' to dns domain 'example.net'
No DNS domain configured for fs-01. Unable to perform DNS Update.
DNS update failed: NT_STATUS_INVALID_PARAMETER

Listing 7: Problem on the DNS Server

root@fs-01:~# net ads join -U administrator
Enter administrator's password:
Using short domain name -- EXAMPLE
Joined 'FS-01' to dns domain 'example.net'
DNS Update for fs-01.example.net failed: ERROR_DNS_UPDATE_FAILED
DNS update failed: NT_STATUS_UNSUCCESSFUL

To see whether dynamic updates are working, test the DCs:

samba_dnsupdate --verbose --all-names

The output lists all entries on the DNS server. Listing 8 shows an excerpt; the error message at the end is the important part. Here, updating the DNS entries does not work. To correct this error, proceed as described in Listing 9.

Listing 8: List of Name Servers

root@addc-01:~# samba_dnsupdate --verbose --all-names
IPs: ['192.168.56.66']
force update: A addc-01.example.net 192.168.56.66
force update: NS example.net addc-01.example.net
force update: NS _msdcs.example.net addc-01.example.net
force update: A example.net 192.168.56.66
...
update failed: NOTAUTH
Failed nsupdate: 2
Failed update of 29 entries
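
Because the verbose output is long, it can help to filter it down to the failure lines. The following sketch does this; the function name is made up, and the pattern is derived only from the lines visible in Listing 8, so other Samba versions might word their messages differently.

```shell
#!/bin/sh
# Read samba_dnsupdate output on stdin. Print the failure lines and return 1
# if any are found; print nothing and return 0 otherwise.
dnsupdate_failures() {
    if grep -E '^(update failed:|Failed )'; then
        return 1
    fi
    return 0
}

# Example call on a DC:
#   samba_dnsupdate --verbose --all-names 2>&1 | dnsupdate_failures
```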

Listing 9: Fixing DNS Errors

root@addc-01:~# samba_upgradedns --dns-backend=BIND9_DLZ
Reading domain information
DNS accounts already exist
No zone file /var/lib/samba/bind-dns/dns/EXAMPLE.NET.zone
DNS records will be automatically created
DNS partitions already exist
dns-addc-01 account already exists
...
root@addc-01:~# systemctl restart bind9

The command

samba_dnsupdate --verbose --all-names

should now run without errors. Check dynamic updates on all your DCs and fix the error on the other DCs, if necessary. If the test command still returns an update failed: NOTAUTH error, something is wrong with the authentication of BIND9 via Kerberos in Active Directory. Check whether you have entered the following line correctly in the /etc/bind/named.conf.options file:

tkey-gssapi-keytab "/var/lib/samba/private/dns.keytab";

If the entry exists, the BIND9 user might still not be able to read the keytab file. Checking the authorizations and setting the appropriate permissions should also fix the error.
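
The permission check can be sketched as follows. The example assumes GNU stat and that the BIND9 user reaches the keytab through the file's group; the function name is made up, and the account that BIND9 runs under depends on your distribution.

```shell
#!/bin/sh
# Return 0 if the group of file $1 has read permission, 1 if it does not,
# 2 if the file mode cannot be determined (GNU stat assumed).
group_can_read() {
    mode=$(stat -c '%a' "$1") || return 2
    # The leading 0 makes the shell treat the mode as octal; 040 is the
    # group read bit.
    [ $(( 0$mode & 040 )) -ne 0 ]
}

# Example call with the keytab path from the named.conf.options line above:
#   group_can_read /var/lib/samba/private/dns.keytab || echo "group cannot read keytab"
```

The directories above the keytab must be accessible to the BIND9 user as well, so check those if the file permissions already look correct.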

Now you can start the join on the file server again. Only continue your tests once the output of the command looks like Listing 10.

Listing 10: Successful Join

root@fs-01:~# net ads join -U administrator
Enter administrator's password:
Using short domain name -- EXAMPLE
Joined 'FS-01' to dns domain 'example.net'
root@fs-01:~# net ads testjoin
Join is OK

User Mapping Problems

After modifying the /etc/nsswitch.conf file to use winbind, you might notice that although the user can be resolved with

wbinfo -n <AD-User>

the mapping to the user identifier (UID) does not seem to work (Listing 11).

Listing 11: Wrong User Mapping

root@fs-01:~# wbinfo -n test-u1
S-1-5-21-831035265-3946242641-4171447920-1408 SID_USER (1)
root@fs-01:~# wbinfo -i test-u1
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user test-u1
root@fs-01:~# getent passwd test-u1
_

Unfortunately, this error message is misleading. The domain is there; you can prove this by resolving the user with wbinfo -n test-u1, and wbinfo -p is also successful. The Winbind service is thus running, and the domain is reachable. The problem can therefore only be in the ID mapping settings in the smb.conf file. The settings for the ID mapping in the file are:

idmap config * : range = 10000 - 19999
idmap config EXAMPLE : backend = rid
idmap config EXAMPLE : range = 1000 - 1999

You can see here that the range for the EXAMPLE domain is too small. The UID is calculated from the relative identifier (RID) of the user: The test-u1 user has the RID 1408 (the last block of the SID in Listing 11); adding the first value of the range (i.e., 1000) results in a value of 2408. This value is above the upper limit of the range, 1999.

Therefore, the user can no longer be mapped. If you specify the range, be sure to select an upper value that is large enough. Especially if you have migrated from Samba 3 with OpenLDAP, the RID can have a value greater than 100,000. After adjusting the range, stop the Winbind service, run the net cache flush command, and restart the service.

Another error that can occur when configuring ID mapping is that the second specified value is smaller than the first value. Then, the user may be displayed with:

wbinfo -i <AD-User>

However, the UID does not correspond to the user's RID. Therefore, always check that the UID shown for a user equals the RID plus the first value of the range.
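
The mapping arithmetic and both range mistakes can be sketched in a few lines. The function name is made up for this example, and the range of 1000000 to 1999999 in the second call is only an illustration; pick one that fits your RIDs and does not overlap the range of the default domain.

```shell
#!/bin/sh
# Map a RID to a Unix ID as described above: ID = RID + first value of the
# range. Arguments: RID, first value, last value.
# Prints the ID and returns 0 if it fits into the range; returns 1 if it
# does not fit, 2 if the first value is not smaller than the last value.
rid_to_uid() {
    rid=$1 low=$2 high=$3
    if [ "$low" -ge "$high" ]; then
        echo "invalid range: $low - $high" >&2
        return 2
    fi
    uid=$(( rid + low ))
    if [ "$uid" -gt "$high" ]; then
        echo "RID $rid maps to $uid, outside $low - $high" >&2
        return 1
    fi
    echo "$uid"
}

# Example calls:
#   rid_to_uid 1408 1000 1999          # fails: 2408 is above 1999
#   rid_to_uid 1408 1000000 1999999    # prints 1001408
```

Comparing the printed value with the UID that wbinfo -i reports for the same user shows whether the mapping behaves as expected.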

Bottom Line

After completing a Samba installation, you can still encounter some pitfalls that prevent successful operation. In this article, I looked at typical problems with replication, ACLs, time synchronization, joining a file server, and user mapping and showed how to fix them. Congratulations, your Samba file server is now a full member of the domain.