Thursday, October 29, 2015

Active Directory Issues - Network Drives Not Mapping

A customer of mine raised an issue regarding network drives not being mapped for users.  This includes drives mapped via Group Policy and home drives mapped via the NT4 Home Drive option on the Active Directory user account.

When users attempted to navigate to the UNC paths manually or map a drive manually, it worked as expected; however, mapping network drives automatically at logon simply did not work.

Users were also unable to navigate to the domain root "\\domain.local"; however, they could navigate to "\\domain.local\netlogon" and "\\domain.local\sysvol".  Navigating to the domain root resulted in the following error:

\\domain.local is not accessible.  You might not have permissions to use this network resource.  Contact the administrator of this server to find out if you have access permissions.

Logon Failure: The target account name is incorrect.


This issue with not being able to navigate to the domain root UNC share occurred on all member workstations and servers throughout the organisation.  Domain Controllers were not affected by the issue.

The following three events were also found throughout the System event log on all client workstations across the company's domain.

Log Name:      System
Source:        NETLOGON
Date:          26/10/2015 7:56:15 AM
Event ID:      5719
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      COMPUTER.domain.local
Description:
This computer was not able to set up a secure session with a domain controller in domain DOMAIN due to the following: 
There are currently no logon servers available to service the logon request. 
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.  

Log Name:      System
Source:        Microsoft-Windows-Security-Kerberos
Date:          23/10/2015 4:02:46 PM
Event ID:      4
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      COMPUTER.domain.local
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server candc1$. The target name used was cifs/domain.local. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (DOMAIN.LOCAL) is different from the client domain (DOMAIN.LOCAL), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Log Name:      System
Source:        Microsoft-Windows-GroupPolicy
Date:          28/10/2015 6:18:58 PM
Event ID:      1006
Task Category: None
Level:         Error
Keywords:      
User:          DOMAIN\UserAccount
Computer:      COMPUTER.domain.local
Description:
The processing of Group Policy failed. Windows could not authenticate to the Active Directory service on a domain controller. (LDAP Bind function call failed). Look in the details tab for error code and description.

Some computers were also receiving:

The processing of Group Policy failed. Windows attempted to read the file \\domain.local\sysvol\domain.local\Policies\{31B2F340-016D-11D2-945F-00C04FB984F9}\gpt.ini from a domain controller and was not successful. Group Policy settings may not be applied until this event is resolved. This issue may be transient and could be caused by one or more of the following: 
a) Name Resolution/Network Connectivity to the current domain controller. 
b) File Replication Service Latency (a file created on another domain controller has not replicated to the current domain controller). 
c) The Distributed File System (DFS) client has been disabled.

This GUID references the Default Domain Policy (the first policy in the domain).  There was nothing wrong with the Default Domain Policy on the customer's network; policy was simply not applying to domain members due to an issue in Active Directory.


One issue I focused on was the Kerberos error "KRB_AP_ERR_MODIFIED".  I have seen this issue before when the same SPN was registered on at least two accounts.  For example, if an SPN is registered on two accounts, A and B, the KDC may generate a service ticket encrypted with the password of account A; when the client presents that ticket to the service during authentication, the service may try to decrypt it with the password of account B and fail.

I searched the customer's domain for duplicate SPNs using "setspn.exe -X" and found some, however they were not related to the error received.  I had also never seen the "KRB_AP_ERR_MODIFIED" error generated on EVERY domain member workstation/server in an Active Directory environment.

Looking further into the SPNs, we decided to dump the entire domain, with every object and attribute, to a text file using:

ldifde -f out.txt -d dc=domain,dc=local

Generally, SPNs on a user account reference a member server on the network running a particular service, such as SQL Server, under that service account.  However, as this issue was affecting the entire domain and KRB_AP_ERR_MODIFIED points to duplicate SPN records, we looked to see if there were any SPNs set at the domain level on an account by searching our output from ldifde for "host/domain.local".
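
To narrow the search down, the export can be scanned from PowerShell (a minimal sketch; out.txt is the file produced by ldifde above, and the context lines may help reveal which object the SPN sits on):

Select-String -Path .\out.txt -Pattern "host/domain.local" -Context 10,0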

We found an SPN set to the root of the domain in the search results.  Important content in the output below is blurred to protect the privacy of the customer.


We removed the incorrectly configured SPN from the problematic service account svc_adfs using Active Directory Users and Computers with Advanced Features enabled, then forced replication with "repadmin /syncall /APeD".
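
For reference, the same SPN could also have been removed from the command line with setspn (a sketch, assuming the offending value was host/domain.local registered on the svc_adfs account):

setspn -D host/domain.local svc_adfs
setspn -L svc_adfs

The -L switch simply lists the SPNs remaining on the account so you can confirm the removal before forcing replication.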


After removing the incorrectly configured SPN, we purged the Kerberos tickets on a workstation and then attempted to open Explorer at the root of the domain, "\\domain.local".  We were able to navigate to this share successfully.
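
On the test workstation, the cached tickets can be flushed with klist before re-testing (the second command clears the computer account's tickets in the SYSTEM logon session):

klist purge
klist purge -li 0x3e7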


The issues with network drives not mapping on logon were also resolved.

The svc_adfs account was created as part of an AD FS deployment for federation with applications and Microsoft cloud services.  The AD FS backend role was installed on two corporate domain controllers, and two proxy servers were deployed in a DMZ to process authentication requests from external services.  This is the Microsoft best practice for corporate organisations under 1000 seats, as it reduces the number of servers required and provides redundancy by leveraging NLB on both the backend and frontend AD FS servers, as per:

https://msdn.microsoft.com/en-us/library/azure/dn151324.aspx

I went over the build documentation of the engineer who was in charge of implementing AD FS and could not see how the SPN was set; he did not set it manually.

Hope this post helps someone who experiences this same issue.

Monday, October 26, 2015

Enable Firewall Logging on Windows

Are your packets being dropped by Windows Firewall?  Want insight into what is going on?  Simply open Local Group Policy on a workstation / server (gpedit.msc) or configure a GPO in Group Policy Management Console (GPMC).  Under Windows Firewall with Advanced Security, go to the general properties, select the profile --> Logging, and enable logging for that profile.  By default the log file is written to:

%windir%\system32\logfiles\firewall\pfirewall.log
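
If you prefer the command line over Group Policy, the same logging can be switched on with netsh (a quick sketch; adjust the profiles and maximum log size to suit):

netsh advfirewall set allprofiles logging droppedconnections enable
netsh advfirewall set allprofiles logging allowedconnections enable
netsh advfirewall set allprofiles logging maxfilesize 8192
netsh advfirewall set allprofiles logging filename "%windir%\system32\logfiles\firewall\pfirewall.log"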


Very handy for troubleshooting.

Exchange 2013 POP3 Proxy Inactive

A customer complained that POP3 was no longer working.  After looking into this, it turned out that the PopProxy component was Inactive on the Exchange 2013 server.  Why it was inactive is unknown.
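
The component state can be confirmed from the Exchange Management Shell (a quick check, using the server name referenced below):

Get-ServerComponentState -Identity AB-EXCH-01 -Component PopProxy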


Starting the PopProxy component was challenging; generally you change the ServerComponentState using the Maintenance requester for most components.  However, running the following command did nothing:

Set-ServerComponentState -State Active -Requester Maintenance -Component PopProxy -Identity AB-EXCH-01

To start the PopProxy component, I needed to use the Exchange 2013 Health API as a requester.

Set-ServerComponentState -State Active -Requester HealthAPI -Component PopProxy -Identity AB-EXCH-01

As shown below:

Exchange 2013 421 4.3.2 Service not active

A customer of mine upgraded an Exchange 2013 cluster node from Exchange 2013 CU7 to Exchange 2013 CU10.  After the upgrade, emails failed to come in on the cluster node, with the following SMTP error being generated: "421 4.3.2 Service not active".


This can be reproduced by simply telnetting to the faulty Exchange 2013 server on port 25.  After entering MAIL FROM: in the SMTP conversation, the error occurs, and it is shown numerous times throughout the receive connector protocol logs on the Frontend Transport stack.
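
An illustrative session is shown below (server responses other than the error are omitted, and the hostname assumes the affected node AB-EXCH-02 referenced later in this post):

telnet AB-EXCH-02.domain.local 25
EHLO test.local
MAIL FROM:<user@test.local>
421 4.3.2 Service not active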

After further investigation I found that the majority of the Server Components on the faulty Exchange server were in an Inactive state.
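
A quick way to list them (a sketch run against the patched node):

Get-ServerComponentState -Identity AB-EXCH-02 | Where-Object {$_.State -ne "Active"}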


To bring the server back to an active state, the ServerWideOffline component was set to Active using a requester of Maintenance, which resumes all components.  This was done with the following command:

Set-ServerComponentState -State Active -Requester Maintenance -Identity AB-EXCH-02 -Component ServerWideOffline

Note: ForwardSyncDaemon and ProvisioningRps are Inactive by default.

After running this command all Exchange 2013 components were back to an active state apart from components disabled by default.

Exchange 2013 Cluster Issues 0x80071736

A customer of mine upgraded one node of a two-node DAG from Exchange 2013 CU7 to Exchange 2013 CU10.  After installing the update on the first cluster node, they contacted us complaining that the DAG was no longer available online in Failover Cluster Manager.

The following error was presented when attempting to bring the cluster online:

Failed to bring the resource 'Cluster Name' online.
Error code 0x80071736
The resource failed to come online due to the failure of one or more provider resources.


The provider resource it was complaining about was the Cluster IP Address, which was unavailable (10.10.0.245).

In addition to this error, the following errors were logged in event viewer on a regular basis.

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          26/10/2015 8:45:01 AM
Event ID:      10028
Task Category: None
Level:         Error
Keywords:      Classic
User:          DOMAIN\administrator
Computer:      Exchange1.domain.local


Description:
DCOM was unable to communicate with the computer DAG1.domain.local using any of the configured protocols; requested by PID 6bc4 (C:\Windows\system32\ServerManager.exe).




Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          26/10/2015 9:58:05 AM
Event ID:      1223
Task Category: IP Address Resource
Level:         Error
Keywords:     
User:          SYSTEM
Computer:      AB-EXCH-01.domain.local
Description:
Cluster IP address resource 'Cluster IP Address' cannot be brought online because the cluster network 'Cluster Network 1' is not configured to allow client access. Please use the Failover Cluster Manager snap-in to check the configured properties of the cluster network.




These cluster nodes both had two interfaces:
  • Mapi Interface
  • Replication Interface

After patching one of the Exchange 2013 servers, the MapiDagNetwork was disabled, with IgnoreNetwork set to true on the DAG network.  As this DAG network contained the DAG cluster IP address 10.10.0.245 (within the 10.10.0.0/20 subnet), it forced the cluster offline.  To re-enable the cluster we simply needed to set IgnoreNetwork back to false.
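
The change can be made from the Exchange Management Shell (a sketch, assuming the DAG network identity is DAG1\MapiDagNetwork based on the names above):

Get-DatabaseAvailabilityGroupNetwork -Identity "DAG1\MapiDagNetwork" | Format-List Name,Subnets,IgnoreNetwork
Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\MapiDagNetwork" -IgnoreNetwork:$false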


After setting IgnoreNetwork to false, we were able to manually start the DAG in Failover Cluster Manager by right-clicking it and selecting Bring Online.
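
The same step can be scripted on a cluster node with the FailoverClusters PowerShell module (a sketch; the core cluster group is usually named "Cluster Group"):

Import-Module FailoverClusters
Start-ClusterGroup -Name "Cluster Group"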



Monday, October 12, 2015

Removing Large Amounts of Spam from Exchange Server Queue Database

A customer running Exchange 2010 experienced a large number of spam emails in the submission queue (over 80,000).


All the spam emails had subjects starting with "ID#ALERT#", so we ran the following command to clean up any emails with "ID#ALERT#" in the subject.

Get-Message -ResultSize unlimited | Where-Object {$_.Subject -match "ID#ALERT#"} | Remove-Message


Due to the significant amount of spam, the command failed to remove the messages and just hung for hours.  However we could remove messages individually.

We could however list the emails with the following command:

Get-Message -ResultSize unlimited | Where-Object {$_.Subject -match "ID#ALERT#"}

This gave us the idea to write a foreach loop to remove each email individually from a CSV file.  First, we needed a detailed list with the identity of each spam email, so we ran the following command:

Get-Message -ResultSize unlimited | Where-Object {$_.Subject -match "ID#ALERT#"} | select Identity > C:\output.txt

We then formatted the file into a CSV, ensuring we specified a name for the Identity column.  As you can see, we added "Identity" to the top of the file.
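
As an aside, Export-Csv may be able to produce a ready-to-import file with the Identity header in a single step (a sketch of the alternative approach, untested here):

Get-Message -ResultSize unlimited | Where-Object {$_.Subject -match "ID#ALERT#"} | Select-Object Identity | Export-Csv "C:\output.txt" -NoTypeInformation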


We then ran the Remove-Message command once for each Identity in the CSV file using the following command:

Import-Csv "C:\output.txt" | ForEach-Object {Remove-Message -Identity $_.Identity -confirm:$false}

This ran the Remove-Message command over 80,000 times, once for each message in the queue, and cleaned it up successfully.

Sunday, October 4, 2015

Websense Appliance Services Not Starting after Hard Shutdown

When a Websense V5000 or V10000 appliance is forcefully shut down due to power loss or a hard system failure, the Filtering service, Policy service, User service and Usage monitor can fail to start in the Websense Appliance Manager portal upon restart.  This is an issue I have seen more than once, so I decided to do a write-up.

This issue occurs due to a number of temporary files that are not cleaned up (a cleanup that normally happens during a graceful shutdown).  To remove these files manually, we must connect to the appliance using an SSH shell session.

To connect to a shell session, you need to log in to Websense Appliance Manager first, then under Administration --> Toolbox, click Technical Support Tools and find the passcode under "Websense Remote Access".  This is the password used for SSH and is randomly generated by the appliance.

 
Next, log in to the IP address of the Websense Appliance (the same IP you used for the Appliance Manager web interface).
 
The username is "websense-ts" and the password is the one obtained above.
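
From a machine with an SSH client, the connection looks something like this (the appliance IP below is only an example):

ssh websense-ts@10.0.0.50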
 
 
Navigate to /opt/Websense/bin and remove all temporary p12 files which were not deleted due to the incorrect shutdown:

rm -f *.p12

Also remove the journal.dat file under /opt/Websense/bin using the following command:

rm -f journal.dat

After the temporary files have been removed, restart all Websense services.  This can be done from the Websense Appliance Manager website or from the shell by running the following command:

./WebsenseAdmin restart


Now all the Websense services on the appliance have returned to a running healthy state.