Minutes Waterloo Polaris Advisory Group March 14, 2002

Attendees:

Hon Tam Engineering Computing
Bruce Campbell Engineering Computing
Erick Engelke Engineering Computing
Stephen Sempson Science
Tim Farell Information Systems and Technology (IST)
Jim Johnston MFCF
Nevil Bromley Arts (Chair)
Daniel Delattre Applied Health Science (AHS) (Secretary)
Trevor Bain Environmental Studies
Ray White IST

Regrets:

Dan Hergott Math
Bernie Roehl ESAG Representative

Agenda

1) Nexus Disaster
We first experienced the FRS problem in early January.

This is how FRS is supposed to work. Before a DC replicates its data to another DC, it first puts the data in the staging area of sysvol. This gives the DC a static copy of the data, so replication can run in the background. After each successful replication, the staging area data is deleted. The sysvol data is generated from the NTFRS database, which in turn is generated from information in AD.
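
The staging behaviour described above can be pictured with a small toy model. This is only a sketch of the stage, replicate, delete cycle as the minutes describe it, not the actual NTFRS implementation; the class name, method names and the `link_up` flag are made up for illustration.

```python
class DomainController:
    """Toy model of a DC with a sysvol staging area (not the real NTFRS)."""

    def __init__(self, name):
        self.name = name
        self.sysvol = {}   # live sysvol contents: path -> data
        self.staging = {}  # static copy waiting to be replicated

    def stage(self):
        # Copy the current data into staging so replication reads a static copy.
        self.staging = dict(self.sysvol)

    def replicate_to(self, partner, link_up=True):
        # link_up is a hypothetical stand-in for "the partner DC can be reached".
        self.stage()
        if not link_up:
            # Failed replication: the staged data stays where it is.
            return False
        partner.sysvol.update(self.staging)
        # Successful replication: the staged data is deleted.
        self.staging.clear()
        return True
```

In this model the staging area is empty after every successful replication and keeps its contents after a failed one, which is the behaviour that matters for the failure described next.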

The problem that we experienced was with the generation of sysvol data from the NTFRS database. The bug corrupted the data as it was generated. The corrupted data was then passed to the other DCs, which caused a ripple effect throughout the domain:

  • DNS names were deleted from the database
  • The DCs were not able to find each other for replication
  • The data in the staging area was not deleted because no replication succeeded
  • The hard drive partitions filled up because the data in the staging area was not deleted
  • The DCs stopped responding to I/O requests because there wasn't enough space for temp files and swapping
  • The lack of disk space caused data integrity problems on the system volume
  • The whole AD domain failed
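
The disk-filling step in this chain is simple arithmetic: if staged data is never deleted, every failed replication cycle adds to what is already there. A rough sketch follows; the function name and the figures in the usage lines are hypothetical, and no actual sizes were recorded at the meeting.

```python
def cycles_until_disk_full(free_mb, staged_mb_per_cycle, replication_ok=False):
    """Return how many replication cycles fit before the partition is full.

    Returns None when replication succeeds, because the staged data is
    deleted after every cycle and never accumulates.
    """
    if replication_ok:
        return None
    if staged_mb_per_cycle <= 0:
        raise ValueError("staged_mb_per_cycle must be positive")
    cycles = 0
    used = 0
    # Keep adding staged data until the next cycle would no longer fit.
    while used + staged_mb_per_cycle <= free_mb:
        used += staged_mb_per_cycle
        cycles += 1
    return cycles


# Illustrative numbers only: 1000 MB free, 300 MB staged per failed cycle.
print(cycles_until_disk_full(1000, 300))
print(cycles_until_disk_full(1000, 300, replication_ok=True))
```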

To fix the problem, Microsoft tried to rebuild the sysvol from an existing working copy on another server in the domain. When you do a sysvol rebuild, the old sysvol is deleted and a new one is generated. Microsoft ended up doing the rebuild on the GC of the domain. For a DC to act as a GC, it must first have a full working copy of the sysvol data. When Nausicaa lost its sysvol data during the rebuild, the GC server went down. MS also did a rebuild on Eng2k. When Nausicaa went down, it took Eng2k down with it. When both servers came back up, each tried to pull the sysvol information from the other to create its new sysvol data. Since neither server had a working sysvol, none was generated.
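
The deadlock between the two servers comes down to this: a rebuild needs at least one partner that still has a working sysvol. A minimal sketch of that check, assuming the state of each server is tracked in a plain dictionary; the function name is made up, and "SomeOtherDC" is a hypothetical third server added only for contrast.

```python
def find_sysvol_source(target, has_working_sysvol):
    """Pick a partner that can seed `target`, or None if nobody can."""
    for name, working in has_working_sysvol.items():
        if name != target and working:
            return name
    return None


# State after both rebuilds: neither server has a working sysvol.
state = {"Nausicaa": False, "Eng2k": False}
print(find_sysvol_source("Nausicaa", state))  # no usable source
print(find_sysvol_source("Eng2k", state))     # no usable source

# With a hypothetical third DC that still held a good copy, a source exists.
state["SomeOtherDC"] = True
print(find_sysvol_source("Nausicaa", state))
```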

Without a working Nexus domain sysvol, MS went to APEX to generate one. What they ended up with was a basic default sysvol without any GPOs. When we create a GPO in AD, the policy back-end is created in sysvol. Now, with a basic default sysvol, we have the front-end GPO links in AD but no back-end policies in sysvol. This move forced us to re-create every GPO in the domain.
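
The mismatch between the front-end links and the missing back-end can be expressed as a set difference. This is a sketch only: it assumes the GPO identifiers have already been collected from AD and from the sysvol policy folders by some other means, and the names in the usage lines are hypothetical.

```python
def orphaned_gpos(ad_links, sysvol_policies):
    """Return GPOs that are linked in AD but have no policy back-end in sysvol."""
    return sorted(set(ad_links) - set(sysvol_policies))


# Hypothetical GPO names; after the default sysvol was generated, the
# back-end side would have held none of our GPOs.
ad_links = ["Office-Install", "Desktop-Lockdown", "Printer-Map"]
sysvol_policies = []
print(orphaned_gpos(ad_links, sysvol_policies))
```

Every GPO this returns is one that has to be re-created.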

Since then Microsoft has sent me two hotfixes for the bug. I did not install the first hotfix because, according to an MS tech, it will corrupt Excel. I tested the second hotfix on the mirror domain and the stupid thing hangs the GC on every reboot. To this day, we still don't have a working hotfix for the FRS problem. There's no documentation about this problem on any of MS's tech sites either.

The data in AD is 95% stabilized and a new fail-over DC will be in production shortly.

2) Software Update
Most software package GPOs have been rebuilt.

3) GPO Documentation
We should improve the documentation. Options mentioned: screenshots, spreadsheets, FAZAM 2000.

4) Polaris Samba Mounts
No info yet.

5) Web Page Update
Tim will bring the pages up to date.

6) Rename WPAG
Next meeting.

7) NAV 7.6.1
Requires admin privileges.

8) Requisites for DCs
1 GB memory
Behind locked doors
Cooling
DCs will reside in Engineering and Math.


Created by: delattre@uwaterloo.ca 2002/03/22
Revised by: delattre@uwaterloo.ca 2002/04/17