Verified:

qzjul Game profile

Administrator
Game Development
10,263

Jul 30th 2010, 0:42:41

Hi all; sorry about the major disaster we had there. Here's what happened, FYI.

I'm planning on moving the server on the weekend, so in preparation for that I wanted to do a quick test to make sure that the server would, in fact, come back after I moved it.

Unfortunately, as you can see it did not.

Why not?

I forgot some important little details to do with RAID arrays. A while back we had a disk failure, and I hot-swapped it for a new drive; no problem, the game kept on running and hardly anybody noticed anything ;) Then later I added a 3rd drive to the array, to make sure we had an extra mirror just in case.

However, I forgot to rebuild the mdadm.conf file and update the initramfs, which is basically what lets the boot process work out which drives belong to which array.


So when I rebooted, it was expecting different partitions on different drives, and got totally confused. This was at 11pm my time. I suspect I might have figured it out last night, except that each boot attempt took nearly 10 minutes to fail and drop me to a BusyBox prompt, which really dragged out the time needed to test things....

I ended up booting to a LiveCD about 10 times, verifying the RAID was good, and looking stuff up online trying to figure out why the heck the boot sequence couldn't figure out what the drives were =/

Anyway, I went to bed at 4am, got up at 7am to go to work, looked up a few things there, and built a list of commands to try; I got home at 6pm, and as I was booting thought of the solution, fixed it, and here we are.... Well, and it also forced a check of all the drives in the system, as there had been 240 days without a check (the system had been online for 192 days); that took 30 or 40 minutes.

So the lesson of the day:

If you ever change a RAID array (especially a hot-swap).... update your mdadm.conf AND initramfs RIGHT THEN AND THERE, because if you reboot, everything will be totally fubared.
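
For anyone wanting to catch this before a reboot, here is a minimal sketch (not what qz actually ran) of checking whether a saved mdadm.conf still matches the live arrays. It assumes the ARRAY-line format that `mdadm --detail --scan` prints; the function names and the sample strings are made up for illustration, and the post doesn't say which attribute was actually stale on the game server.

```python
def parse_array_lines(text):
    """Map each array device to the key=value attributes on its ARRAY line."""
    arrays = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        tokens = line.split()
        if len(tokens) < 2 or tokens[0].upper() != "ARRAY":
            continue
        attrs = {}
        for token in tokens[2:]:
            key, sep, value = token.partition("=")
            if sep:
                attrs[key.lower()] = value
        arrays[tokens[1]] = attrs
    return arrays


def stale_arrays(conf_text, scan_text):
    """Return {device: {attr: (in_conf, in_scan)}} for every attribute that differs."""
    conf = parse_array_lines(conf_text)
    scan = parse_array_lines(scan_text)
    report = {}
    for device in sorted(set(conf) | set(scan)):
        old = conf.get(device, {})
        new = scan.get(device, {})
        diffs = {
            key: (old.get(key), new.get(key))
            for key in sorted(set(old) | set(new))
            if old.get(key) != new.get(key)
        }
        if diffs:
            report[device] = diffs
    return report


# Hypothetical sample data, NOT taken from the real server.
OLD_CONF = """
# mdadm.conf written back when the mirror had two members
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=aaaa1111:bbbb2222:cccc3333:dddd4444
"""
CURRENT_SCAN = """
ARRAY /dev/md0 level=raid1 num-devices=3 UUID=aaaa1111:bbbb2222:cccc3333:dddd4444
"""

if __name__ == "__main__":
    # An empty dict would mean the saved config still matches the live arrays.
    print(stale_arrays(OLD_CONF, CURRENT_SCAN))
```

On a Debian/Ubuntu-style box the live text would normally come from `mdadm --detail --scan`, and after fixing /etc/mdadm/mdadm.conf the initramfs is usually regenerated with `update-initramfs -u`; treat both as assumptions about this particular server.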




Addendum:

To adjust for the unexpected downtime, all servers except express have been extended for a full day. All market packages will stay on the public market 19 hours longer than usual. All countries have had 19 hours added to the last time they played. For most countries, the downtime will not be significant.

There were around 30 or so countries that logged in before the db was updated. These countries will display a negative "Last Played" time on the portal. All that means is that these countries won't gain turns until 19 or so hours have passed.
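
To make the negative "Last Played" time a bit more concrete, here is a rough sketch of the arithmetic only. It is not the actual game code; the function names and the timestamps are hypothetical, and the only number taken from the post is the 19 hours.

```python
from datetime import datetime, timedelta

DOWNTIME = timedelta(hours=19)  # the adjustment described in the post


def shift_for_downtime(last_played, downtime=DOWNTIME):
    """Push a country's last-played time forward so the outage doesn't count."""
    return last_played + downtime


def hours_since_last_played(last_played, now):
    """Elapsed hours as the portal might show them; negative if in the future."""
    return (now - last_played).total_seconds() / 3600


# Hypothetical timestamps, for illustration only.
now = datetime(2010, 7, 30, 0, 30)             # shortly after the server came back
played_before_outage = datetime(2010, 7, 29, 3, 30)
played_after_restart = now                      # logged in before the db update

if __name__ == "__main__":
    for label, last in [("before outage", played_before_outage),
                        ("after restart", played_after_restart)]:
        adjusted = shift_for_downtime(last)
        print(label, hours_since_last_played(adjusted, now))
```

A country that last played before the outage ends up looking as if the downtime never happened, while one that logged in before the update gets a last-played time 19 hours in the future, which is why it shows as negative.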

We realize that this fix is not perfect, but it appears to be the most fair solution. Thank you for your understanding and sorry for the downtime.

Edited By: Slagpit on Jul 30th 2010, 1:06:10
Finally did the signature thing.

Uticant Game profile

Member
150

Jul 30th 2010, 0:51:44

What he said.

Thanks for making me feel retarded for your flub qzjul!

BobbyATA Game profile

Member
2384

Jul 30th 2010, 1:36:03

FYI (to other players): Over the 19 or so hours of downtime countries did not receive turns and private markets went unaffected.

I don't understand what qz said about players logged in not receiving turns for 19 hours. Perhaps these players did receive turns during the last 19 hours so they won't for the next 19 hours. Can someone plz explain?

enshula Game profile

Member
EE Patron
2510

Jul 30th 2010, 1:45:24

when qz says 19 hours added it seems like he means people who played just before the server went down should wait ~18 hours to get the 6 bonus turns

my countries all have the amount of turns as if time stopped during downtime

also things which were en route to market haven't hit yet

enshula Game profile

Member
EE Patron
2510

Jul 30th 2010, 1:47:31

also it sounds like 30 countries logged in before qz got a chance to set things up the way he wanted and got 19 hours worth of turns immediately

those countries just won't get turns for 19 hours, to let everyone else catch up

Billyjoe of UCF Game profile

Member
1523

Jul 30th 2010, 1:48:32

i was going to say it was like i last played 4 hours ago lol... i was like yeah that was totally yesterday!

Slagpit Game profile

Administrator
Game Development
5055

Jul 30th 2010, 1:51:38

Originally posted by BobbyATA:
FYI (to other players): Over the 19 or so hours of downtime countries did not receive turns and private markets went unaffected.


This is only true for those players who didn't log in to the game during the first few minutes after it came back up.

Originally posted by BobbyATA:
I don't understand what qz said about players logged in not receiving turns for 19 hours. Perhaps these players did receive turns during the last 19 hours so they won't for the next 19 hours. Can someone plz explain?


Those players who did log in before we had a chance to adjust things received turns as they normally would have. As a result, they won't receive turns for the next 19 hours.

Dragonlance Game profile

Member
1611

Jul 30th 2010, 1:58:05

were there any large effects that have been noticed on the team server public market? or for that matter on any public markets?

it would appear not, as the game as a whole was down and whatever DB setup you did afterwards rectified any issues?:P

just want my bushels to sell!!:p

snawdog Game profile

Member
2413

Jul 30th 2010, 2:00:49

Well i dunno, something is whack with tourney though, i logged in a few minutes ago with only 19 turns..
So was tourney turns frozen?
ICQ 364553524

Dragonlance Game profile

Member
1611

Jul 30th 2010, 2:03:56

they all were reset and time added on if u read slag/qz's posts, so as to make it as fair as possible.

snawdog Game profile

Member
2413

Jul 30th 2010, 2:07:01

ok, re-read..got it..
ICQ 364553524

qzjul Game profile

Administrator
Game Development
10,263

Jul 30th 2010, 3:50:39

yea, hopefully that equalises everything.... the 29th didn't happen, we went straight from the 28th to the 30th :) well with 5 hours of a bit of both of them i guess heh
Finally did the signature thing.

jedioda Game profile

Member
395

Jul 30th 2010, 7:05:58

Great job qzjul. So now you have a RAID1 with 3 disks ?

... The only thing is that I will not play the tournament last day because I am going on vacation tomorrow.

AoS Game profile

Member
521

Jul 30th 2010, 7:10:50

I'm very upset with you now, qz >:( You didn't say we'd be missing days!!1!
The dreamer is banished to obscurity.

azmodii Game profile

Member
228

Jul 30th 2010, 9:07:38

I did that once. I never figured out how to fix it. I ended up ghosting the whole raid to one drive and rebuilding it.

You make me feel like a noob! :D

Good work Qz. That wasn't an easy fix!
- EoEA ~ End Of Earth Alliance -

"I will slaughter them like a wolf among lambs! The rivers will run red with the blood of my enemies, the skies will rain fire! And when the land parts beneath them... I shall be the in emptiness waiting!"

qzjul Game profile

Administrator
Game Development
10,263

Jul 30th 2010, 15:10:53

Heh, thx for the vote of confidence... I felt pretty dumb about 30 mins into the whole ordeal when I started to realize *why* it happened after bouncing some ideas off of slagpit...

It took me actually getting some sleep and having some time to think to realize that there is an /etc/mdadm/mdadm.conf in the initramfs as well as on the / RAID drive... and that I needed to regenerate it.


A lot of the confusion stemmed from the fact that the error was something like ALERT! /dev/disk/by-uuid/(hash) does not exist!
... when in fact I could *see* that it *DID* exist when I started with a LiveCD and ran mdadm



Also, I don't know what was with the 5-15 minutes (seriously) that it sat there saying "Loading..." before the error popped up and dropped me to a BusyBox shell... that was probably the most aggravating part, as it really slowed down the process of trying a number of different things....
Finally did the signature thing.

diez Game profile

Member
1340

Jul 30th 2010, 17:20:22

Anyway, good job qz ;D

Pangaea

Administrator
Game Development
822

Jul 31st 2010, 4:11:28

yes, qz did a great job at getting things up and running quickly. The downtime lasted for longer than we would have liked, but most of the time was when qz was asleep or at work, as has been noted.

part of the problem with this being a hobby and not our main job ;)
-=Dave=-
Earth Empires Staff
pangaea [at] earthempires [dot] com

Boxcar - Earth Empires' Clan & Alliance Hosting
http://www.boxcarhosting.com

Requiem Game profile

Member
EE Patron
9477

Jul 31st 2010, 14:47:35

I think most people here can understand that. Qz, just out of nerdy curiosity, what kind of server do you have running this game? I build computers as a hobby so just wondering :p

I used to run a Counter-Strike server that was set up with RAID. I must say that is a good way to go ;) Took the server down when I left college (no more free bandwidth haha)

qzjul Game profile

Administrator
Game Development
10,263

Aug 2nd 2010, 3:28:56

In terms of processor, it's just a Core 2 Duo; only 4GB of RAM (though I was thinking of upping that to 8GB, not because we need it but because RAM is cheap and DDR2 isn't getting any cheaper...)

In terms of hard drives (the main topic of this thread heh)...

1x 2TB + 1x 2TB waiting on desk to be put in
2x 1TB
3x 500GB
1x 320GB


RAID1 across smaller partitions on the 2TB + 1TB drives for / and /home


All in one of these: (Antec P180)

http://www.antec.com/...php?Type=Mg==&id=Njc0
Finally did the signature thing.