Tuesday, 13 November 2007

Questions & Answers (Updated)

To answer as many questions as possible, I decided to write a new post. I will update and re-post this page as I answer more questions.

Q. How do you collect this data?
A. I’ve alluded to the way I collect the data in previous posts. I have written a specialised piece of software to data mine the Armory and other sites. For example, to get it started I can point it at the Armory Arena Ladder pages, where all participating teams are listed. For each Arena Team the member characters are fetched; if a character is guilded, the guild list is fetched and all guild members are retrieved. This technique has a known blind spot: players who are not guilded and not part of an arena team are ‘invisible’. I have thought about different ways to counter this, but I’m pretty satisfied that the data is representative anyway.
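
The discovery process described above can be sketched as a breadth-first walk. This is only an illustration in Python, not the actual spider: the dictionaries stand in for Armory pages, and every team, guild and character name in it is made up.

```python
from collections import deque

# Hypothetical in-memory stand-ins for Armory pages; all names are invented.
ARENA_LADDER = {"Team A": ["Alice", "Bob"], "Team B": ["Carol"]}
CHARACTER_GUILD = {"Alice": "Guild X", "Bob": None, "Carol": "Guild Y",
                   "Dave": "Guild X", "Erin": "Guild Y", "Frank": None}
GUILD_ROSTER = {"Guild X": ["Alice", "Dave"], "Guild Y": ["Carol", "Erin"]}


def discover_characters(ladder, character_guild, guild_roster):
    """Return every character reachable from the ladder via teams and guilds."""
    seen_characters = set()
    seen_guilds = set()
    # Seed the queue with the members of every team on the ladder.
    queue = deque(member for team in ladder.values() for member in team)
    while queue:
        name = queue.popleft()
        if name in seen_characters:
            continue
        seen_characters.add(name)
        guild = character_guild.get(name)
        # A guilded character pulls in the whole guild roster, once per guild.
        if guild and guild not in seen_guilds:
            seen_guilds.add(guild)
            queue.extend(guild_roster.get(guild, []))
    return seen_characters


found = discover_characters(ARENA_LADDER, CHARACTER_GUILD, GUILD_ROSTER)
# Characters that exist but can never be reached from the ladder.
invisible = set(CHARACTER_GUILD) - found
```

In this toy data "Frank" is neither guilded nor on an arena team, so he ends up in `invisible`, which is exactly the blind spot mentioned above.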

Q. Do you use the ‘last modified’ or ‘last logged in’ parameters to determine if the character is still used?
A. I record all that information but I don’t discriminate against ‘stale’ characters. If you look at the ‘Level Count’ in my earlier post you will see a huge bump at level 60, which reflects both people who haven’t yet upgraded to The Burning Crusade and characters that haven’t been played in some time.

Q. Can you produce reports for levels 20, 40, 60 as well as level 70?
A. Absolutely, this is something I really want to do. Presentation is my only sticking point; I will be including levels 1-70 in charts where possible. See the two graphs below for an example of what I am working on producing in the future.

Note: These charts are produced from a tiny subset of data just to test out my charting software. Interestingly, the 'Talent Spec Balance' chart illustrates that for Priests, Shadow is the best levelling spec but the balance quickly switches over to Holy from level 68/69 as Priests become more in demand as healers.

Q. Can you segment your data by Arena Rating / Gear Rating / PvP Kills?
A. Yes, I probably can, but at the moment I am concentrating on updating my basic class-specific reports. Arena Rating and PvP Kills are easy to build discriminators on, and I have worked on a mechanism to segment level 70s by their ‘gear rating’ (just dinged 70, casual & raider).

Q. Why is realm (name) missing from your stats?
A. I have not included all realms yet. In Europe, for example, I have not picked up any of the non-English realms.

Q. How are you getting data for levels 1-9? The Armory doesn’t show that.
A. The Armory *used* to show low-level characters; I noticed them disappear in mid-October. So I still have a lot of 1-9 character data. During each refresh I try to fetch these character pages anyway, so if they reach level 10 or higher I will get updated data for them.

Q. Can you show a breakdown of levels/classes by battlegroup, realm, realm type, and Horde / Alliance balance at 70?
A. Absolutely, it has officially been added to my TODO list.

Q. Can you profile most popular classes in arena teams and group them by their rating?
A. Absolutely, this too has officially been added to my TODO list.

Q. Can you break down your class-specific stats by faction to see if racials make a difference in class choice for each faction?
A. Yes, this is doable too. I’ve kept as much data as possible with a view to doing ever more sophisticated reports with it.

UPDATE 13th November 2007

Q. Can you track population changes over time, for example talent balances as patches change dynamics in the game?
A. Yes. Originally I wanted to keep a full revision history for each character: each time my spider fetched a character page, an 'xml diff' of the changes would be stored. Instead, I chose to export my reports in an XML format containing raw numbers rather than the abstracted percentages you see here, the idea being that I could compare reports to show the high-level changes over time. I have this in a limited form for patch 2.1 (May 2007), much richer information since patch 2.2 (Sept 2007), and eventually I will be able to publish the shifts from patch 2.2 to 2.3.
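
Keeping raw numbers matters because shares can always be recomputed from counts, but not the other way round. As a rough Python sketch of comparing two such reports (the counts are invented for illustration, and the dictionary layout is an assumption rather than the real export format):

```python
def share_shift(old_counts, new_counts):
    """Percentage-point change in each category's share between two reports."""
    old_total = sum(old_counts.values())
    new_total = sum(new_counts.values())
    shifts = {}
    for key in sorted(set(old_counts) | set(new_counts)):
        # A category missing from one report counts as zero in that report.
        old_share = 100.0 * old_counts.get(key, 0) / old_total if old_total else 0.0
        new_share = 100.0 * new_counts.get(key, 0) / new_total if new_total else 0.0
        shifts[key] = round(new_share - old_share, 2)
    return shifts


# Made-up raw counts of level 70 Priests by talent spec in two reports.
report_patch_2_2 = {"Holy": 500, "Shadow": 300, "Discipline": 200}
report_patch_2_3 = {"Holy": 660, "Shadow": 330, "Discipline": 110}

shifts = share_shift(report_patch_2_2, report_patch_2_3)
```

Note that in this example the Shadow count grows from 300 to 330 while its share stays flat, because the total population grew too. That distinction is lost if only percentages are stored.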

Q. What hardware do you run this on?
A. The spider software, database server, file storage and report generation are all running on one of my old servers at the moment. Spidering is not particularly intensive as it's a fairly slow process. To be specific, the machine is a Dual AMD MP 2000+ (1.67GHz) with 2GB PC2100 DDR RAM, 2x18GB SCSI discs and a 250GB IDE disc. It's running Red Hat Linux 9; the only non-standard thing is that the XML repository is on a ReiserFS partition. Backup-wise, I have a full mirror of all the data on my home RAID5 partition.

Q. Do your computations take forever?
A. It depends on the type of report. For example, the Realm Balance reports are all done using information in the database and take around 5 minutes to run. The reports that break down details on characters require the XML character sheets and take much longer; these are done in phases.
   1. Select a list of desired characters from the database by class, level etc.
   2. Iterate through each character, fetch the XML sheet, extract all the required information and add it to a temporary 'report' table in the database.
   3. Use the 'report' table to produce stats. This step can be repeated without re-doing the first two, allowing me to develop the reports without having to re-fetch all the data.
This process runs at about 250,000 characters per hour.
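
The three phases above might look roughly like this in Python with an in-memory SQLite database. It is a sketch under stated assumptions: the table layouts, the XML element and attribute names, and the sample characters are all invented and do not reflect the real Armory sheets or my schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical character sheets keyed by name; element and attribute names
# are invented and do not reflect the real Armory XML.
SHEETS = {
    "Alice": '<character><health base="3200"/><mana base="2400"/></character>',
    "Bob": '<character><health base="3400"/><mana base="2600"/></character>',
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE characters (name TEXT, class TEXT, level INTEGER)")
db.executemany("INSERT INTO characters VALUES (?, ?, ?)",
               [("Alice", "Priest", 70), ("Bob", "Priest", 70), ("Carol", "Mage", 70)])
db.execute("CREATE TABLE report (name TEXT, base_health INTEGER, base_mana INTEGER)")

# Phase 1: select the desired characters from the database.
names = [row[0] for row in db.execute(
    "SELECT name FROM characters WHERE class = ? AND level = ? ORDER BY name",
    ("Priest", 70))]

# Phase 2: parse each sheet once and store the extracted values.
for name in names:
    sheet = ET.fromstring(SHEETS[name])
    db.execute("INSERT INTO report VALUES (?, ?, ?)",
               (name,
                int(sheet.find("health").get("base")),
                int(sheet.find("mana").get("base"))))

# Phase 3: produce stats from the report table; this can be re-run freely
# without touching the XML sheets again.
avg_health, avg_mana = db.execute(
    "SELECT AVG(base_health), AVG(base_mana) FROM report").fetchone()
```

The point of the temporary table is that only phase 2 is expensive. At 250,000 characters per hour that phase works out to roughly 69 sheets per second, whereas phase 3 is an ordinary database query.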

Q. Have Blizzard contacted you about the load / bandwidth you are generating?
A. No, Blizzard have not contacted me about the load my spider generates. I invested more time in the 'collector' aspect of the spider than any other part; there is a lot of timing, delay and caching logic which prevents it from producing a crippling load. The spider supports a wide range of HTTP/1.1 features, including If-Modified-Since and Accept-Encoding compression, if the Armory servers choose to honor them. Its User-Agent string clearly identifies this project and links to this blog, and all traffic comes from a single IP.
Interestingly, over at The Build Mine, Kuroshiro spotted a notice on the Armory updates page saying "... third-party sites that mine Armory data may need to make adjustments to account for the new file configurations ..." which implies they are aware of mining projects but, for the moment, haven't explicitly banned them.
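
For readers curious what those HTTP features look like in practice, here is a minimal Python sketch using only the standard library. It is not the spider's code: the User-Agent text, the URL, the `build_polite_request` helper and the `Throttle` class are all hypothetical names chosen for this example.

```python
import time
import urllib.request
from email.utils import formatdate


def build_polite_request(url, last_fetched=None):
    """Build a request that identifies itself and allows a conditional GET."""
    headers = {
        # Hypothetical identification string; a real one would name the
        # project and link to a page describing it.
        "User-Agent": "ExampleArmorySpider/0.1 (+http://example.invalid/about)",
        "Accept-Encoding": "gzip",
    }
    if last_fetched is not None:
        # HTTP dates are expressed in GMT; last_fetched is a Unix timestamp.
        headers["If-Modified-Since"] = formatdate(last_fetched, usegmt=True)
    return urllib.request.Request(url, headers=headers)


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()


request = build_polite_request("http://example.invalid/character-sheet.xml",
                               last_fetched=0)
```

A server that honors If-Modified-Since can answer 304 Not Modified with no body, so an unchanged page costs almost no bandwidth. Calling `Throttle.wait()` before each fetch spaces requests out regardless of how fast the rest of the spider runs.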

Q. Why do you use an apostrophe to pluralize your class names in the chart titles?
A. Because at the time the apostrophe provided an easy way to separate my PHP variable name from the "s" (see code). I have recently fixed this.

Q. Can you show the Realm Balance reports but only for level 70 characters?
A. Yes, I just posted it.

Q. Can you determine how many Arena Teams are going to be included / disqualified by the tighter rules for titles ("20 games played and at least one person with 20% of the total games played")?
A. Yes, this too has officially been added to my TODO list.
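
As a first sketch of such a check in Python, under one possible reading of the quoted rule: a team needs at least 20 games, and at least one member must have played in at least 20% of the team's games. That interpretation, the function name and the sample teams are all assumptions made for illustration.

```python
def team_qualifies(team_games, member_games, min_games=20, min_share_percent=20):
    """Apply one reading of the title rule to a single team.

    team_games is the team's total games played; member_games maps each
    member to the number of those games they took part in.
    """
    if team_games < min_games:
        return False
    # Integer arithmetic avoids floating-point rounding at the 20% boundary.
    return any(games * 100 >= team_games * min_share_percent
               for games in member_games.values())


# Invented sample teams: (total games, games per member).
teams = {
    "Team A": (50, {"Alice": 10, "Bob": 5}),    # Alice has exactly 20%
    "Team B": (19, {"Carol": 19}),              # too few games
    "Team C": (100, {"Dave": 19, "Erin": 19}),  # nobody reaches 20%
}

included = sorted(name for name, (total, members) in teams.items()
                  if team_qualifies(total, members))
disqualified = sorted(set(teams) - set(included))
```

Running the same check over every team on the ladder would give the included / disqualified totals asked about.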


If you have any questions / comments, feel free to post here or email me :)

12 comments:

Aurdon said...

Again Bravo!

I've spent a few hours trying to figure out how to do what you have already accomplished on your own. I don't have much experience in writing spiders but with a site like this there is no need really. Keep the data flowing!

Anonymous said...

It would be interesting to see the population fluctuations every time Blizzard announces a buff/nerf to a particular class, and how this affects the overall population base.

For example, back when hunters were massively overpowered, when bc first came out, everyone I knew rerolled a hunter. They were a literal plague. Starting a 5 man group, you would get 5 asking to join.

Same thing happened with warlocks, recently. Though they have calmed down, somewhat, as of late.

Now, with Blizzard's buffing of paladins to surpass prot warrior tanking hp levels, I foresee a population drop of prot warriors either quitting or respeccing arms to pvp - and a massive jump in protection paladins becoming main tanks.

Which will be interesting. Their superior aggro will make certain enrage timer bosses trivial. A change of the whole dynamic of tanking in wow. Though I feel sorry for protection warriors. They can't really do anything else, other than tank. Lol, I love squishing them on the rare occasions you see them in pvp.

Anonymous said...

One further step you could use is to compile a list of character names from the dataset. You could then use that list to search for character profiles in the armory. Follow that up with a guild search if the character profile is in a guild.

That would cover a larger playerbase than you currently do.

It still doesn't get you all unguilded players but would get you more guilds.

Anonymous said...

Just wanted to send a big thanks from our entire guild. We really appreciate your excellent job.

Keep it up!

Pain
Horde guild at Silvermoon EU

Anonymous said...

god, i've got so many questions... what hardware do you run this on? Do your computations take forever? Have blizzard contacted you personally about the enormous amount of page requests you must be making? either way, keep up the good work!

Unknown said...

The title of your sample_graph_hp_mp.png should say "10,000 Priests" instead of "10,000 Priest's".
I don't know why everyone on the internet suddenly seems to think that you need an apostrophe to pluralize.

Anonymous said...

Nice work! I'd love to see the class breakdown on my realm showing only lvl 70's. That information would be real interesting.

Anonymous said...

Very nice site and information!

I'd be interested to see a breakdown of gear worn by level 70 characters. A chart which showed, perhaps, the top 5 items in each item slot (helm, chest, etc) broken down by class.

Anonymous said...

If i compare the Full-Stats from 1. Nov. and August for my Realm it's kinda strange - 2,900 in August, and ~5k in November. Did it "miss" the characters in August?

Anonymous said...

createLineChartByLevel( $report_name,
array("baseHealth", "baseMana"),
"Base Health & Mana by Level (10,000 $class's)");


This is how to do it properly in PHP:

createLineChartByLevel( $report_name,
array("baseHealth", "baseMana"),
"Base Health & Mana by Level (10,000 ${class}s)");


${varname} is the explicit way of telling PHP what a variable name is.

Problem solved.

Okoloth said...

Thanks #10, this is exactly what I did to fix the problem.

Anonymous said...

Hello, I am very interested in the work you have done here and would love to see some updated statistics. Are you still working on this project? If not, is your source code available for others to learn from?