To answer as many questions as possible I decided to write a new post. I will update and re-post this page as I answer more questions.
Q. How do you collect this data?
A. I’ve alluded to the way I collect the data in previous posts. I have written a specialised piece of software specifically to data mine the Armory and other sites. For example, to get it started I can point it at the Armory Arena Ladder pages, where all participating teams are listed. For each Arena Team the team member characters are fetched; if they are guilded then the guild list is fetched and all guild members are retrieved. This technique has some problems: players that are not guilded and not part of an arena team are ‘invisible’. I have thought about different ways to counter this but I’m pretty satisfied that the data is representative anyway.
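The discovery walk described above (ladder teams, then team members, then each member's guild roster) can be sketched roughly as follows. This is only an illustration, not the actual spider: the function name and the three fetcher callbacks are hypothetical stand-ins for whatever downloads and parses the Armory pages.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

# Hypothetical representation: a character is identified by (realm, name).
Character = Tuple[str, str]


def discover_characters(
    teams: Iterable[str],
    fetch_team_members: Callable[[str], Iterable[Character]],
    fetch_guild_of: Callable[[Character], Optional[str]],
    fetch_guild_members: Callable[[str], Iterable[Character]],
) -> Set[Character]:
    """Collect every character reachable from the given arena teams."""
    seen_characters: Set[Character] = set()
    seen_guilds: Set[str] = set()
    for team in teams:
        for member in fetch_team_members(team):
            seen_characters.add(member)
            guild = fetch_guild_of(member)
            # Unguilded members contribute only themselves; each guild
            # roster is fetched at most once.
            if guild is None or guild in seen_guilds:
                continue
            seen_guilds.add(guild)
            seen_characters.update(fetch_guild_members(guild))
    return seen_characters
```

A character who is in neither an arena team nor a guild is never passed to any of the fetchers, which is exactly the 'invisible' population mentioned above.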
Q. Do you use the ‘last modified’ or ‘last logged in’ parameters to determine if the character is still used?
A. I record all that information but I don’t discriminate against ‘stale’ characters. If you look at the ‘Level Count’ in my earlier post you will see a huge bump at level 60, denoting people who haven’t yet upgraded to The Burning Crusade but also characters that haven’t been played in some time.
Q. Can you produce reports for levels 20, 40, 60 as well as level 70?
A. Absolutely, this is something I really want to do. Presenting it is my only sticking point; I will be including levels 1-70 in charts where possible. See the two graphs below for an example of what I am working on producing in the future.
Note: These charts are produced from a tiny subset of data just to test out my charting software. Interestingly, the 'Talent Spec Balance' chart illustrates that for Priests, Shadow is the best levelling spec but the balance quickly switches over to Holy from level 68/69 as Priests become more in demand as healers.
Q. Can you segment your data by Arena Rating / Gear Rating / PvP Kills?
A. Yes, I probably can, but at the moment I am concentrating on updating my basic class-specific reports. Arena Rating and PvP Kills are easy to build discriminators on, and I have worked on a mechanism to segment level 70s by their ‘gear rating’ (just dinged 70, casual & raider).
Q. Why is realm (name) missing from your stats?
A. I have not included all realms yet. In Europe, for example, I have not picked up any of the non-English realms.
Q. How are you getting data from levels 1-9? The Armory doesn’t show that.
A. The Armory *used* to show low level characters; I noticed them disappear mid-October. So I still have a lot of 1-9 character data. During each refresh I try to fetch these character pages anyway; should they reach level 10 or higher, I will then get updated data for them.
Q. Can you show a breakdown of levels/classes by battlegroup, realm, realm type, and Horde / Alliance balance at 70?
A. Absolutely, it has officially been added to my TODO list.
Q. Can you profile most popular classes in arena teams and group them by their rating?
A. Absolutely, this too has officially been added to my TODO list.
Q. Can you break down your class-specific stats by faction to see if racials make a difference in class choice for each faction?
A. Yes, this is doable too. I’ve kept as much data as possible with a view to doing ever more sophisticated reports with it.
UPDATE 13th November 2007
Q. Can you track population changes over time, for example talent balances as patches change dynamics in the game?
A. Yes. Originally I wanted to keep a full revision history for each character: each time my spider fetched their character page, an 'xml diff' of the changes would be stored. Instead, I chose to export my reports in an XML format - not the abstracted percentages you see here but raw numbers, the idea being that I could compare reports to show the high-level changes over time. I have this in a limited form for patch 2.1 (May 2007), much richer information since patch 2.2 (Sept 2007), and eventually I will be able to publish the shifts from patch 2.2 to 2.3.
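To illustrate why raw numbers matter here: with the counts from two reports in hand, the shift in each category's share can be worked out directly. The sketch below assumes each report has already been reduced to a plain mapping of category to character count; the function name is hypothetical and the XML export format itself is not shown.

```python
from typing import Dict


def percentage_shift(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return the change in share, in percentage points, per category."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        raise ValueError("each report needs at least one character")
    shifts: Dict[str, float] = {}
    # A category missing from one report counts as zero in that report.
    for key in sorted(set(before) | set(after)):
        share_before = 100.0 * before.get(key, 0) / total_before
        share_after = 100.0 * after.get(key, 0) / total_after
        shifts[key] = share_after - share_before
    return shifts
```

Comparing shares rather than raw counts keeps the result meaningful even when the two reports cover different numbers of characters.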
Q. What hardware do you run this on?
A. The spider software, database server, file storage and report generation are all running on one of my old servers at the moment. Spidering is not particularly intensive as it's a fairly slow process. To be specific, the machine is a Dual AMD MP 2000+ (1.67GHz) with 2GB PC2100 DDR RAM, 2x18GB SCSI discs and a 250GB IDE disc. It's running Red Hat Linux 9; the only non-standard thing is that the XML repository is on a ReiserFS partition. Backup-wise, I have a full mirror of all the data on my home RAID5 partition.
Q. Do your computations take forever?
A. It depends on the type of report. For example, the Realm Balance reports are all done using information in the database and take around 5 minutes to run. The reports that break down details on characters obviously require the XML character sheets and take much longer; these are done in phases.
1). Select a list of desired characters from the database by class, level etc.
2). Iterate through each character, fetch the XML sheet, extract all the required information and add it to a temporary 'report' table in the database.
3). Use the 'report' table to produce stats. This step can be repeated without re-doing the first two, allowing me to develop the reports without having to re-fetch all the data.
This process runs at about 250,000 characters per hour.
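The three phases above could look something like the following sketch. It uses SQLite and a made-up schema purely for illustration: the table and column names, the `spec` attribute on the XML sheet, and the `fetch_sheet` callback are all assumptions, not the real database or Armory format.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Callable, Dict


def build_report(
    conn: sqlite3.Connection,
    fetch_sheet: Callable[[int], str],
    char_class: str,
    level: int,
) -> None:
    """Phases 1 and 2: select characters, then fill the 'report' table."""
    # Phase 1: pick the desired characters from the database.
    rows = conn.execute(
        "SELECT id FROM characters WHERE char_class = ? AND level = ?",
        (char_class, level),
    ).fetchall()
    # Phase 2: read each XML sheet and store the extracted fields.
    conn.execute("DROP TABLE IF EXISTS report")
    conn.execute("CREATE TABLE report (character_id INTEGER, spec TEXT)")
    for (character_id,) in rows:
        sheet = ET.fromstring(fetch_sheet(character_id))
        conn.execute(
            "INSERT INTO report VALUES (?, ?)",
            (character_id, sheet.get("spec")),
        )
    conn.commit()


def spec_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Phase 3: produce stats from the 'report' table only."""
    rows = conn.execute("SELECT spec, COUNT(*) FROM report GROUP BY spec").fetchall()
    return dict(rows)
```

Because `spec_counts` reads nothing but the 'report' table, it can be changed and re-run as often as needed without touching the XML sheets again.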
Q. Have Blizzard contacted you about the load / bandwidth you are generating?
A. No, Blizzard have not contacted me about the load my spider generates. I invested more time in the 'collector' aspect of the spider than any other part; there is a lot of timing / delays / caching which prevents it from producing a crippling load. The spider supports a huge range of HTTP/1.1 features, including things like If-Modified-Since and Accept-Encoding compression, if the Armory servers choose to honor them. Its User-Agent string clearly identifies this project and includes a link to this blog and, lastly, all traffic comes from a single IP.
Interestingly, over at The Build Mine, Kuroshiro spotted a notice on the Armory updates page saying "... third-party sites that mine Armory data may need to make adjustments to account for the new file configurations ..." which implies they are aware of mining projects but for the moment haven't explicitly banned them.
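For readers curious what a 'polite' fetch looks like in practice, here is a minimal sketch using only Python's standard library. It is not the spider's actual code: the User-Agent value, the URL handling and the `Throttle` class are hypothetical, and only the header names come from HTTP itself.

```python
import time
import urllib.request
from email.utils import formatdate
from typing import Callable, Optional

# Hypothetical identification string: project name plus a contact link.
USER_AGENT = "ExampleArmorySpider/0.1 (+http://example.com/blog)"


def build_request(url: str, last_fetched: Optional[float] = None) -> urllib.request.Request:
    """Build a request that identifies itself and allows cheap responses."""
    headers = {"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"}
    if last_fetched is not None:
        # Lets the server reply 304 Not Modified instead of resending the page.
        headers["If-Modified-Since"] = formatdate(last_fetched, usegmt=True)
    return urllib.request.Request(url, headers=headers)


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle.wait()` before each `urllib.request.urlopen(build_request(...))` keeps the request rate bounded. Two caveats with this approach: `urlopen` reports a 304 reply as an `HTTPError` with code 304, and a response sent with `Content-Encoding: gzip` has to be decompressed by the caller.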
Q. Why do you use an apostrophe to pluralize your class names in the chart titles?
A. Because at the time the apostrophe provided an easy way to separate my PHP variable name from the "s" (see code), which I have recently fixed.
Q. Can you show the Realm Balance reports but only for level 70 characters?
A. Yes, I just posted it.
Q. Can you determine how many Arena Teams are going to be included / disqualified by the tighter rules for titles: "20 games played and at least one person with 20% of the total games played"?
A. Yes, this too has officially been added to my TODO list.
If you have any questions / comments, feel free to post here or email me :)