For an upcoming visual series on designing the news contained in the Guardian newspaper, I’ve been data mining through a week’s worth of the papers, Monday to Saturday (Sunday’s issue is the Observer, so no Guardian then), and dissecting all of the information. The information I took from the paper in order to create the visuals is available (in an OpenOffice spreadsheet) for anyone who may want to see it. Some of the information I take will change from day to day depending on the type of visual I am trying to create, but it will all share a number of common statistics.
UK NEWS – 2706 / 541.2 / 11.73% / 98.65mm
RAF and navy hardest hit by £4.5bn MoD cuts – 872 / p4
Miliband urged to regulate private military – 475 / p9
Brown hints at taking powers from Holyrood – 337 / p10
Sick veterans being let down, say MPs – 316 / p10
Brown and Cameron woo farmers’ union – 706 / p13
Having the data in a spreadsheet means I can filter out the information I need to create the visuals. Some of the information I’ve been looking at includes the total number of words in each category, the average article length by author, category, day, etc., the percentage of the newspaper occupied, the total number of words, the total number of pages, the total number of pages containing news, the most popular stories, the most popular categories, and so on. Now that I have the info, it’s just about using it.
And now for the maths
This gets rather tedious after a while, but here’s an example of how I’ve been extracting the data for a couple of filters.
- Story averages for each category =
Total No. of words in the category / Total No. of stories
- Total category percentages =
(Total No. of words in the category / Total No. of words in all categories) * 100
- Total amount of vertical space category should hold on an A1 poster =
Height of A1 poster (841mm) * (Total category percentage / 100)
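For anyone who would rather script these sums than work them out in a spreadsheet, here is a minimal Python sketch of the three calculations. It is only an illustration, not the spreadsheet itself: the function names are made up, and the only figures used are the UK NEWS ones listed earlier (five stories, 11.73% of the paper). The week's total word count across all categories isn't given here, so the percentage function is defined but not run on real data.

```python
A1_HEIGHT_MM = 841  # height of an A1 poster in millimetres


def story_average(category_words, story_count):
    """Average number of words per story in one category."""
    return category_words / story_count


def category_percentage(category_words, all_words):
    """Share of all words (across every category) held by one category, in percent."""
    return category_words / all_words * 100


def poster_height_mm(percentage, poster_height=A1_HEIGHT_MM):
    """Vertical space in mm that a category should hold on the poster."""
    return poster_height * percentage / 100


# Word counts of the five UK NEWS stories listed earlier.
uk_news_words = [872, 475, 337, 316, 706]
uk_news_total = sum(uk_news_words)

print(uk_news_total)                                      # 2706
print(story_average(uk_news_total, len(uk_news_words)))   # 541.2
print(round(poster_height_mm(11.73), 2))                  # 98.65
```

The three printed values match the UK NEWS summary line (2706 / 541.2 / 98.65mm), which is a handy sanity check on the formulas.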
Even without creating the visuals, just having all of the news printed in the Guardian for one week is pretty interesting to see. Already you can see the trends in the news, and how the paper ranks stories by the position in which they are printed and the number of words they are given.