UNSW Business School

The promise and problems of including big data in official statistics

By Fleur Johns, Caroline Compton & Wayne Wobcke  November 13, 2018

The Australian Bureau of Statistics will soon announce the kinds of information it will collect in the next national census in 2021. If international trends are a guide, big data will comprise a growing part of ABS data collection and analysis.

This may promise greater timeliness and efficiency compared with the traditional paper-based census, but using big data to measure populations and economies is not without challenges.

Debates about how democratic governments should count the people they serve are ongoing in Australia, the US and India. The use of digital technologies for state measurement seems likely to intensify these debates as significant questions emerge around the practice.

For centuries, states have counted and categorised people. Census data and other official statistics are used for government planning and budgeting, to determine political districts for elections, and for many other purposes.

Official statistics also help to shape a population's sense of itself. For these reasons, state counting practices have often been controversial.

In Australia, changing census practice has been a part of ongoing debate about ensuring First Nations people are properly represented. Historic undercounting of Aboriginal and Torres Strait Islander people was redressed by the abandonment of language in the census that referred to blood quantums – which are now widely accepted as racist – alongside other factors.

In the US, state counting is likewise a matter of intense dispute. California is among those states presently suing the US federal government because of a question about citizenship status the Trump administration has proposed adding to the 2020 census.

California argues fewer non-citizens will complete the census if the question is included. This would lead to a lower population count and reduced federal funding for states with high numbers of non-citizens.

India has also seen heated national debate about the gathering of caste data and the categorisation of housewives as non-workers.

New issues of this kind are likely to emerge as government statistics offices around the world introduce digital data into their work.

The UN is presently spearheading efforts by member states to explore the use of new, digital data sources and technologies for official statistics. The ABS is involved in this endeavour. Since late 2017, for example, the ABS has been analysing supermarket scanner data to try to improve consumer price index (inflation) measurement.

Other possibilities being explored for the use of digital data to improve state measurement include:

  • Using anonymised mobile phone data – bought from or donated by commercial providers – for tourism statistics, to understand internal movement, commuter flows and population distribution, and to try to estimate characteristics of particular population sectors.
  • Web-scraping (extracting publicly available information from websites) to estimate labour force participation, or using Google Trends to try to 'nowcast' unemployment (produce immediately up-to-date estimates).
  • Analysing satellite image and remote sensing data to estimate crop planting and predict harvest yield.

The aim of these efforts is to make official statistics more accurate, more affordable to gather, and more attentive to geographically remote or otherwise marginalised communities. While there may be enormous potential to improve official statistics in these ways, big data use for state measurement raises thorny issues.

The first of these is the difficulty of auditing such data sources. All datasets come with blind spots and biases. Given the contentiousness of state counting, and the potentially high stakes of miscounting, it's important the public maintains an overall sense of – and capacity to query – how, where, and why data is being collected.

This may be difficult to ensure when data used for official measures are privately sourced.

While the ABS has the legal right to compel the provision of information, including from data providers, insight into how private companies collect and process data may be hard to obtain, and may not be shareable publicly.

Reliance on commercial data sources could also leave official statisticians dependent on privately owned infrastructure – cell tower infrastructure, for instance. The distribution and maintenance of this infrastructure is driven by commercial interests, potentially working against the needs of responsible public data collection.

Another problem with the use of big data in official statistics is that data gathered are often not fit for the kinds of purposes states are pursuing.

Data of this kind are messy and unstructured, and it can be hard to separate information from noise in their analysis. Because machine-learning methods for unstructured data are never 100% accurate, any inferences drawn must be carefully validated.

Statisticians are well aware of these limitations, but face challenges communicating with policy-makers and the general public about them.

There is a risk that because digital data are relatively abundant, those in charge of state measurement practices will make use of that data without due regard to questions of what should, and should not, be measured for particular purposes.

Without knowing when and how they are being counted, the public cannot be part of that discussion. It is incumbent on governments to bridge that gap, and incumbent on all Australians to take an active interest in these practices as they develop.

Fleur Johns is a professor and associate dean at UNSW Law, Caroline Compton is a postdoctoral research associate, and Wayne Wobcke is an associate professor at the UNSW School of Computer Science and Engineering. A version of this post appeared on The Conversation. 
