Big tech firms think deeply about data and don’t leave any arbitrage on the table. Customer data collected online is aggregated and sold across various platforms, extracting all possible value through targeted advertisements and sales. The focus of private sector AI tools is to derive insights for the business and increase corporate profits. Non-tech businesses too are waking up to the fact that they own data that has dual use. Most hospitals now understand the importance of utilizing patient care data for analytics. Most supermarkets understand the importance of data residing on their ERP servers.
Can governments use similar ways of thinking to identify opportunities for public policymaking? How can one leverage the data in the public sector, and what does leveraging even mean in this context? If we can answer these questions, we get closer to extracting value from the data.
This article explains the heuristics of 'Leveraging' and 'Combining' data. These heuristics provide a way to think about utilizing public data held by the government to develop AI applications. I focus on the government for two reasons. First, the government's aim is not profit maximization. Most government departments don't think like a business about the data they have, and thus end up not extracting adequate value out of it. Also, governments in developing countries usually do not carry out an inventory analysis of their data. A new way of thinking might goad them to do so. Second, the leverage gained by governments thinking innovatively about data is immense. It outstrips any business when it comes to generating social welfare, because governments can intervene through public policy measures and drive positive outcomes. Ultimately, there is no cause better than public welfare. I rely on examples taken from the Indian context, but they have general applicability. I also suspend the part of the brain that worries about data privacy for this essay. That's a separate topic.
The focus of this essay is on developing AI applications for public policy purposes. Wherever I mention leveraging data, it is with this aim. Also, the scope is not limited to data that is generated online or in a closed setting. Data is generated everywhere. Even in a remote village of India, without Internet connectivity, the village compounder generates data when he requisitions the next batch of polio vaccine. Similarly, the postman who delivers post in the village generates data about the condition of the connecting rural roads. The lone cooperative bank branch generates data about rural distress when it reports weekly account balances. Thus, we need a leap in thinking - to innovate in tapping the data, and to translate this data into AI-based (or non-AI) applications.
1. Leveraging data
Leveraging data can happen in one of two ways:
a. Capitalizing on existing data
Government generates data by carrying out its mandated functions. For example, the indirect tax department generates data on production when it collects indirect taxes. The department of education generates data when it releases funds to schools for providing mid-day meals to children. The health department generates data when it subsidizes a patient for medical procedures. The finance department generates data on subsidies provided to farmers through banks during the crop season. These are important datasets, and their current use is normal budgeting and financial reporting within the government. Researchers get limited access to the data for studies of policy effectiveness. At times, the government itself carries out analytics to better understand the efficacy and usage of funds. Some departments have also stepped up to use AI-based tools - an example is the tax department, which uses AI to select cases for investigation.
What do we miss by continuing with this approach?
We miss the chance to capture valuable data that occurs as a byproduct of something else. For example, the funds given to schools for the mid-day meal program also generate data on the number of meals provided every day, which could be a rough proxy for attendance. The subsidy on farmer loans also produces data on rural prosperity or distress at a very granular level. Railway ticketing data can be leveraged to study seasonal labor migration patterns from one state to another. The subsidy on medical procedures generates data on patients and their pre-existing conditions. Monthly indirect tax returns can be leveraged to predict industrial performance.
By looking at the data that is generated as a by-product of normal government activities, we can develop tools that can predict variables of interest for policymakers. In the above examples, one can see that predicting attendance, or rural distress, is not the primary goal of the data generation process. It is only when we identify the by-product data that we start to see the potential use and applications.
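To make the idea concrete, here is a minimal sketch of treating mid-day meal counts as a rough attendance proxy and flagging schools with low implied attendance. All school names, counts, and the 0.6 threshold are invented for illustration.

```python
# Hypothetical sketch: mid-day meal counts as a rough proxy for
# daily school attendance. All figures below are illustrative.

def attendance_proxy(meals_served: int, enrolled: int) -> float:
    """Estimate the attendance rate from meals served.

    Assumes roughly one meal per present student; the proxy is
    capped at 1.0 to absorb counting noise.
    """
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return min(meals_served / enrolled, 1.0)

# Flag schools whose implied attendance falls below a threshold.
daily_reports = [
    {"school": "A", "meals": 92, "enrolled": 100},
    {"school": "B", "meals": 43, "enrolled": 100},
    {"school": "C", "meals": 105, "enrolled": 100},  # over-count noise
]
flagged = [r["school"] for r in daily_reports
           if attendance_proxy(r["meals"], r["enrolled"]) < 0.6]
print(flagged)  # ['B']
```

A real system would smooth over holidays and reporting lags, but the core move - reading a policy variable off by-product data - is just this.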
b. Generation of new data from existing setup
Another way is to leverage the existing government apparatus to generate altogether new data. We are not talking here about deploying primary school teachers for census work or using government officers to conduct elections - that is temporary resource allocation. We are talking about generating data that can be leveraged for developing AI applications.
Imagine fitting cameras on the village postman's bicycle to capture the condition of rural roads. The data generated can be labeled, and an AI can be trained to flag bad roads. One may use cameras at toll stations to track the movement and number of cargo trucks, which can be a predictor of economic activity. Similarly, traffic cameras installed at crossroads and signals in a city for traffic monitoring can focus on the movement patterns of two-wheelers to identify stretches of road that need repairs (two-wheelers swerve around potholes). School teachers can act as labelers to train an AI tool that predicts a student's performance based on attendance and other factors. These are examples of leveraging an existing setup to generate new data, which can then be used to train algorithms for public policy purposes.
2. Combining data
Different arms of the government generate and store data in silos which, when combined, can be used to build AI solutions. Standalone, the same data might be far less effective, or even useless.
Combining datasets is not an easy task, even if we assume away bureaucratic turf wars. This needs elaboration. From a supervised-learning point of view, the data is split into X and Y: X contains the inputs (features), and Y is the target to be predicted. For example, X can be satellite images, and Y the area under cultivation. Once we have both X and Y in our data, we can start developing an AI application.
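The X/Y split can be shown with a toy version of the satellite example. The feature (fraction of green pixels per district) and the surveyed acreage figures are invented; the model is the simplest possible least-squares line through the origin.

```python
# Minimal illustration of the X/Y split for a supervised task.
# All numbers are invented for illustration.

# X: per-district feature derived from satellite images
# (fraction of green pixels); Y: surveyed area under
# cultivation in hectares.
X = [[0.12], [0.45], [0.78], [0.60]]
Y = [120.0, 450.0, 780.0, 600.0]

# With paired X and Y we can fit even the simplest model:
# a one-variable least-squares line y = a * x (no intercept).
a = sum(x[0] * y for x, y in zip(X, Y)) / sum(x[0] ** 2 for x in X)

def predict(green_frac: float) -> float:
    """Predict cultivated area (hectares) from the green fraction."""
    return a * green_frac

print(round(predict(0.5)))  # 500 under these made-up numbers
```

Real applications would use richer models, but the prerequisite is the same: X and Y must exist in one linked dataset before any fitting can start.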
However, it is likely that X sits with one arm of the government and Y with another, and it might not be easy to obtain them. Even if we obtain them, linking is not easy unless there is a common key or ID present in both datasets. For example, poor people in India avail subsidized health benefits using the Aadhar number, a unique ID. The same Aadhar number is used for delivering financial subsidies through banking channels, as bank accounts are mandatorily linked to Aadhar. But these two datasets (health and finance) lie with different departments. If we could combine the data, joining on the Aadhar number, we would have a good dataset. It could be used to train an algorithm to predict the financial health of a person (Y) given their health conditions (X). This could be a policy tool during a pandemic, when people suffer financial distress due to health emergencies.
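The join itself is mechanically simple once a shared ID exists. Below is a hedged sketch of an inner join of two departmental extracts on a common ID; the field names, IDs, and values are all hypothetical.

```python
# Hedged sketch: joining two departmental extracts on a shared ID.
# Records, field names, and IDs are hypothetical.

health = [
    {"id": "1111", "condition": "diabetes"},
    {"id": "2222", "condition": "none"},
]
finance = [
    {"id": "1111", "avg_balance": 1500.0},
    {"id": "3333", "avg_balance": 9000.0},
]

# Index one side by the key, then walk the other: an inner join.
balance_by_id = {r["id"]: r["avg_balance"] for r in finance}
joined = [
    {**h, "avg_balance": balance_by_id[h["id"]]}
    for h in health
    if h["id"] in balance_by_id
]
print(joined)
# Only "1111" appears in both extracts, so the join keeps one row.
```

The hard part in practice is not this code but getting both extracts released, cleaned, and keyed consistently in the first place.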
Thinking about X and Y in this manner, we have the following combinations:

| Row | Location of X and Y | Linking |
|-----|---------------------|-----------|
| 1 | Same department | Easy |
| 2 | Same department | Difficult |
| 3 | Different departments | Easy |
| 4 | Different departments | Difficult |
The first row above is straightforward. The data is with one department and linking is easy. Data on teacher and student attendance is maintained by the same department that deals with education, and it carries unique IDs for teachers and students. Thus, combining the data containing students' attendance (Y) with the data containing teachers' punctuality and other inputs (X) is easy.
The second row in the table above is an example where linking is difficult even though the data is within the same department. Farmers avail subsidies on seeds, fertilizers and insecticides, which are linked to the farmer's Aadhar number. At the same time, satellite images are used by the same agricultural department to predict the acreage sown for various crops. This is used to predict agricultural output and prices, and to plan market operations. If we could link the dataset of farmers with the crop predictions, we might predict the income accruing to individual farmers. This would help policymakers understand rural prosperity and distress, especially in times of natural calamity. However, linking crop acreage derived from satellite images to an individual farmer's income is not easy: there is no common identifier connecting swathes of land as seen from a satellite to the farmer's identification numbers. Efforts are currently underway to enable such linking in the future, with various government projects trying to map individual land ownership to GPS coordinates of the land.
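Once parcels carry GPS coordinates, one crude way to bridge the gap is a spatial join: snap both the parcel registry and the satellite product onto a common grid and join on the cell. The sketch below is entirely hypothetical (invented coordinates, a simplistic ~1 km grid, one representative point per parcel), but it shows the shape of the linkage.

```python
# Hypothetical sketch of linking satellite acreage to farmers via
# grid cells, once land parcels carry GPS coordinates. Coordinates
# and acreage figures are invented.

def grid_cell(lat: float, lon: float, size: float = 0.01) -> tuple:
    """Map a coordinate to a (row, col) cell of roughly 1 km."""
    return (int(lat / size), int(lon / size))

# Parcel registry: farmer ID with a representative parcel point.
parcels = [
    {"farmer": "F1", "lat": 26.9123, "lon": 75.7873},
    {"farmer": "F2", "lat": 26.9201, "lon": 75.8010},
]
# Satellite product: estimated sown acreage per grid cell.
acreage = {grid_cell(26.9123, 75.7873): 2.4,
           grid_cell(26.9201, 75.8010): 1.1}

# Join the two datasets on the grid cell.
linked = {p["farmer"]: acreage.get(grid_cell(p["lat"], p["lon"]), 0.0)
          for p in parcels}
print(linked)  # {'F1': 2.4, 'F2': 1.1}
```

A production system would use proper parcel polygons and point-in-polygon tests rather than a single point per parcel, but the principle - manufacturing a common key out of geography - is the same.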
The third row in the table was covered in an earlier example: while the data is with different departments, linking is easy because a common identifier exists in both datasets.
The fourth row deals with a case where X and Y are with different departments and cannot be linked easily. Corruption and money laundering rely on huge cash transactions, precisely to avoid leaving any online trace of illegal activities. The cash enters and exits the economy through banks. There are hundreds of cases where scrutiny of cash transactions led to the busting of money laundering gangs, and thousands of cases where the cash transactions were benign. For a decision-aid type of AI tool that helps investigators flag suspicious cases, the Xs may be inputs such as the location of the bank, the identity of the cash withdrawer, and the magnitude and frequency of transactions. The Y may be an indicator of whether a transaction was benign or a case of money laundering. However, the Xs are with the finance department and the Y is with the anti-corruption department. Moreover, when arrests are made, not all the Xs are captured by the investigation team. Thus, linking the data becomes a tedious, manual process.
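If the linking problem were solved, even a crude decision aid could be prototyped. The sketch below uses hand-set rules as a stand-in for a trained model; every threshold, weight, and record is invented for illustration, and a real tool would learn these from linked historical cases.

```python
# Illustrative decision-aid sketch: score cash transactions with
# simple hand-set rules standing in for a trained model. All
# thresholds, weights, and records are invented.

def suspicion_score(amount: float, monthly_count: int,
                    branch_risk: float) -> float:
    """Combine crude risk signals into a score between 0 and 1."""
    score = 0.0
    if amount > 900_000:        # large single cash withdrawal
        score += 0.5
    if monthly_count > 20:      # unusually frequent transactions
        score += 0.3
    score += 0.2 * branch_risk  # prior risk rating of the branch
    return min(score, 1.0)

transactions = [
    {"id": "t1", "amount": 950_000, "monthly_count": 25, "branch_risk": 1.0},
    {"id": "t2", "amount": 12_000, "monthly_count": 3, "branch_risk": 0.1},
]
flagged = [t["id"] for t in transactions
           if suspicion_score(t["amount"], t["monthly_count"],
                              t["branch_risk"]) >= 0.7]
print(flagged)  # ['t1']
```

The point of a decision aid is exactly this shape of output: a ranked shortlist for human investigators, not an automated verdict.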
The above examples illustrate general principles for thinking about leveraging and combining data. Not all suggestions will mature into AI solutions; some may fail to take off after careful examination. The idea is to show an approach one may adopt to leverage data held by the government and develop AI solutions.
Adapted from my Medium post at: https://medium.com/p/3ae3de4ebe0
Image : Photo by Michal Matlon on Unsplash