Reducing data gaps for training machine learning algorithms using a generalized crowdsourcing application

By: Chatana Mandava and Nikhil Widhani, Statistics Canada

Introduction

Crowdsourcing is an online process in which a company or organization solicits contributions from a large group of people – this can be anything from ideas, content, services, to funding. This process allows companies to tap into the collective intelligence and creativity of individuals they have no connection with. It also helps companies access resources they would otherwise not have access to, such as new technology or expertise from outside their organization.

As part of Statistics Canada's modernization, crowdsourcing has emerged as a cutting-edge method of gathering important data for statistical purposes. Statistics Canada (StatCan) has implemented multiple crowdsourcing projects, including:

  • The OpenStreetMap (OSM) crowdsourcing pilot project, which crowdsourced geographic information by mapping building footprints in the Ottawa, Ontario and Gatineau, Quebec areas. This project helped launch the Building Canada 2020 initiative, which mapped all building footprints of Canada in OSM by 2020.
  • The COVID-19 crowdsourcing project, in which a public use microdata file was released containing information from crowdsourcing questionnaires that helped analyze how COVID-19 has affected Canadians' experiences with discrimination, sense of belonging, trust in institutions and access to health care services. This product is provided through StatCan's Electronic File Transfer Service (see: Crowdsourcing: Impacts of COVID-19 on Canadians' Experiences of Discrimination Public Use Microdata File).
  • The Crowdsourcing-Cannabis project, in which StatCan recently crowdsourced details of the public's most recent cannabis transactions, including the amount, quality, location, and reason for use. Respondents were also asked how often they consumed cannabis and how much they consumed on average each month (see: Crowdsourcing - Cannabis 2020). This initiative continues to collect information on a relatively new market and helps to monitor prices in a confidential and non-intrusive manner.

There's increasing demand within StatCan and other agencies to collect alternative sources of data generated from crowdsourcing. In a recent proof of concept, StatCan's Data Science Division, in collaboration with Centre of Special Business Products (CSBP) and Nutrition North Canada, developed the Indigenous Communities Food Receipts optical character recognition (OCR) project. This proof of concept collects grocery receipt images from northern communities within Canada. Key variables from these receipts, such as price, product name and subsidy, are extracted using OCR methods. In addition, the nutrition AI proof-of-concept project by StatCan's Center for Population Health Data (CPHD) explored food images to collect nutrition data such as portion sizes and calories (see: Context modelling with transformers: Food recognition). Crowdsourced data is the major component of both projects, although each collects a different kind of data. In cases like these, a generalized application helps the organization crowdsource different formats of data. Because the same application can be reused across projects, it reduces the work of building a separate application for each data collection.

These exploratory projects have inspired us to develop a generalized application and expand its use cases by crowdsourcing various unstructured data formats, such as text, PDFs, and satellite images, which can then be transformed into structured data using machine learning techniques.

Motivation and value proposition

The motivation behind investing in such an application is to give government organizations a one-stop solution that provides the minimal infrastructure required to host crowdsourcing applications. This will not only create a new stream of data collection but also allow us to investigate data diversity with unconventional solutions. The resulting pool of data will cover more use cases where data sources are limited and help our machine learning models improve in performance and scalability.

The value of developing a generalized crowdsourcing application is twofold. First, it's an efficient tool for collecting data from a large sample. This enables us to produce reliable and timelier statistics at low cost on various topics, such as population trends or economic development. Second, the application could be used to facilitate collaboration between the public and researchers by allowing them to share their knowledge and experiences with one another to generate better insights into important issues facing the country. By leveraging the collective intelligence of Canadians across all demographics, StatCan would have access to rich information that can inform policy decisions and improve public services.

Architecture

Figure 1: Data flow diagram of the crowdsourcing application

A high-level overview of the core functionalities and data flow of the application.

  1. Database, Minio Storage, User Authentication BACKEND
    1. forward to Data Analysis, Data download as CSV structured format, Machine learning extraction
  2. Data Analysis, Data download as CSV structured format, Machine learning extraction
    1. back to Database, Minio Storage, User Authentication BACKEND
    2. forward to Data Custodian 1
    3. forward to Data Custodian 2
    4. forward to Data Custodian 3
  3. Data Custodian 1
    1. back to Data Analysis, Data download as CSV structured format, Machine learning extraction
    2. forward to Crowdsourcing 1
  4. Data Custodian 2
    1. back to Data Analysis, Data download as CSV structured format, Machine learning extraction
    2. forward to Crowdsourcing 2
  5. Data Custodian 3
    1. back to Data Analysis, Data download as CSV structured format, Machine learning extraction
    2. forward to Crowdsourcing 3
  6. Crowdsourcing 1
    1. back to Data Custodian 1
    2. forward to Users
  7. Crowdsourcing 2
    1. back to Data Custodian 2
    2. forward to Users
  8. Crowdsourcing 3
    1. back to Data Custodian 3
    2. forward to Users
  9. Users
    1. back to Crowdsourcing 1
    2. back to Crowdsourcing 2
    3. back to Crowdsourcing 3

Figure 1 can be divided into three core sections:

Backend

The tables are stored in an SQLite database. SQLite is a relational database management system (RDBMS) contained in a C library. Unlike client-server database systems, it needs no separate server process and no installation or configuration before use. It stores data in tables like other RDBMSs such as MySQL and PostgreSQL, but requires less memory and disk space than these systems. SQLite works well for prototypes and for small to medium-sized applications; because it allows only one writer at a time, a client-server RDBMS is usually a better fit for large, distributed web applications with many concurrent writers. Data custodians, who own the crowdsourced data, can access it in a structured format. In addition, the application authenticates users who are administrators or developers of the application so that they can manage security and functionalities. The schema used for this project is shown in Figure 2.

Figure 2: Schema used for a generalized crowdsourcing application

There are three tables in the database. The first is the 'Users' table, which stores user data for authentication purposes. This is a temporary table used during the development phase to test authentication; in the future, it should be replaced with StatCan's Azure Active Directory. The second is the 'Crowdsourcing' table, which stores the crowdsourcing app name and the form data, which holds all questions and the user interface (UI) information, and is linked to a user with data custodian rights. Finally, the 'Answers' table stores all the submissions or answers from participating users.

  • USERS
    • User_id: BIGINT(20)
    • Is_Custodian: BOOLEAN
    • First_name: VARCHAR(50)
    • Last_name: VARCHAR(50)
    • Email_id: VARCHAR(50)
    • Account_created_on: DATETIME
  • CROWDSOURCING
    • Crowdsourcing_id: BIGINT(20)
    • Crowdsourcing_name: VARCHAR(20)
    • Form_content: JSON
    • User_id: BIGINT(20)
  • ANSWERS
    • User_id: BIGINT(20)
    • Answer: TEXT
    • created_at: DATETIME
    • updated_at: DATETIME
    • Status: TEXT
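
To make the schema concrete, here is a minimal sketch of how the three tables could be created in SQLite with Python's built-in sqlite3 module, and how a data custodian might download the answers as CSV. The column names and types come from the schema above; the primary keys, foreign keys, default values, the function names create_database and answers_as_csv, and the sample rows are assumptions added for illustration and are not taken from the project's code.

```python
import csv
import io
import json
import sqlite3

# Columns follow the schema list above. Keys and defaults are assumptions.
SCHEMA = """
CREATE TABLE users (
    user_id            BIGINT PRIMARY KEY,
    is_custodian       BOOLEAN NOT NULL DEFAULT 0,
    first_name         VARCHAR(50),
    last_name          VARCHAR(50),
    email_id           VARCHAR(50),
    account_created_on DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE crowdsourcing (
    crowdsourcing_id   BIGINT PRIMARY KEY,
    crowdsourcing_name VARCHAR(20),
    form_content       TEXT,  -- JSON document serialized as text
    user_id            BIGINT REFERENCES users(user_id)
);
CREATE TABLE answers (
    user_id    BIGINT REFERENCES users(user_id),
    answer     TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    status     TEXT
);
"""


def create_database(path=":memory:"):
    """Open (or create) the database file and create the three tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
    conn.executescript(SCHEMA)
    return conn


def answers_as_csv(conn):
    """Return every row of the answers table as CSV text with a header row."""
    cursor = conn.execute(
        "SELECT user_id, answer, created_at, updated_at, status FROM answers"
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical sample data: one custodian, one form, one submission.
conn = create_database()
conn.execute(
    "INSERT INTO users (user_id, is_custodian, first_name, email_id) VALUES (?, ?, ?, ?)",
    (1, True, "Ada", "ada@example.org"),
)
conn.execute(
    "INSERT INTO crowdsourcing VALUES (?, ?, ?, ?)",
    (1, "Crop Crowdsourcing", json.dumps({"fields": ["crop_name", "crop_image"]}), 1),
)
conn.execute(
    "INSERT INTO answers (user_id, answer, status) VALUES (?, ?, ?)",
    (1, json.dumps({"crop_name": "canola"}), "submitted"),
)
csv_text = answers_as_csv(conn)
```

Storing the form definition and each answer as JSON text keeps the same three tables usable for any crowdsourcing form, whatever fields the form contains.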

Crowdsourcing Builder

The Crowdsourcing Builder is a feature that includes existing interfaces with design templates which can be used to build crowdsourcing apps based on use cases. Data custodians can use Crowdsourcing Builder from the application itself to generate forms without writing code. These custom templates can then be hosted and configured in the application by the data custodians. The idea is to allow users to build and host many crowdsourcing pages using one common application.

Frontend

The third core section is the frontend, the interface that users interact with. It includes graphical elements such as buttons, images, menus, and forms that allow users to perform tasks within the application. The frontend also provides visual feedback to help guide users through their tasks. The goal of a well-designed frontend is to make it easy for users to understand how they can use the application and quickly accomplish their goals.

Figure 3: Crowdsourcing homepage.

The home page of the app shows all the different crowdsourcing forms that have been created on the application. This includes food, receipts, satellite, crop, PDF and test crowdsourcing.

Text in image:

All Crowdsourcings

  1. Food Crowdsourcing
  2. Receipts Crowdsourcing
  3. Satellite Crowdsourcing
  4. Crop Crowdsourcing
  5. PDF Crowdsourcing
  6. Test Crowdsourcing

Figure 4: Crop Crowdsourcing Builder page.

This page of the app allows users to build a form for collecting data using a drag and drop builder. There are various components in the list on the right side which can be dragged left and rearranged based on the format of crowdsourcing. Labels can be edited to best describe the data which will be collected using that field.
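
One way to picture what the builder produces is a small data structure that is saved as JSON in the Form_content column. The sketch below is an illustration only: the class names FormField and FormDefinition, the component kinds "text" and "file", and the method names are hypothetical and do not describe the application's actual implementation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FormField:
    name: str              # key used when a submission is stored
    kind: str              # hypothetical component kinds: "text" or "file"
    label: str             # editable text shown to the participant
    required: bool = True


@dataclass
class FormDefinition:
    title: str
    fields: list[FormField] = field(default_factory=list)

    def add(self, form_field):
        """Drop a new component at the end of the form."""
        self.fields.append(form_field)

    def move(self, old_index, new_index):
        """Rearrange a component, as dragging it within the builder would."""
        self.fields.insert(new_index, self.fields.pop(old_index))

    def relabel(self, name, label):
        """Edit the label of the component with the given name."""
        for item in self.fields:
            if item.name == name:
                item.label = label
                return
        raise KeyError(name)

    def to_json(self):
        """Serialize the form so it can be stored in Form_content."""
        return json.dumps(asdict(self))


# Build a crop form: add two components, reorder them, then edit the labels.
form = FormDefinition(title="Crop Crowdsourcing")
form.add(FormField("crop_image", "file", "File upload"))
form.add(FormField("crop_name", "text", "Text input"))
form.move(1, 0)
form.relabel("crop_name", "Crop name")
form.relabel("crop_image", "Crop image")
form_content = form.to_json()
```

Keeping the form as data rather than code is what lets a custodian create a new crowdsourcing page without a new deployment.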

Figure 5: The crowdsourcing output page.

Once the form is built using Crowdsourcing Builder, a link is generated that members of the public use to submit data when they take part in the crowdsourcing. Figure 5 shows how a crop crowdsourcing page could look to a user. The user can enter the crop name and upload a crop image before clicking submit.
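
As a rough sketch of what could happen when the user clicks submit, the function below checks a submission against a form definition before it is stored. The form layout in CROP_FORM, the function name validate_submission and the messages are assumptions made for this example; the article does not describe how the application validates submissions.

```python
# Hypothetical form definition for the crop page: one text field, one upload.
CROP_FORM = [
    {"name": "crop_name", "kind": "text", "label": "Crop name", "required": True},
    {"name": "crop_image", "kind": "file", "label": "Crop image", "required": True},
]


def validate_submission(form_fields, submission):
    """Return a list of problems; an empty list means the submission is accepted."""
    problems = []
    for spec in form_fields:
        value = submission.get(spec["name"])
        if value in (None, "", b""):
            if spec["required"]:
                problems.append(f"{spec['label']} is required")
        elif spec["kind"] == "text" and not isinstance(value, str):
            problems.append(f"{spec['label']} must be text")
        elif spec["kind"] == "file" and not isinstance(value, bytes):
            problems.append(f"{spec['label']} must be an uploaded file")
    # Reject keys that the form never asked for.
    unknown = sorted(set(submission) - {spec["name"] for spec in form_fields})
    problems.extend(f"Unexpected field: {name}" for name in unknown)
    return problems


accepted = validate_submission(
    CROP_FORM, {"crop_name": "Canola", "crop_image": b"\x89PNG..."}
)
rejected = validate_submission(CROP_FORM, {"crop_name": "", "extra": 1})
```

Because the check is driven by the form definition, the same function would work for any form a custodian builds.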

Potential challenges

  • Ensuring security: One of the biggest challenges while developing a generalized crowdsourcing application is ensuring that all user data and interactions are secure. This includes protecting users' personal information.
  • Creating an engaging UI: Building an intuitive and engaging UI is essential for any successful crowdsourcing application. Designing a UI that appeals to both new and experienced users can be difficult, so developers must ensure that the features are easy to use yet powerful and flexible enough to meet their needs.
  • Implementing quality control measures: It's important to implement quality control measures to ensure that only high-quality results are accepted. Measures include checking the data submitted by users in real time, for example image quality checks, grammar checks, sensitive data checks, and verification of uploaded file extensions. Because this generalized application collects multiple data formats, it needs a time-efficient algorithm that can run these checks and tell the user whether the upload passed.
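
Two of the quality control measures above, file extension verification and sensitive data checks, are cheap enough to run in real time. The sketch below shows one possible shape for them. The allowed extensions, the size limit, the regular expressions and the function names are assumptions chosen for illustration; image quality and grammar checks are not covered.

```python
import re
from pathlib import Path

# Assumed policy values, not taken from the application.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf", ".txt"}
MAX_FILE_BYTES = 10 * 1024 * 1024

# Leading bytes ("magic numbers") that identify common file formats.
SIGNATURES = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".jpeg": b"\xff\xd8\xff",
    ".pdf": b"%PDF-",
}

# Simple patterns for data that should not appear in free text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NINE_DIGIT_ID = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")


def check_upload(filename, content):
    """Verify the extension, the leading bytes and the size of an uploaded file."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"File type {extension or '(none)'} is not allowed")
    elif extension in SIGNATURES and not content.startswith(SIGNATURES[extension]):
        problems.append(f"File content does not match the {extension} extension")
    if len(content) > MAX_FILE_BYTES:
        problems.append("File is larger than the allowed size")
    return problems


def check_text(text):
    """Flag text that appears to contain an email address or a nine-digit identifier."""
    problems = []
    if EMAIL.search(text):
        problems.append("Text appears to contain an email address")
    if NINE_DIGIT_ID.search(text):
        problems.append("Text appears to contain a nine-digit identifier")
    return problems


good_upload = check_upload("crop.png", b"\x89PNG\r\n\x1a\n" + b"image data")
bad_upload = check_upload("crop.png", b"not really a png")
flagged_text = check_text("Contact me at farmer@example.org")
```

Each check returns a list of messages, so the application can show the user exactly why an upload was rejected.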

Conclusions

We have discussed how a single application can be built to crowdsource different types of structured and unstructured data. This will allow an organization to investigate alternative data, use innovative methods to collect data, and develop different solutions. It will also allow us to engage with the public to better understand issues at the planning or design stage of new projects. Crowdsourcing is a modern approach to collecting data from audiences who are interested in bringing about change and engaging in the process of improving statistics. By combining it with machine learning techniques, we can create new solutions that were not possible before because data was limited and costly.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about our article or would like to discuss this further, we invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, February 16
2:00 to 3:00 p.m. ET
MS Teams – link will be provided to the registrants by email

We hope to see you there!


References

Statistics Canada (2008). Crowdsourcing: Impacts of COVID-19 on Canadians' Experiences of Discrimination Public Use Microdata File (accessed January 6, 2023).

Statistics Canada (n.d.-a). Crowdsourcing – Cannabis. Last updated January 22, 2020 (accessed January 6, 2023).

Statistics Canada (n.d.-b). Statistics Canada Data Strategy. Last updated August 16, 2022 (accessed January 6, 2023).
