Privacy enhancing technologies: An overview of federated learning

By: Julian Templeton, Statistics Canada

Introduction

National statistical offices (NSOs) use collected data to provide insights on various topics for the public good. Despite their access to large amounts of data, there are limits on what can be collected or shared, regardless of the benefits. NSOs must ensure that personal information remains private, including in any releases that involve data. In Canada, laws mandate the protection of confidential data, and Statistics Canada must abide by them. When sensitive data is collected, not everyone will trust that it will remain private and protected, which can make some people hesitant to share their data.

To enhance data privacy throughout the public sector and to enable new opportunities for data collection, data sharing, and data use, various NSOs are actively exploring innovative ways to use and collect data privately. An emerging set of techniques currently being investigated is called privacy enhancing technologies (PETs) or privacy preserving technologies (PPTs) (see: A Brief Survey of Privacy Preserving Technologies).

There are many different types of PETs, including:

  • Federated learning (FL): Helps build machine learning (ML) models from distributed data, which stays on a client's device and is not collected. Clients use their data and devices to train local ML models, which are then collected and compiled into a central model. FL is a subset of distributed ML and is discussed in greater detail in this article.
  • Homomorphic encryption: Allows mathematical operations to be made on encrypted data to maintain privacy while the data is in use. For more information on this, see a recent DSN article – Privacy Preserving Technologies Part Two: Introduction to Homomorphic Encryption.
  • Trusted execution environments: Isolated environments, also known as secure enclaves, that can run code without the code or the data being processed becoming accessible from anywhere else.
  • Differential privacy: Adds carefully calibrated random noise to data or query results so that any single record's contribution is masked. This helps protect the data and provides plausible deniability, since a single data entry may have been modified from its original state; removing a single training sample from the training set should not noticeably change the overall results (see the sketch after this list).
  • Secure multiparty computation: Allows two or more parties to securely and jointly perform functions on their data.
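
To make the differential privacy idea above concrete, here is a minimal sketch of the classic Laplace mechanism in Python. The function name and parameters are illustrative, not taken from any specific library; sensitivity is how much one record can change the query result, and a smaller epsilon means more noise and stronger privacy.

    import numpy as np

    def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
        """Illustrative Laplace mechanism: add noise scaled to sensitivity / epsilon."""
        scale = sensitivity / epsilon
        return true_value + np.random.default_rng().laplace(0.0, scale)

    # Example: a count query changes by at most 1 when one record is removed,
    # so its sensitivity is 1. Release a noisy count instead of the exact one.
    noisy_count = laplace_mechanism(true_value=1234, sensitivity=1.0, epsilon=0.5)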

All the PETs listed above offer a unique method of enhancing privacy; however, each PET has its own drawbacks and must be selected based on the use case at hand. While no single PET is a universal solution to privacy issues, different PETs can be used in conjunction with one another to provide better overall privacy. Statistics Canada is in the research phase for PETs. It's becoming clear that widespread adoption of PETs in the public and private sectors will be required as data privacy concerns grow and as more data privacy laws are passed.

One method of exploring statistics is through ML models, which aim to learn patterns from data and produce some target output. Different NSOs are already using, or beginning to use, ML to support internal processes, ease the burden on analysts, and improve overall efficiency. A challenge with ML is that the quality of the data used is crucial to achieving a well-performing model. A common saying in the field of artificial intelligence, and in other data science fields, is "garbage in, garbage out". Fortunately, NSOs hold high-quality data that can be appropriately and ethically used to train high-quality ML models (though this article focuses solely on data privacy). However, it can be challenging to acquire quality data on sensitive topics, or legally protected data, to explore statistics in specific domains.

Of the PETs presented above, FL is the one that can generate ML models from sensitive or legally protected data, assuming the clients or collaborators agree. This article discusses FL and potential cases for its use in the public sector once more research has been conducted.

Background on federated learning

FL is a distributed learning technique that aims to build a central ML model from distributed data sources, without collecting the data. The distributed data used to train the central ML model, which a central authority holds, remains on the client devices and never leaves them. Neural networks are typically used for FL since they learn through layers of numerical weights, which are easy to aggregate and share. Within the scope of this article, a client refers to an individual or organization that holds relevant data and agrees to use it within the FL process in collaboration with the central authority. Examples of clients include crowdsourcing participants who use their laptops, phones, or tablets, and organizations holding relevant data. A central authority refers to the individual or organization (such as an NSO or a private company) responsible for holding, updating, and potentially distributing the central ML model, which is trained on client devices before the model weights are collected from clients and aggregated.

To use FL to train an ML model held and initialized by the central authority, without viewing the clients' training data, the initial model should first be trained with data stored by the central authority. Next, the central authority submits requests to a subset of the clients to train the model. Each client that can participate is sent the model and instructions for its device to perform the training. The clients then train the provided ML model using the data stored on their own devices.
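
As a minimal illustration of this client-side step, the sketch below trains a simple linear model with gradient descent on data that never leaves the device. The function and model are hypothetical stand-ins; real FL deployments typically train neural networks with a framework such as TensorFlow or PyTorch.

    import numpy as np

    def local_train(weights, X, y, lr=0.01, epochs=5):
        """One client's local update: gradient descent on a linear model.
        X and y are the client's private data and never leave the device."""
        w = weights.copy()
        for _ in range(epochs):
            preds = X @ w
            grad = X.T @ (preds - y) / len(y)  # mean squared error gradient
            w -= lr * grad
        return w  # only the updated weights are returned to the central authority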

After the models are locally trained, the client devices return only the weights of the updated ML model, not the data used for training. These weights are numbers that are adjusted during training to learn from the data. The central authority then receives the weights from the clients and aggregates them into updated weights for the central model. The result is a trained ML model held by the central authority, without collecting or learning the data held by its clients. This article will not cover all technical aspects of this process or the different options available; however, the figure below highlights the process.

Figure 1: An overview of the federated learning process

An overview of the federated learning process. (1) The central authority has a model that needs to be trained. A request for training is provided to two clients who accept the request and receive the model. (2) The client devices use their local data to train the received model on their device. (3) After training, each client sends their weights back to the central authority for processing without the data. (4) The central authority takes the updated weights and computes the aggregate to update the model. w1,1 represents the first weight for layer one and wn,n represents the nth weight for the nth layer. (5) The central authority uses the updated weights to update the central model. (6) The updated model is broadcast to the clients to be held for use or further training. This process is repeated as needed.
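
The aggregation in step (4) is commonly implemented as federated averaging (FedAvg), which weights each client's contribution by the size of its local dataset. The sketch below shows one simulated round, continuing the hypothetical local_train example from earlier; it is a simplification, not the full algorithm.

    import numpy as np

    def federated_average(client_weights, client_sizes):
        """Aggregate client weight vectors, weighted by local dataset size."""
        total = sum(client_sizes)
        return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

    # One simulated round with two clients holding synthetic private data.
    rng = np.random.default_rng(0)
    central = np.zeros(3)
    X1, y1 = rng.normal(size=(100, 3)), rng.normal(size=100)
    X2, y2 = rng.normal(size=(50, 3)), rng.normal(size=50)
    w1 = local_train(central, X1, y1)
    w2 = local_train(central, X2, y2)
    central = federated_average([w1, w2], [len(y1), len(y2)])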

Similar to other PETs, FL has libraries available so the technique can be used for research and production. However, much still needs to be implemented in these libraries before a fully robust open-source library is available for production. New libraries are being developed on an ongoing basis, but even the most prominent ones are not yet sophisticated enough to handle complex problems without strong programming skills to supplement their functionality.

Use of federated learning for organizations

Since FL allows neural networks to be trained in a distributed environment without accessing the data, projects that were once impossible can now be considered. An example is a collaborative proof-of-concept (PoC) project presented by various NSOs at the United Nations PET Lab. This PoC explored how to learn from distributed physical activity data using FL. The project distributed an open dataset on physical activity among the participating NSOs, with each NSO treating the data as private within the scope of the project. Statistics Canada aims to learn from the physical activity data of other NSOs as well as its own, by building a model from the distributed data without collecting it (mitigating the legal and privacy concerns held by each NSO). Each NSO can then use the generated model for its own statistical purposes.

This project successfully replicated a variety of FL scenarios in which the distributed data generated a model held by the central authority without any of the data being accessed or collected. Furthermore, experiments combining homomorphic encryption with FL, by encrypting a subset of the model weights to keep them private, have also been successful within the project's scope.
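
As a rough illustration of how homomorphic encryption can combine with FL, the sketch below uses the open-source TenSEAL library to sum two encrypted weight vectors, so an aggregator can combine them without seeing the plaintext. This is a simplification of the project's setup: key management is omitted, and in practice the party that decrypts would not be the aggregating server.

    import tenseal as ts

    # CKKS context for approximate arithmetic on real-valued vectors.
    context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                         coeff_mod_bit_sizes=[60, 40, 40, 60])
    context.global_scale = 2 ** 40

    # Each client encrypts its (toy) weight vector before sending it.
    enc_w1 = ts.ckks_vector(context, [0.10, -0.25, 0.40])
    enc_w2 = ts.ckks_vector(context, [0.20, -0.15, 0.30])

    # The aggregator adds ciphertexts without ever seeing the plaintext weights.
    enc_sum = enc_w1 + enc_w2

    # Only the secret key holder can decrypt; divide by the client count to average.
    avg = [x / 2 for x in enc_sum.decrypt()]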

This highlights a clear use for FL: organizations that can't typically share sensitive data can still generate models to use for statistical purposes without disclosing the data. This creates opportunities for projects in domains that are sensitive in nature or legally protected and would not otherwise happen, such as some interagency collaborations. Of course, NSOs and other government organizations are carefully researching and experimenting with PETs and their weaknesses before making any moves to operationalize the techniques. We'll get to those weaknesses later.

Another potential use for PETs within the scope of an NSO is crowdsourcing. For certain topics, participants may be reluctant to provide information regardless of the incentive. By providing a secure application or webpage where users can participate without sharing their data, fewer users may be hesitant to participate. Still, there are challenges to identify and anticipate before this can be implemented, such as possible attacks and communication strategies.

Federated learning challenges

While FL and other PETs can seem like magical tools that can tackle any major privacy issue, there are challenges that must be considered. No single PET completely mitigates all privacy risks, but it can provide additional mitigations that allow endeavors that would otherwise be impossible. The PET(s) to be used and the communication strategy that explains how a client's data is kept private are critical and will vary for each use case.

PETs are actively researched, and many attacks against them, as well as defenses, are being investigated. Traditional attacks on ML models can still be mounted against systems that use certain PETs, so the PET acts as an additional privacy measure rather than a replacement for those defenses. For example, a membership inference attack can be performed to determine whether a given record was used to train a model. Since FL aggregates the model weights collected from clients, there is a degree of added defense against some attacks, but there are scenarios where attacks can still be effective against the central model. NSOs are investigating these to help determine how to mitigate attacks and be prepared to safely operationalize PETs in the future.
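
To illustrate the kind of attack being studied, the sketch below implements a simple loss-threshold membership inference baseline: records with an unusually low loss under the model are guessed to have been training members. The data and threshold here are synthetic and illustrative; real attacks and their evaluations are considerably more sophisticated.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_members = rng.normal(size=(200, 5))        # records used for training
    y_members = (X_members.sum(axis=1) > 0).astype(int)
    X_outsiders = rng.normal(size=(200, 5))      # records never seen in training
    y_outsiders = (X_outsiders.sum(axis=1) > 0).astype(int)

    model = LogisticRegression().fit(X_members, y_members)

    def per_sample_loss(model, X, y):
        """Cross-entropy loss of each record under the trained model."""
        p = model.predict_proba(X)[np.arange(len(y)), y]
        return -np.log(np.clip(p, 1e-12, None))

    # Guess "member" when a record's loss is below the average training loss.
    threshold = per_sample_loss(model, X_members, y_members).mean()
    guessed_member = per_sample_loss(model, X_outsiders, y_outsiders) < threshold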

While there are programming libraries available for FL, not all are ready for production systems without facing challenges or resorting to paid software (which still may not include all the features a use case needs). Therefore, a major challenge for the PET community to overcome will be to continue developing open-source software that individuals and organizations can use with ease and confidence, beyond simulation settings.

The final core challenge to discuss concerns the communication strategies surrounding the use of FL. When organizations collaborate with FL, they can audit the codebase and collaborate on its development to ensure it's properly implemented and safe to use with their data. This makes FL easier to use within professional collaborations where experts are available to evaluate and develop the systems. However, in a public setting the story is quite different: each client will need to be convinced that the approach works and that their data never actually leaves their device. Given the general difficulties surrounding trust between users and organizations, this is a significant hurdle that PET communities and organizations will need to address.

Conclusions

FL is an important tool that can open up opportunities that are not otherwise possible. By generating ML models without accessing the data used for training, NSOs can provide insights to the public that would otherwise be impossible to provide. The technique is already in use by private organizations and is actively being researched. NSOs are investigating PETs with the intention of fostering collaborations at a global level that can benefit society as a whole. This research can also extend to other public organizations and allow more interagency collaborations within the public service. While there is still a lot of work to be done at Statistics Canada before FL can be operationalized, our continued research will lead to improved privacy and enable more statistics to be generated.

There are many challenges with FL that need to be overcome; however, NSOs and international PET communities will continue to collaborate and use the technique in safe and effective ways, keeping privacy at the forefront of all initiatives. Each PET will undergo analysis against known attacks so that it can be proven a strong privacy method. All of this will need to be clearly communicated to the public and to other organizations.

Meet the Data Scientist

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, December 15
2:00 to 3:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!
