FAQs

For Researchers

How does Syft ensure the analysis doesn’t reveal personal or protected information?

Syft preserves data privacy by enabling data scientists to study data without ever acquiring a copy. In traditional data science flows, researchers see raw private data during the process, which they can copy or remember. Syft avoids this problem, protecting the privacy of the data, through several groundbreaking features it supports:

Mock-Data-Based Prototyping: The first reason researchers have to look at data is that they need something to use to write their research code. Syft gets around this problem by giving researchers that “something” in a way that preserves privacy — a fake/mock version of the real dataset — identical in every way except the actual values of the data are randomized.

Manual Code Review: The second reason researchers have to look at data is because they can run any computation on the data and look at the results. Research/data science is inherently iterative, and researchers really want to be able to iterate. Syft gets around the need for researchers to see the data while preserving their ability to iterate. The simplest way Syft does this is by empowering researchers to submit their code — prototyped against mock data — to the data owner’s staff for review and execution.

Automation Using Privacy-Enhancing Technologies: While the first two features above can (in theory) support any type of researcher accessing any type of dataset without seeing it, they involve researchers waiting for code reviews and data owners spending time reviewing code. Syft overcomes this problem by introducing a variety of privacy-enhancing technologies — access control, federated learning, differential privacy, zero-knowledge proofs, etc. — which enable a data owner to automatically approve some requests, giving researchers instant results and allowing data owners to avoid manual review.

What security measures does Syft implement to ensure that personal or protected information isn’t compromised?

Syft creates solid, traditionally accepted barriers between external researchers and the internal computers holding data. In its most secure setting, Syft puts an “air gap” between them, although it can also run in a similar “VPN-gapped” configuration.

In these configurations, an internal employee is responsible for moving assets across the “air gap.” Syft provides secure serialization support for objects moving across the gap and convenient triaging to ensure this process is efficient and secure.

In its highest security setting, no research code runs on an organization’s private data without its employees observing every line.

While Syft includes tooling to increase efficiency and provide some automation around the process (approvals, pre-approvals, PETs, etc.), it does so while upholding this principle.

How can I do data science without getting a copy of the data?

With PySyft, data owners expose a mock version of their data, which allows you to build and test your code without ever seeing the real, sensitive data. Once the analysis is ready, the code is submitted to the data owner for review to ensure compliance with data release policies. If approved, the code runs securely on the actual data, and only the result is shared with you.
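For concreteness, here is a minimal sketch of that flow, assuming a PySyft 0.8-style API; the server URL, credentials, and the “age” column are hypothetical placeholders, and exact names may differ between Syft versions:

```python
import syft as sy

# 1. Log in to the data owner's Datasite (hypothetical URL and credentials).
client = sy.login(
    url="https://datasite.example.org",
    email="researcher@university.edu",
    password="***",
)

# 2. Browse the available datasets and grab the mock version of an asset.
asset = client.datasets[0].assets[0]
mock_df = asset.mock  # real schema, randomized values

# 3. Prototype freely against the mock data; no real data is involved.
print(mock_df["age"].mean())  # "age" is a hypothetical column

# 4. Wrap the finished analysis as a Syft function bound to the real asset
#    and submit it for the data owner's review.
@sy.syft_function_single_use(df=asset)
def mean_age(df):
    return df["age"].mean()

client.code.request_code_execution(mean_age)

# 5. Once approved, the code runs on the real data server-side, and only
#    the result comes back.
result = client.code.mean_age(df=asset)
print(result.get())
```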

Where else is Syft deployed?

We have unlocked sensitive user data at organizations such as Microsoft and GDPR-protected data at organizations such as Dailymotion through our partnership with the Christchurch Call Initiative on Algorithmic Outcomes, where external data scientists could perform research on production AI recommender models.

We also have active projects with the US Census Bureau, the Italian National Institute of Statistics (Istat), Statistics Canada (StatCan), and the United Nations Statistics Division (UNSD) to demonstrate how joint analysis across restricted statistical data can work internationally across national statistics offices.

Additionally, we have an active project with the xD team at the US Census Bureau to help make their Title-13 and Title-26 protected data available for research.

We are also the infrastructure of choice for Reddit’s external researcher program.

Data Science

What type of data analysis can I perform?

Syft allows data scientists to use any Python library to securely analyze private data, AI models, software pipelines, or APIs by submitting Python code to run on the data owner’s server. 

This flexibility enables diverse projects, such as:

Statistics: Perform statistical analysis on private data held by another university.

AI/Machine Learning: Train an AI classifier to detect cancer on data distributed across five medical centers (see OpenMined’s case studies of this in Nature Medicine).

Social Media Analysis: Remotely perform analysis against the user log of a social media company (see OpenMined’s case studies of this with LinkedIn, DailyMotion, and Reddit).

Syft is still under active development and has the following constraints:

Language Support: Syft only supports Python for data science/API code (experimental: a data scientist can submit Dockerfiles to configure the runtime in which their Python code executes, which theoretically enables some non-Python work).

Python Library Support: Any Python library installable from an internet repo (like PyPI).

Scale: While Syft can theoretically support any scale (assuming the data owner has provisioned enough CPUs and disk space in their Syft Kubernetes cluster), we currently test Syft on datasets up to 1 TB and multi-node Kubernetes clusters.

GPU Support (Beta): Syft has experimental support for GPUs but is still under active development.
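As an illustration of that flexibility, here is a hedged sketch of the AI/Machine Learning case from the list above, again assuming a PySyft 0.8-style API; the Datasite, DataFrame asset, and “diagnosis” label column are all hypothetical:

```python
import syft as sy

client = sy.login(
    url="https://medical-datasite.example.org",  # hypothetical Datasite
    email="researcher@university.edu",
    password="***",
)
asset = client.datasets[0].assets[0]  # assumed to be a patient DataFrame

@sy.syft_function_single_use(patients=asset)
def train_classifier(patients):
    # Any PyPI-installable library can be used inside the function.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = patients.drop(columns=["diagnosis"])  # hypothetical label column
    y = patients["diagnosis"]
    model = RandomForestClassifier(n_estimators=100, random_state=0)

    # Only the aggregate score leaves the server, never the rows themselves.
    return cross_val_score(model, X, y, cv=5).mean()

client.code.request_code_execution(train_classifier)
```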

How can I ensure my analysis was done on the right assets?

If you’re asking whether a data owner might accidentally run your code on the wrong assets — Syft has you covered. When you submit your code for execution, it comes with an identifier that links it to the real data you specify (the real version of the mock data you prototyped with). With this infrastructure in place, we’re optimistic that accidental swapping of datasets is unlikely to go undetected.

However, if you’re instead worried about a malicious data owner intentionally running your computation on the wrong assets, we recommend using Syft with a zero-knowledge proof library.

Furthermore, we’re working on some extensions of Syft that would make it possible for your code to receive cryptographic evidence that your results came from your code + the right asset. It is, however, a nuanced topic — so please reach out to our team via support@openmined.org if you’d like to talk more about it.

Data science can often be very exploratory. How can this work in Syft if I cannot see or access the data?

Anything you can do when you have a copy of the data, you can do with Syft. The only difference is — in some cases — you have to wait for someone to approve your code before you see the results.

Consider the following analogy: you already have a plethora of big-data tools for exploring data that is too big to read, and those tools transfer directly when you need to analyze data that is too private to read. You can run summary statistics, ask for samples of the data, train covariate models — all the common practices you might use when exploring data. Syft is built to help you explore data and get the answers you’re looking for — in data you normally couldn’t access.
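A minimal sketch of what such exploration can look like, assuming the asset is a Pandas DataFrame and a PySyft 0.8-style API (server details are placeholders):

```python
import syft as sy

client = sy.login(
    url="https://datasite.example.org",
    email="researcher@university.edu",
    password="***",
)
asset = client.datasets[0].assets[0]

@sy.syft_function_single_use(df=asset)
def explore(df):
    # The same first moves you would make on data that is too big to read:
    # summary statistics plus a tiny sample for the data owner to vet.
    return {
        "summary": df.describe(),
        "sample": df.sample(5, random_state=0),
    }

client.code.request_code_execution(explore)
```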

What type of datasets does Syft support?

Syft supports any dataset that can be represented as a Python object (e.g., PyTorch/TensorFlow/NumPy tensor, Pandas dataframe, etc.). Syft natively supports common data science objects, and you can add more (for details on how to add new types, see Syft’s documentation).
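On the data-owner side, here is a hedged sketch of how a Python object and its mock twin might be registered, assuming a PySyft 0.8-style API (the file name and server details are hypothetical):

```python
import numpy as np
import syft as sy

owner = sy.login(
    url="https://datasite.example.org",
    email="owner@organization.org",
    password="***",
)

real = np.load("private_measurements.npy")  # hypothetical private array
mock = np.random.default_rng(0).normal(size=real.shape)  # same shape, fake values

dataset = sy.Dataset(name="measurements")
dataset.add_asset(sy.Asset(name="readings", data=real, mock=mock))
owner.upload_dataset(dataset)
```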

How do I deal with data quality issues when I can’t see the data?

With Syft, anything you can do with data that is too big to read — you can do with data that is too private to read. But what does this mean in practice — especially when you have dirty data?

As a first line of defense, before you submit queries to the real data, you first design those queries against the mock data. We encourage data owners to ensure their mock data mirrors data quality issues present in their real data — so that you (when you design your queries) can keep those defects in mind. In practice, this helps a lot.

As a second line of defense against dirty data, you can submit any query to run on the real data. If you suspect there might be some quality issues, use the same queries you’d use if the data were too big to read: perform descriptive statistics and ask to see a handful of samples your algorithm finds highly suspect.

All the tools of Python are at your disposal — the only thing that’s different between using Syft and having a copy of the data yourself is that there’s a person (or their privacy configurations) sitting between you and the real data — but you can still query just as much.

As a path of last resort, if you need to ask for too many samples of the data to help debug issues with it — you’ll find yourself partnering with the data owner on the debugging process. While this isn’t the ideal case, it turns out this is the right answer. The data owner doesn’t want dirty data either (they’re usually offering it for research for a reason!). In practice, they’ll be grateful that you found an issue and interested in working with you to rectify it.
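As a sketch of such a quality probe, assuming a Pandas DataFrame asset and a PySyft 0.8-style API (the “measurement” column and its sanity rule are hypothetical):

```python
import syft as sy

client = sy.login(
    url="https://datasite.example.org",
    email="researcher@university.edu",
    password="***",
)
asset = client.datasets[0].assets[0]

@sy.syft_function_single_use(df=asset)
def quality_report(df):
    # Null counts, dtypes, and a handful of suspect rows: the same checks
    # you would run on data that is too big to read.
    suspect = df[df["measurement"] < 0]  # hypothetical sanity rule
    return {
        "nulls": df.isna().sum(),
        "dtypes": df.dtypes.astype(str),
        "suspect_rows": suspect.head(5),
    }

client.code.request_code_execution(quality_report)
```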

How do I know if the code I submit is actually run on the real data? 

When using Syft, you prototype your code on a mock dataset that exactly mirrors the structure of the real data so that — after you submit it to the data owner for execution — the data owner can run your code unaltered.

If you’re worried about the data owner changing your code in some way, you can use Syft with zero-knowledge proofs to prevent this. We’re also working on functionality that would make it nearly impossible for a data owner to swap out your code for something else (using secure enclaves with code attestation). Since this is a nuanced topic, please reach out via support@openmined.org if you’d like to discuss it with our team.

How can I conduct an analysis that requires data joins between private datasets owned by multiple organizations?

Syft has experimental support for secure enclaves. With secure enclaves, you download mock datasets from multiple organizations, design code that joins them, and then submit that code to a secure enclave for execution. To facilitate your project, Syft orchestrates the various approvals necessary from the organizations to allow your computation, notifying you when it is complete. If you are interested in using our experimental stack, contact us at support@openmined.org.

Since Syft sends my code to the data owners, how can I study someone’s data if I need to keep my research code/data secret from the data owner?

Syft has experimental support for secure enclaves, which allow you to keep your research code/data/model secret. To do this, you would use Syft to select a cloud-hosted secure enclave and then, instead of submitting your code to the data owner directly, submit it to this enclave. Syft would then handle the approvals necessary for the data owner to upload their assets to the enclave, where your code runs on them. Because of the special properties of the enclave, your code (and their datasets) can remain secret from each other (and from the cloud-enclave provider!). Altogether, you can keep your results secret with Syft. If you are interested in using our experimental stack, contact us at support@openmined.org.

Privacy

What are PETs, and how do they work?

Privacy Enhancing Technologies (PETs) are a variety of algorithms that seek to amplify privacy protection, and we (and the internet) offer many resources for learning about them. However, it’s worth mentioning that PETs are often described poorly — as “end-all-be-all” solutions to privacy. In our opinion, PETs are actually about making privacy protection more scalable. That is to say, any data-owning organization can protect privacy by hand-reviewing all code that’s run on its servers and all results that are sent out to external researchers.

However, this is intractably expensive in many cases — and as a result, privacy is difficult to protect. With a few exceptions, privacy-enhancing technologies are actually about scaling privacy protection, more so than providing it outright. For more on our philosophy of privacy-enhancing technologies, see this academic paper or this free course. If you want to learn more about PETs on your own, look up terms like: differential privacy, federated learning, secure enclaves, zero-knowledge proofs, secure multi-party computation, homomorphic encryption.

Do I need PETs expertise to use it?

No, you only need knowledge of Python and data science. In some cases, a data owner will use PETs to automate their processes, but they will provide clear instructions for you when they offer this kind of API.

Does mock data leak private data?

No. While the mock dataset mirrors the real data as closely as possible (same/similar column types, same/similar number of rows, etc.), the values of the mock dataset are randomly generated.

Is Syft compliant with GDPR/CCPA?

Syft has unlocked sensitive GDPR and CCPA-protected user data in places like Microsoft, Reddit, and Dailymotion.

Is Syft compliant with HIPAA?

Syft can be part of your HIPAA-compliant system, particularly when using manual code review-based approaches and, in some cases, PETs.

Cost

Does Syft cost money?

No, Syft is free and open source, built by the OpenMined Foundation.

Who pays for the computation?

Since data scientists run their analysis on a server owned by the data owner, the data owner pays for computation by default. If computation is a blocker for your integration, OpenMined may be able to help – contact us for more information: support@openmined.org

What should I do if the data owner doesn’t have enough computational power to support my study?

If computation is a blocker for your integration, OpenMined may be able to help – contact us for more information: support@openmined.org

I emailed an organization that has data and they told me “No” — what do I do now?

Email us and tell us about it — perhaps we can help: support@openmined.org

For Data Owners

How do I know my data is safe in my Datasite?

Datasite servers are hosted within your internal infrastructure, ensuring that only you have access to the private data they contain. You have full control over approving any data science code that runs on the server, and only you can extract experimental results to share with the data scientist. In essence, you maintain complete control over your data and the entire process.
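As a sketch of what that control looks like in practice, assuming a PySyft 0.8-style API (attribute names here are hedged approximations):

```python
import syft as sy

owner = sy.login(
    url="https://datasite.example.org",
    email="owner@organization.org",
    password="***",
)

# Inspect every pending code-execution request before anything runs.
for request in owner.requests:
    print(request.code)  # the exact code the researcher submitted
    # request.approve()  # run on the real data only after review
    # request.deny(reason="Please aggregate before returning results.")
```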

Security

How does a researcher send code to my Datasite if only I can access it?

Taking inspiration from how the intelligence community handles IT security, your Datasite is actually made up of two data servers. They are identical in every way, except one server (the “high side”) holds the real data, while the other (the “low side”) holds a fake version of it. The researcher only has access to the “low side,” which they use to craft and submit their experimental code and to wait for their results. You (the Data Owner) are responsible for moving code/results between the high- and low-side servers.
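A heavily hedged sketch of that two-server flow, assuming the sy.sync helper available in recent PySyft releases (names and behavior may differ by version; in a true air gap, the transfer happens over removable media rather than a network call):

```python
import syft as sy

low = sy.login(url="https://low-side.example.org",
               email="owner@organization.org", password="***")
high = sy.login(url="https://high-side.internal",
                email="owner@organization.org", password="***")

# Compare the two sides and review submitted research code line by line;
# the returned diff is used to move approved items up to the high side.
diff = sy.sync(from_client=low, to_client=high)

# After running the approved code on real data, sync only the results back.
sy.sync(from_client=high, to_client=low)
```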

Privacy

Do I need PETs expertise to use it?

No! Syft only assumes that you know Python and data science. You only need more advanced PETs expertise for adding automation to the system, and only for PETs you elect to use.

Data

How do I generate mock data? Does Syft include tools for doing this?

There is a robust ecosystem of Python tools for this; the sketch below shows one common approach.
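This sketch uses the third-party Faker library and NumPy (the column names and sizes are illustrative): mirror the real schema and row count, but draw every value at random.

```python
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(0)
n_rows = 10_000  # roughly match the real dataset's size

# Same columns and types as the real data, but entirely fabricated values.
mock = pd.DataFrame({
    "name": [fake.name() for _ in range(n_rows)],
    "city": [fake.city() for _ in range(n_rows)],
    "age": rng.integers(18, 90, size=n_rows),
    "income": rng.normal(55_000, 12_000, size=n_rows).round(2),
})
```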

Cost

What should I do if I don’t have enough computational power to support a study?

If computation is a blocker for your integration, OpenMined may be able to help – contact us for more information: support@openmined.org

Let’s get started.

Next Steps