PySyft FAQs
How can I possibly do data science without getting a copy of the data?
Start using the tutorials and you’ll see. At a high level: you already work with data that is too big for you to read, and the same techniques cross over well to data that is too private for you to read.
I emailed an organization that has data and they told me “No” — what do I do now?
Email us and tell us about it — perhaps we can help: support@openmined.org
Does Syft work with GPUs?
Yes, in the medium term — GPU support is on our roadmap.
Does Syft cost money?
No. Syft is free and open source, built by the OpenMined Foundation.
Who pays for the computation?
Since data scientists run their analysis on a server owned by the data owner, the data owner pays for computation by default. If computation is a blocker for your integration, OpenMined may be able to help – contact us for more information.
Data Science Capabilities
What type of data analysis can I perform?
See previous section.
How can I conduct an analysis that requires data joins between private datasets owned by multiple organizations?
You can perform joins using Syft on a secure enclave. To do this, you’ll download mock datasets from multiple organizations, design code that uses both, and then submit that code to a secure enclave for execution. To facilitate your project, Syft will orchestrate the various approvals necessary from the organizations to allow your computation — notifying you when your computation is complete.
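In practice, the analysis you design against the mock datasets is ordinary Python. As a rough sketch of what such a submission could look like (the dataset names, columns, and `joint_analysis` function are all hypothetical — in a real project, the mock DataFrames would be downloaded via Syft, and the function would be submitted to the enclave rather than run locally):

```python
import pandas as pd

# Hypothetical mock datasets from two organizations.
# In a real Syft project, these would be downloaded as mock assets.
hospital_a = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "diagnosis": ["flu", "cold", "flu"],
})
hospital_b = pd.DataFrame({
    "patient_id": [2, 3, 4],
    "treatment": ["rest", "fluids", "rest"],
})

def joint_analysis(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.DataFrame:
    """The code you would submit to the enclave for execution."""
    # Inner join on the shared key, then count treatments per diagnosis.
    joined = df_a.merge(df_b, on="patient_id", how="inner")
    return joined.groupby("diagnosis")["treatment"].count().reset_index()

# Prototype locally against the mock data before submitting.
result = joint_analysis(hospital_a, hospital_b)
```

Once the code works against the mocks, you submit the same function for execution on the real assets.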
Since Syft sends my code to the data owners, how can I study someone’s data if I need to keep my research code/data secret from the data owner?
You can keep your research code/data/model secret using Syft and a secure enclave. To do this, you would use Syft to select a cloud-hosted secure enclave and then, instead of submitting your code to the data owner directly, submit it to this enclave. Syft would then handle the approvals necessary for the data owner to upload their assets to the enclave to run with your code. Because of the special properties of the enclave, your code and their datasets can remain secret from each other (and from the cloud-enclave provider!). Altogether, you can keep your results secret with Syft.
Do I need PETs expertise to use it?
No, you only need knowledge of Python and data science. In some cases, a data owner will use PETs to automate their processes, but they will provide clear instructions for you when they offer this kind of API.
What are PETs, and how do they work?
PETs are a family of algorithms that strengthen privacy protection, and we (and the internet) offer many resources for learning about them. However, it’s worth mentioning that PETs are often described poorly — as “end-all-be-all” solutions to privacy. In our opinion, PETs are actually about making privacy protection more **scalable.** That is to say, any data-owning organization can protect privacy by hand-reviewing all code that’s run on its servers and all results that are sent out to external researchers.
However, this is ***intractably expensive in many cases*** — and as a result, privacy is difficult to protect. With a few exceptions, privacy-enhancing technologies are about **scaling** privacy protection more than providing it outright. For more on our philosophy of privacy-enhancing technologies, see [this academic paper](https://arxiv.org/abs/2012.08347) or [this free course](https://courses.openmined.org/courses/our-privacy-opportunity). To learn more about PETs on your own, search for terms like the following:
*Differential privacy, Federated Learning, Secure Enclaves, Zero-knowledge proofs, Secure Multi-party Computation, Homomorphic Encryption*
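As a toy illustration of one of these techniques, here is a minimal differential-privacy sketch using the Laplace mechanism. This is illustrative only — it is not how Syft implements differential privacy, and the function name, data, and parameters are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (toy sketch)."""
    values = np.clip(values, lower, upper)       # bound each record's influence
    sensitivity = (upper - lower) / len(values)  # max change from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(values.mean() + noise)

ages = np.array([34.0, 29.0, 41.0, 55.0, 23.0])
private_estimate = dp_mean(ages, lower=0, upper=100, epsilon=1.0)
```

The key idea: the noise scale depends on how much any one person’s record could change the answer, so individual records are hidden while the aggregate stays useful.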
How can I be sure that the analysis output is accurate?
To use the language of structured transparency, you’re asking about “[output verification](https://arxiv.org/abs/2012.08347).” If you’re worried about how to write good data science code, we recommend working through the tutorials. However, if you’re asking, “how do I know the code I submit is actually run on the real data?” — Syft has you covered. From a computational perspective, you write the code that gets run on the real data.
When using Syft, you prototype your code on a mock dataset that is structurally identical to the real data, so that — after you submit it to a data owner for execution — the data owner can run your code unaltered.
If you’re worried about the data owner changing your code in some way, you can use Syft with zero-knowledge proofs (experimental) to prevent this. We’re also working on functionality that would make it nearly impossible for a data owner to swap out your code for something else (using secure enclaves with code attestation).
Since this is a nuanced topic, please reach out via OpenMined’s Slack if you’d like to discuss it with our team.
How can I ensure my analysis was done on the right assets?
Great question! To use the language of structured transparency, you’re asking about “[input verification](https://arxiv.org/abs/2012.08347).”
If you’re asking whether a data owner might **accidentally** run your code on the wrong assets — Syft has you covered. When you submit your code for execution, it comes with an identifier that links it to the real data you specify (the real version of the mock data you prototyped with). With this infrastructure in place, we’re optimistic that accidental swapping of datasets is unlikely to go undetected.
However, if you’re instead worried about a malicious data owner **intentionally** running your computation on the wrong assets, we recommend using Syft with a zero-knowledge proof library (experimental).
Furthermore, we’re working on some extensions of Syft that would make it possible for your code to receive cryptographic evidence that your results came from your code + the right asset. It is, however, a nuanced topic — so please reach out to our team via OpenMined’s Slack if you’d like to talk more about it.
How do I deal with data quality issues when I can’t see the data?
We get this question all the time, and our response is this: anything you can do with data that is too big to read — you can do with data that is too private to read. But what does this mean in practice — especially when you have dirty data?
As a first line of defense, before you submit queries to the real data, you first design those queries against the mock data. We encourage data owners to ensure their mock data mirrors data quality issues present in their real data — so that you (when you design your queries) can keep those defects in mind. In practice, this helps a lot.
As a second line of defense against dirty data, you can submit any query to run on the real data. If you suspect there are quality issues, use the same queries you’d use if the data were too big to read. Compute descriptive statistics and ask to see a handful of samples your algorithm finds highly suspect.
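For example, the descriptive-statistics-and-suspect-samples approach might look like the following pandas sketch (the dataset, columns, and thresholds are hypothetical; in a real project, this code would be prototyped on the mock data and then submitted to run on the real data):

```python
import pandas as pd

# Hypothetical dataset with the kinds of defects real data often has.
df = pd.DataFrame({
    "age": [34, -1, 41, 250, 29],            # -1 and 250 look like entry errors
    "income": [52000, 48000, None, 61000, 58000],
})

# Descriptive statistics: the same first pass you'd run on too-big data.
summary = df.describe(include="all")

# Flag the rows your suspicion heuristics would ask the data owner about.
suspect = df[(df["age"] < 0) | (df["age"] > 120) | df["income"].isna()]
```

You would then request to see (or summarize) just the `suspect` rows, rather than the whole dataset.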
All the tools of Python are at your disposal — the only thing that’s different between using Syft and having a copy of the data yourself is that there’s a person (or their privacy configurations) sitting between you and the real data — but you can still query just as much.
As a path of last resort, if you need to ask for too many samples of the data to help debug issues with it — you’ll find yourself partnering with the data manager working for the data owner on the debugging process. While this isn’t the ideal case, it turns out this is the right answer. The data owner also doesn’t want dirty data (they’re usually offering it for research for a reason!) In practice, they’ll be grateful that you found an issue and interested in working with you to rectify it.
Data science can often be very exploratory. How can this work in Syft if I cannot see or access the data?
Anything you can do when you have a copy of the data, you can do with Syft. The only difference is — in the worst case — you have to wait for someone to approve your code before you see the results.
Consider the following analogy: You have a plethora of big-data tools you use to explore data that is too big for you to read, and those tools will also transfer when you need to analyze data that is too private to read. You can run summary statistics, you can ask for samples of the data, you can train covariate models — and all the other common practices you might be interested in doing when exploring the data. Syft is built to help you explore data and get the answers you’re looking for — in data you normally couldn’t access.
What should I do if the data owner doesn’t have enough computational power to support my study?
If you are working with a data owner deployed in the cloud, you should be able to use Syft to get contact details for your data owner — and request a specific pool of resources spun up for your research. If the data owner does not have that capacity, OpenMined may be able to help — contact us for more information.
Privacy
How does Syft ensure the analysis doesn’t reveal personal or protected information?
See the section above. Syft follows the paradigm of Structured Transparency, which allows the use of various PETs before, during, and after computation to facilitate data privacy and security throughout its lifecycle.
Do I need PETs expertise to use it?
No! Syft only assumes that you know Python and data science, which is all that is required for low-scale use cases (see the first two bullets above). You only need more advanced PETs expertise for adding automation to the system, and only for PETs you elect to use.
Is Syft compliant with GDPR/CCPA?
Compliance depends on your specific deployment and policies, but Syft has been used to unlock sensitive GDPR- and CCPA-protected user data at organizations such as Microsoft, Reddit, and Dailymotion.
Is Syft compliant with HIPAA?
Syft can be part of your HIPAA-compliant system, particularly when using manual code review-based approaches and, in some cases, PETs.
Does mock data leak private data?
No. While the mock dataset mirrors the real data as closely as possible (same/similar column types, same/similar number of rows, etc.), the values of the mock dataset are randomly generated.
How do I generate mock data? Does Syft include tools for doing this?
There is a robust ecosystem of Python tools for this. Here’s how to use them <link>.
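As one minimal approach, you can generate mock data with only NumPy and pandas. The schema below is hypothetical — match it to your real dataset’s column names, dtypes, and row count:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1_000  # match (or approximate) the real dataset's row count

# Hypothetical schema: mirror the real data's columns and dtypes,
# but fill every column with randomly generated values.
mock = pd.DataFrame({
    "patient_id": np.arange(n_rows),
    "age": rng.integers(0, 100, size=n_rows),
    "diagnosis": rng.choice(["flu", "cold", "covid"], size=n_rows),
    "visit_cost": rng.normal(loc=250.0, scale=75.0, size=n_rows).round(2),
})
```

Dedicated libraries (e.g., schema-driven fake-data generators) can take this further, including mirroring realistic value distributions and data-quality defects.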
What type of datasets does Syft support?
Syft supports any dataset that can be represented as a Python object (e.g., a PyTorch/TensorFlow/NumPy tensor or a pandas DataFrame). Syft natively supports common data science objects, and you can add more — for details, see Syft’s documentation on adding types to our secure serialization process.
Can I use Spark in Syft? How can I work with very large amounts of data?
You cannot use Spark, but Syft has a similar system for running long-running jobs across a (Kubernetes) cluster of machines.
How do I deal with data quality issues when I can’t see the data?
We get this question all the time, and our response is this: anything you can do with data that is too big to read — you can do with data that is too private to read. But what does this mean in practice — especially when you have dirty data?
As a first line of defense, before you submit queries to the real data, you first design those queries against the mock data. We encourage data owners to ensure their mock data mirrors data quality issues present in their real data — so that you (when you design your queries) can keep those defects in mind. In practice, this helps a lot.
As a primary line of defense against dirty data, you can submit any query to run on the real data. If you’re suspicious there might be some quality issues, use the same queries you’d use if the data was too big to read. Perform descriptive statistics and ask to see a handful of samples your algorithm finds highly suspect.
All the tools of Python are at your disposal — the only thing that’s different between using Syft and having a copy of the data yourself is that there’s a person (or their privacy configurations) sitting between you and the real data — but you can still query just as much.
As a path of last resort, if you need to ask for too many samples of the data to help debug issues with it — you’ll find yourself partnering with the data manager working for the data owner on the debugging process. While this isn’t the ideal case, it turns out this is the right answer. The data owner also doesn’t want dirty data (they’re usually offering it for research for a reason!) In practice, they’ll be grateful that you found an issue and interested in working with you to rectify it.
Data science can often be very exploratory. How can this work in Syft if I cannot see or access the data?
Anything you can do when you have a copy of the data, you can do with Syft. The only difference is — in the worst case — you have to wait for someone to approve your code before you see the results.
Consider the following analogy: You have a plethora of big-data tools you use to explore data that is too big for you to read, and those tools will also transfer when you need to analyze data that is too private to read. You can run summary statistics, you can ask for samples of the data, you can train covariate models — and all the other common practices you might be interested in doing when exploring the data. Syft is built to help you explore data and get the answers you’re looking for — in data you normally couldn’t access.
What should I do if the data owner doesn’t have enough computational power to support my study?
If you are working with a data owner deployed in the cloud, you should be able to use Syft to get contact details for your data owner — and request a specific pool of resources spun up for your research. If the data owner does not have that capacity, OpenMined may be able to help — contact us for more information.
Security
What security measures does Syft implement to ensure that personal or protected information isn’t compromised?
See previous section.
Credibility
Where else is Syft deployed?
Through our partnership with the Christchurch Call Initiative on Algorithmic Outcomes, we have unlocked sensitive user data at organizations such as Microsoft and GDPR-protected data at organizations such as Dailymotion, enabling external data scientists to perform research on production AI recommender models.
We also have active projects with the US Census Bureau, the Italian National Institute of Statistics (Istat), Statistics Canada (StatCan), and the United Nations Statistics Division (UNSD) to demonstrate how joint analysis across restricted statistical data can work internationally across national statistics offices.
Additionally, we have an active project with the xD team at the US Census Bureau to help make their Title-13 and Title-26 protected data available for research.
We are also the infrastructure of choice for Reddit’s external researcher program.
