The Data Scientist Role: Submitting a Job to a Remote Network
In Part 1, we successfully simulated the entire federated learning workflow on our local machine. We saw how Data Owners (DOs) could set up their datasites and how a Data Scientist (DS) could submit a job to train a model across them. This local run was crucial for understanding the mechanics of the process. Now, we’ll take the next logical step: moving from simulation to a real, distributed network. In this part, you will act as the Data Scientist, but instead of connecting to local datasites, you will submit your training job to remote data owners running on the SyftBox network. The exciting part? The code and workflow remain almost exactly the same, showcasing the power of the syft_flwr abstraction layer.
If you are already a Federated Learning practitioner, consider our Federated Learning Co-Design Program. You will get direct support from the OpenMined team to build production-ready federated learning solutions.
Step 1: Switching to Remote Mode
The only change we need to make in our Data Scientist notebook (ds.ipynb) is to switch off local testing. This tells SyftBox to stop simulating the network locally and instead connect to real, remote datasites.
In your ds.ipynb notebook, find the cell where LOCAL_TEST is defined and set it to False. Ensure your SyftBox client is running in your terminal, as it handles the communication between datasites.
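A quick way to see which identity your submissions will come from is to load your local SyftBox configuration. This is only a sketch of a sanity check: Client.load() reads the local config written by the SyftBox client, so a failure here usually means the client has not been set up on this machine.

from syft_core import Client

# Read the local SyftBox configuration written by the SyftBox client.
# Note: this checks that the configuration exists and is readable; it does
# not by itself confirm that the background client process is still running.
client = Client.load()
print("Submitting as:", client.email)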
Step 2: Connecting to Remote Data Owners
With LOCAL_TEST now set to False, our connection code points to the public identifiers of the remote datasites. Part 1 established a pattern of using email-style addresses as unique identifiers, and we continue that here. For this tutorial, we will connect to two datasites hosted by OpenMined with the emails flower-test-group-1@openmined.org and flower-test-group-2@openmined.org.
from pathlib import Path

import syft_rds as sy
from syft_core import Client
from syft_rds.orchestra import setup_rds_server  # used for the local simulation in Part 1; unused in remote mode

# This flag is the only change needed to go from local to remote
LOCAL_TEST = False

# Your own identity on the SyftBox network
DS = Client.load().email

# We now use the public identifiers of the remote datasites
DO1 = "flower-test-group-1@openmined.org"
DO2 = "flower-test-group-2@openmined.org"

# Initialize sessions with the remote hosts
do_client_1 = sy.init_session(host=DO1)
print("Logged into: ", do_client_1.host)

do_client_2 = sy.init_session(host=DO2)
print("Logged into: ", do_client_2.host)

do_clients = [do_client_1, do_client_2]
do_emails = [DO1, DO2]
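With both sessions open, a natural next check is to see which datasets each remote Data Owner exposes. The snippet below is a hedged sketch that assumes the syft_rds dataset accessors used in Part 1 (dataset.get_all() and describe()); the exact method names may differ in your syft_rds version.

# List the datasets published by each remote Data Owner.
# dataset.get_all() and describe() are assumed from the Part 1 notebook;
# adjust the accessor names if your syft_rds version differs.
for do_client in do_clients:
    print(f"Datasets hosted by {do_client.host}:")
    for dataset in do_client.dataset.get_all():
        dataset.describe()  # prints the dataset's metadata and mock/private structure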
Step 3: The Workflow Stays the Same
This is the most powerful takeaway of the syft_flwr framework. From this point on, the Data Scientist’s workflow is identical to the one we followed in the local simulation. You would still take the same steps (sketched in code after this list):
- Explore the mock datasets from the DOs to understand their structure.
- Bootstrap the syft_flwr project to configure it for the remote participants.
- (Optional) Run simulations using syft_flwr.run() to ensure the code is correct before submission.
- Submit the job to the remote datasites.
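For reference, here is a hedged sketch of those steps as notebook cells. Only syft_flwr.run() is named in this post; the project path, the bootstrap call, and the jobs.submit() parameters are assumptions modeled on the Part 1 notebook, so check your versions of syft_flwr and syft_rds for the exact names.

from pathlib import Path

import syft_flwr

# Hypothetical path to the Flower project you prepared in Part 1.
FLWR_PROJECT_PATH = Path("./fl-project")

# Configure the project for the remote participants
# (assumed signature, mirroring the bootstrap step from Part 1).
syft_flwr.bootstrap(FLWR_PROJECT_PATH, aggregator=DS, datasites=do_emails)

# (Optional) Dry-run the project against mock data to catch errors before submission.
syft_flwr.run(FLWR_PROJECT_PATH)

# Submit the job to each remote datasite (assumed syft_rds job API and parameters).
for do_client in do_clients:
    job = do_client.jobs.submit(
        name="syft-flwr-training",          # hypothetical job name
        user_code_path=FLWR_PROJECT_PATH,   # assumed parameter name
        entrypoint="main.py",               # assumed: the Flower project's entry script
    )
    print("Submitted job to", do_client.host)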
When you run the job submission code, your syft_flwr project is securely sent across the internet to the remote datasites. On the other end, the Data Owners would receive a notification and follow the same review and approval process we saw in Step 7 of Part 1. For this tutorial, the remote datasites are configured to automatically approve the job, allowing training to begin immediately.
What We Have Accomplished in Part 2
You have now successfully acted as a Data Scientist in a real-world federated learning scenario. You submitted a job to remote participants to train a federated model without ever having direct access to their infrastructure or their private data. The syft_flwr framework, leveraging the SyftBox network, handles the complex networking, security, and orchestration, allowing you to focus on the machine learning task.
But how are these public, remote datasites created in the first place? In our final part, we will switch hats one last time to become a Data Owner and learn how to set up our own persistent, public datasite.
Skip Ahead and Start Building for Production?
We invite data scientists, researchers, and engineers working on production federated learning use cases to check out and apply to our Federated Learning Co-Design Program (no commitment required).
Have questions?
- Join the conversation in our Slack Community
- Already in the OpenMined workspace? Join the #community-federated-learning channel