Lomas Client Side: Using Smartnoise-Synth
This notebook showcases how a researcher could use the Secure Data Disclosure system. It explains the different functionalities provided by the lomas-client library to interact with the secure server.
The secure data are never visible to researchers. They can only access differentially private responses via queries to the server.
Each user has access to one or multiple projects and, for each dataset, has a limited budget expressed as \(\epsilon\) and \(\delta\) values.
Step 1: Install the library
To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the lomas-client library in her local development environment.
It can be installed via the pip command:
[ ]:
# !pip install lomas_client
Alternatively, she can use a local version of the client:
[2]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
[3]:
from lomas_client import Client
import numpy as np
Step 2: Initialise the client
Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server.
To create the client, Dr. Antartica needs to give it a few parameters:
- url: the root application endpoint of the remote secure server.
- user_name: her name as registered in the database (Dr. Antartica).
- dataset_name: the name of the dataset that she wants to query (PENGUIN).
She will only be able to query the real dataset if Queen Icergina has previously created an account for her in the database, given her access to the PENGUIN dataset, and granted her some epsilon and delta credit (as is done in the Admin Notebook for Users and Datasets management).
[4]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)
And that’s it for the preparation. She is now ready to use the various functionalities offered by lomas-client.
Step 3: Metadata and dummy dataset
Getting dataset metadata
Dr. Antartica has never seen the data and, as a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she just needs to call the get_dataset_metadata() function of the client. As this is public information, this does not cost any budget.
This function returns metadata in a format based on the Smartnoise-SQL dictionary format, which includes, among other things, information about all the available columns, their types and their bound values (see the Smartnoise page for more details). Any metadata that is required for Smartnoise-SQL is also required here, and additional information, such as the different categories in a string-type column, can be added.
[5]:
penguin_metadata = client.get_dataset_metadata()
penguin_metadata
[5]:
{'max_ids': 1,
'row_privacy': True,
'censor_dims': False,
'columns': {'species': {'type': 'string',
'cardinality': 3,
'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
'island': {'type': 'string',
'cardinality': 3,
'categories': ['Torgersen', 'Biscoe', 'Dream']},
'bill_length_mm': {'type': 'float', 'lower': 30.0, 'upper': 65.0},
'bill_depth_mm': {'type': 'float', 'lower': 13.0, 'upper': 23.0},
'flipper_length_mm': {'type': 'float', 'lower': 150.0, 'upper': 250.0},
'body_mass_g': {'type': 'float', 'lower': 2000.0, 'upper': 7000.0},
'sex': {'type': 'string',
'cardinality': 2,
'categories': ['MALE', 'FEMALE']}},
'rows': 344}
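Since the metadata lists each column with its type, it can be used to separate categorical columns from continuous ones before choosing what to pass in select_cols. The following is a minimal sketch that works on a plain dictionary shaped like the output above (trimmed to the fields it needs). The helper name split_columns_by_type and the variable example_metadata are hypothetical and not part of lomas-client.

```python
# Metadata shaped like the output above, trimmed to the fields used here.
example_metadata = {
    "columns": {
        "species": {"type": "string", "cardinality": 3},
        "island": {"type": "string", "cardinality": 3},
        "bill_length_mm": {"type": "float", "lower": 30.0, "upper": 65.0},
        "bill_depth_mm": {"type": "float", "lower": 13.0, "upper": 23.0},
        "flipper_length_mm": {"type": "float", "lower": 150.0, "upper": 250.0},
        "body_mass_g": {"type": "float", "lower": 2000.0, "upper": 7000.0},
        "sex": {"type": "string", "cardinality": 2},
    },
    "rows": 344,
}


def split_columns_by_type(metadata):
    """Hypothetical helper: return (categorical, continuous) column names."""
    categorical, continuous = [], []
    for name, info in metadata["columns"].items():
        if info["type"] == "string":
            categorical.append(name)
        elif info["type"] in ("float", "int"):
            continuous.append(name)
    return categorical, continuous


categorical_cols, continuous_cols = split_columns_by_type(example_metadata)
print(categorical_cols)  # ['species', 'island', 'sex']
print(continuous_cols)
```

With the real client, the same helper could be applied to the dictionary returned by get_dataset_metadata().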
Step 4: Create a Synthetic Dataset keeping all default parameters
We want to get a synthetic model to represent the private data.
Therefore, we use a Smartnoise-Synth synthesizer.
Let’s list the potential options. Their respective parameters are described in the Smartnoise-Synth documentation.
[8]:
from snsynth import Synthesizer
Synthesizer.list_synthesizers()
[8]:
['mwem', 'dpctgan', 'patectgan', 'mst', 'pacsynth', 'dpgan', 'pategan', 'aim']
AIM: Adaptive Iterative Mechanism
We start by executing a query on the dummy dataset without specifying any special parameters for AIM (all optional parameters are kept at their defaults). AIM only works on categorical columns, so we select the “species” and “island” columns to create a synthetic dataset of these two columns.
[9]:
res_dummy = client.smartnoise_synth_query(
synth_name="aim",
epsilon=1.0,
delta=0.0001,
select_cols = ["species", "island"],
dummy=True,
)
res_dummy['query_response'].head()
[9]:
species | island | |
---|---|---|
0 | Adelie | Dream |
1 | Gentoo | Torgersen |
2 | Adelie | Dream |
3 | Adelie | Dream |
4 | Gentoo | Torgersen |
The algorithm works and returned a synthetic dataset. We now estimate the cost of running this command:
[10]:
res_cost = client.estimate_smartnoise_synth_cost(
synth_name="aim",
epsilon=1.0,
delta=0.0001,
select_cols = ["species", "island"],
)
res_cost
[10]:
{'epsilon_cost': 1.0, 'delta_cost': 0.0001}
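Before spending budget on the real dataset, the estimated cost can be compared with what is left. The sketch below does this on plain dictionaries: example_cost copies the output above, while remaining_budget and the helper fits_in_budget are hypothetical and only serve as an illustration (how the remaining budget is obtained is not covered here).

```python
def fits_in_budget(cost, remaining):
    """Return True if both the epsilon and delta costs fit in the remaining budget."""
    return (
        cost["epsilon_cost"] <= remaining["epsilon"]
        and cost["delta_cost"] <= remaining["delta"]
    )


# Cost copied from the output above.
example_cost = {"epsilon_cost": 1.0, "delta_cost": 0.0001}

# Hypothetical remaining budget, for illustration only.
remaining_budget = {"epsilon": 10.0, "delta": 0.005}

print(fits_in_budget(example_cost, remaining_budget))  # True
```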
Executing such a query on the private dataset would cost 1.0 epsilon and 0.0001 delta. Dr. Antartica decides to run it, now with the flag dummy set to False and specifying that she wants the AIM synthesizer model in return (with return_model = True).
NOTE: if she does not set the parameter return_model = True, it is False by default and she will directly get a synthetic dataframe as the response.
[11]:
res = client.smartnoise_synth_query(
synth_name="aim",
epsilon=1.0,
delta=0.0001,
select_cols = ["species", "island"],
dummy=False,
return_model = True
)
res['query_response']
/usr/local/lib/python3.11/site-packages/mbi/__init__.py:15: UserWarning: MixtureInference disabled, please install jax and jaxlib
warnings.warn('MixtureInference disabled, please install jax and jaxlib')
[11]:
<snsynth.aim.aim.AIMSynthesizer at 0x75edc6e1ffd0>
She can now use the model to draw samples. She chooses to draw 10 samples.
[12]:
synth = res['query_response']
synth.sample(10)
[12]:
species | island | |
---|---|---|
0 | Gentoo | Biscoe |
1 | Gentoo | Dream |
2 | Gentoo | Dream |
3 | Adelie | Torgersen |
4 | Chinstrap | Dream |
5 | Adelie | Dream |
6 | Adelie | Torgersen |
7 | Chinstrap | Biscoe |
8 | Chinstrap | Torgersen |
9 | Chinstrap | Torgersen |
She now wants to pass specific parameters to the AIM model. To do so, she sets them in synth_params, based on the Smartnoise-Synth documentation. She decides to change max_model_size to 50 (the default is 80) and tries it on the dummy dataset.
[13]:
res_dummy = client.smartnoise_synth_query(
synth_name="aim",
epsilon=1.0,
delta=0.0001,
select_cols = ["species", "island"],
dummy=True,
return_model = True,
synth_params = {"max_model_size": 50}
)
res_dummy['query_response']
[13]:
<snsynth.aim.aim.AIMSynthesizer at 0x75edc6883650>
[14]:
synth = res_dummy['query_response']
synth.sample(5)
[14]:
species | island | |
---|---|---|
0 | Gentoo | Dream |
1 | Adelie | None |
2 | Adelie | Biscoe |
3 | Gentoo | Torgersen |
4 | Chinstrap | Torgersen |
Now that the workflow is understood for AIM, she wants to experiment with various synthesizers on the dummy dataset.
MWEM: Multiplicative Weights Exponential Mechanism
She tries MWEM on all columns with all default parameters. As return_model is not specified, she will directly receive a synthetic dataframe back.
[15]:
res_dummy = client.smartnoise_synth_query(
synth_name="mwem",
epsilon=1.0,
dummy=True,
)
res_dummy['query_response'].head()
[15]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Gentoo | Torgersen | 42.25 | 22.5 | 155.0 | 2250.0 | FEMALE |
1 | Chinstrap | Biscoe | 63.25 | 13.5 | 235.0 | 5250.0 | FEMALE |
2 | Gentoo | Torgersen | 42.25 | 22.5 | 155.0 | 2750.0 | FEMALE |
3 | Chinstrap | Torgersen | 63.25 | 13.5 | 235.0 | 5250.0 | FEMALE |
4 | Chinstrap | Dream | 63.25 | 13.5 | 235.0 | 5250.0 | FEMALE |
She now selects 3 columns and sets some parameters described in the Smartnoise-Synth documentation.
[16]:
res_dummy = client.smartnoise_synth_query(
synth_name="mwem",
epsilon=1.0,
select_cols = ["species", "island", "sex"],
synth_params = {"measure_only": False, "max_retries_exp_mechanism": 5},
dummy=True,
)
res_dummy['query_response'].head()
[16]:
species | island | sex | |
---|---|---|---|
0 | Chinstrap | Torgersen | FEMALE |
1 | Gentoo | Torgersen | FEMALE |
2 | Gentoo | Torgersen | FEMALE |
3 | Gentoo | Torgersen | FEMALE |
4 | Chinstrap | Biscoe | FEMALE |
Finally, with MWEM, she wants to go more in depth and create her own data preparation pipeline. To do so, she can use the Smartnoise-Synth “Data Transformers” and send her own constraints dictionary for specific steps. This is intended for advanced users.
By default, if no constraints are specified, the server automatically creates a data transformer based on the selected columns, the synthesizer and the metadata.
Here she wants to add a clamping transformation on the continuous columns before training the synthesizer. She sets the bounds based on the metadata.
[18]:
bl_bounds = penguin_metadata["columns"]["bill_length_mm"]
bd_bounds = penguin_metadata["columns"]["bill_depth_mm"]
bl_bounds, bd_bounds
[18]:
({'type': 'float', 'lower': 30.0, 'upper': 65.0},
{'type': 'float', 'lower': 13.0, 'upper': 23.0})
[19]:
from snsynth.transform import BinTransformer, ClampTransformer, ChainTransformer, LabelTransformer
my_own_constraints = {
"bill_length_mm": ChainTransformer(
[
ClampTransformer(lower = bl_bounds["lower"] + 10, upper = bl_bounds["upper"] - 10),
BinTransformer(bins = 20, lower = bl_bounds["lower"] + 10, upper = bl_bounds["upper"] - 10),
]
),
"bill_depth_mm": ChainTransformer(
[
ClampTransformer(lower = bd_bounds["lower"] + 2, upper = bd_bounds["upper"] - 2),
BinTransformer(bins=20, lower = bd_bounds["lower"] + 2, upper = bd_bounds["upper"] - 2),
]
),
"species": LabelTransformer(nullable=True)
}
[20]:
res_dummy = client.smartnoise_synth_query(
synth_name="mwem",
epsilon=1.0,
select_cols = ["bill_length_mm", "bill_depth_mm", "species"],
constraints = my_own_constraints,
dummy=True,
)
res_dummy['query_response'].head()
[20]:
bill_length_mm | bill_depth_mm | species | |
---|---|---|---|
0 | 48.625 | 16.05 | Chinstrap |
1 | 51.625 | 18.75 | Gentoo |
2 | 51.625 | 18.75 | Gentoo |
3 | 51.625 | 18.75 | Gentoo |
4 | 44.875 | 15.15 | Adelie |
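To see why the continuous columns above only take a few distinct values such as 48.625, it helps to look at what clamping followed by binning does to a single value. The sketch below is plain Python and is not the snsynth implementation: the function clamp_and_bin is hypothetical, and returning the midpoint of the bin is an assumption that happens to be consistent with the values in the table above.

```python
def clamp_and_bin(value, lower, upper, bins):
    """Clamp value to [lower, upper], then return the midpoint of its bin."""
    clamped = min(max(value, lower), upper)
    width = (upper - lower) / bins
    # The upper bound itself is assigned to the last bin.
    index = min(int((clamped - lower) / width), bins - 1)
    return lower + (index + 0.5) * width


# Bounds used for bill_length_mm above: metadata bounds 30..65, tightened by 10.
lower, upper, bins = 40.0, 55.0, 20

print(clamp_and_bin(48.7, lower, upper, bins))  # 48.625
print(clamp_and_bin(70.0, lower, upper, bins))  # 54.625 (clamped to 55 first)
print(clamp_and_bin(30.0, lower, upper, bins))  # 40.375 (clamped to 40 first)
```

With 20 bins between 40 and 55, each bin is 0.75 wide, so every output is of the form 40 + 0.75 * (k + 0.5).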
Constraints can also be specified for a subset of the columns only; the server will then automatically generate the transformers for the remaining columns.
[21]:
my_own_constraints = {
"bill_length_mm": ChainTransformer(
[
ClampTransformer(lower = bl_bounds["lower"] + 10, upper = bl_bounds["upper"] - 10),
BinTransformer(bins = 20, lower = bl_bounds["lower"] + 10, upper = bl_bounds["upper"] - 10),
]
)
}
In this case, her own clamping and binning apply only to bill_length_mm.
[22]:
res_dummy = client.smartnoise_synth_query(
synth_name="mwem",
epsilon=1.0,
select_cols = ["bill_length_mm", "bill_depth_mm", "species"],
constraints = my_own_constraints,
dummy=True,
)
res_dummy['query_response'].head()
[22]:
bill_length_mm | bill_depth_mm | species | |
---|---|---|---|
0 | 41.125 | 20.5 | Chinstrap |
1 | 53.875 | 20.5 | Adelie |
2 | 49.375 | 21.5 | Chinstrap |
3 | 53.875 | 20.5 | Adelie |
4 | 41.125 | 22.5 | Chinstrap |
MST: Maximum Spanning Tree
She now experiments with MST. As this synthesizer is computationally expensive, she selects a subset of columns for it. See the MST documentation for details.
[23]:
res_dummy = client.smartnoise_synth_query(
synth_name="mst",
epsilon=1.0,
select_cols = ["species", "sex"],
dummy=True,
)
res_dummy['query_response'].head()
[23]:
species | sex | |
---|---|---|
0 | | FEMALE |
1 | Adelie | FEMALE |
2 | Gentoo | MALE |
3 | Gentoo | MALE |
4 | Adelie | |
She can also specify a specific number of samples to get (if return_model is not True):
[24]:
res_dummy = client.smartnoise_synth_query(
synth_name="mst",
epsilon=1.0,
select_cols = ["species", "sex"],
nb_samples = 4,
dummy=True,
)
res_dummy['query_response']
[24]:
species | sex | |
---|---|---|
0 | Adelie | FEMALE |
1 | Chinstrap | |
2 | Adelie | MALE |
3 | Adelie | FEMALE |
She can also add a condition on these samples. For instance, here, she only wants female samples.
[25]:
res_dummy = client.smartnoise_synth_query(
synth_name="mst",
epsilon=1.0,
select_cols = ["sex", "species"],
nb_samples = 4,
condition = "sex = FEMALE",
dummy=True,
)
res_dummy['query_response']
[25]:
sex | species | |
---|---|---|
0 | | Gentoo |
1 | | Gentoo |
2 | | Chinstrap |
3 | | Gentoo |
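One common way to obtain samples that satisfy a condition is rejection sampling: draw rows from the trained model and keep only those that match. The sketch below illustrates the idea with the standard library only. It is an assumption that the server proceeds in a similar way; toy_sampler is a hypothetical stand-in for a trained synthesizer, and sample_with_condition is not part of lomas-client or Smartnoise-Synth.

```python
import random


def sample_with_condition(sampler, predicate, nb_samples, max_tries=100):
    """Draw rows from sampler() and keep those satisfying predicate.

    Stops once nb_samples rows are kept or after max_tries draws.
    """
    kept = []
    for _ in range(max_tries):
        if len(kept) >= nb_samples:
            break
        row = sampler()
        if predicate(row):
            kept.append(row)
    return kept


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def toy_sampler():
    """Hypothetical stand-in for a trained synthesizer: draws one random row."""
    return {
        "sex": rng.choice(["MALE", "FEMALE"]),
        "species": rng.choice(["Adelie", "Chinstrap", "Gentoo"]),
    }


females = sample_with_condition(toy_sampler, lambda row: row["sex"] == "FEMALE", 4)
print(females)
```

Because rows that fail the condition are discarded, a very restrictive condition may return fewer rows than requested when the draw limit is reached.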
DPCTGAN: Differentially Private Conditional Tabular GAN
She now tries DPCTGAN. A first warning lets her know that the random noise generation for this model is not cryptographically secure; if this is not acceptable to her, she can decide to stop using this synthesizer. Then, instead of a response, she gets an error 422 with an explanation.
[26]:
res_dummy = client.smartnoise_synth_query(
synth_name="dpctgan",
epsilon=1.0,
dummy=True,
)
res_dummy
/code/lomas_client/utils.py:37: UserWarning: Warning:dpctgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).
warnings.warn(
Server error status 422: {"ExternalLibraryException":"Error fitting model: sample_rate=5.0 is not a valid value. Please provide a float between 0 and 1. Try decreasing batch_size in synth_params (default batch_size=500).","library":"smartnoise_synth"}
The default parameters of DPCTGAN do not work for the PENGUIN dataset. Hence, as advised in the error message, she decreases the batch_size (she also checks the documentation).
[27]:
res_dummy = client.smartnoise_synth_query(
synth_name="dpctgan",
epsilon=1.0,
synth_params = {"batch_size": 50},
dummy=True,
)
res_dummy['query_response'].head()
[27]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Gentoo | Torgersen | 47.809876 | 17.548668 | 188.346262 | 5152.039871 | MALE |
1 | Gentoo | Torgersen | 45.505892 | 15.785036 | 215.527822 | 4079.444863 | FEMALE |
2 | Chinstrap | Biscoe | 39.293022 | 19.115118 | 176.019657 | 3719.095856 | MALE |
3 | Adelie | Dream | 47.029771 | 17.322245 | 202.441730 | 4394.811036 | MALE |
4 | Gentoo | Dream | 44.393938 | 17.215382 | 200.788893 | 4653.589007 | FEMALE |
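The error message and the fix can be understood with a little arithmetic. The sketch below assumes that the sample rate is the batch size divided by the number of training rows, and that the dummy dataset has 100 rows; both are inferred from the message (500 / 100 = 5.0) and are not stated in this notebook. The helper names are hypothetical.

```python
def sample_rate(batch_size, n_rows):
    """Fraction of the training rows drawn per batch (assumed definition)."""
    return batch_size / n_rows


def is_valid_sample_rate(rate):
    """Treat the rate as a probability; the exact bounds are an assumption."""
    return 0.0 < rate <= 1.0


# Assumption: the dummy dataset has 100 rows (500 / 100 = 5.0, as in the error).
n_dummy_rows = 100

for batch_size in (500, 50):
    rate = sample_rate(batch_size, n_dummy_rows)
    print(batch_size, rate, is_valid_sample_rate(rate))
# 500 5.0 False
# 50 0.5 True
```

Under these assumptions, any batch_size up to the number of rows would give a valid rate, which is why lowering it to 50 resolves the error.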
PATEGAN: Private Aggregation of Teacher Ensembles
Unfortunately, she is not able to train the pategan synthesizer on the PENGUIN dataset. Hence, she must try another one.
[28]:
res_dummy = client.smartnoise_synth_query(
synth_name="pategan",
epsilon=1.0,
dummy=True,
)
res_dummy
Server error status 422: {"ExternalLibraryException":"pategan not reliable with this dataset.","library":"smartnoise_synth"}
PATECTGAN: Conditional tabular GAN using Private Aggregation of Teacher Ensembles
[29]:
res_dummy = client.smartnoise_synth_query(
synth_name="patectgan",
epsilon=1.0,
dummy=True,
)
res_dummy['query_response'].head()
[29]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 44.929629 | 17.136712 | 206.450379 | 3677.639037 | MALE |
1 | Chinstrap | Biscoe | 41.911177 | 15.958753 | 219.224480 | 3503.649727 | MALE |
2 | Chinstrap | Biscoe | 39.371106 | 18.538643 | 215.860750 | 6183.452576 | MALE |
3 | Gentoo | Dream | 39.778699 | 19.392393 | 220.403337 | 6291.852415 | MALE |
4 | Gentoo | Torgersen | 41.940040 | 18.712894 | 209.865149 | 3379.273981 | FEMALE |
[30]:
res_dummy = client.smartnoise_synth_query(
synth_name="patectgan",
epsilon=1.0,
select_cols = ["island", "bill_length_mm", "body_mass_g"],
synth_params = {
"embedding_dim": 256,
"generator_dim": (128, 128),
"discriminator_dim": (256, 256),
"generator_lr": 0.0003,
"generator_decay": 1e-05,
"discriminator_lr": 0.0003,
"discriminator_decay": 1e-05,
"batch_size": 500
},
nb_samples = 100,
dummy=True,
)
res_dummy['query_response'].head()
[30]:
island | bill_length_mm | body_mass_g | |
---|---|---|---|
0 | Torgersen | 48.482846 | 4627.040502 |
1 | Biscoe | 36.627252 | 4458.971380 |
2 | Biscoe | 50.437214 | 4419.834249 |
3 | Torgersen | 57.515079 | 4114.015296 |
4 | Dream | 37.914762 | 3729.208380 |
DPGAN: Differentially Private GAN
For DPGAN, the same warning as for DPCTGAN is raised: the random noise generation is not cryptographically secure.
[31]:
res_dummy = client.smartnoise_synth_query(
synth_name="dpgan",
epsilon=1.0,
dummy=True,
)
res_dummy['query_response'].head()
/code/lomas_client/utils.py:37: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).
warnings.warn(
[31]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Chinstrap | Torgersen | 47.418710 | 23.000000 | 192.233215 | 4478.892826 | MALE |
1 | Adelie | Biscoe | 45.057986 | 17.836150 | 195.967093 | 5085.548207 | MALE |
2 | Adelie | Dream | 46.495445 | 17.600734 | 195.489978 | 3655.695215 | MALE |
3 | Gentoo | Torgersen | 58.334219 | 19.706083 | 250.000000 | 7000.000000 | FEMALE |
4 | Adelie | Biscoe | 65.000000 | 16.195120 | 206.065695 | 4478.947320 | FEMALE |
One final time she samples with conditions:
[32]:
res_dummy = client.smartnoise_synth_query(
synth_name="dpgan",
epsilon=1.0,
condition = "body_mass_g > 5000",
dummy=True,
)
res_dummy['query_response'].head()
/code/lomas_client/utils.py:37: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).
warnings.warn(
[32]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 43.485997 | 19.496153 | 180.303064 | 6134.536535 | MALE |
1 | Gentoo | Torgersen | 63.116608 | 21.453384 | 189.907485 | 7000.000000 | MALE |
2 | Adelie | Dream | 65.000000 | 20.504224 | 189.021906 | 5361.825123 | MALE |
3 | Adelie | Dream | 65.000000 | 20.310123 | 193.809981 | 5012.155779 | FEMALE |
4 | Adelie | Torgersen | 64.386925 | 16.037792 | 188.511754 | 6332.941920 | MALE |
And now on the real dataset:
[33]:
res_dummy = client.smartnoise_synth_query(
synth_name="dpgan",
epsilon=1.0,
condition = "body_mass_g > 5000",
nb_samples = 10,
dummy=False,
)
res_dummy['query_response']
/code/lomas_client/utils.py:37: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).
warnings.warn(
[33]:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 43.332719 | 16.996525 | 197.196980 | 6542.874247 | FEMALE |
1 | Gentoo | Biscoe | 53.666000 | 19.943369 | 184.627506 | 5256.182522 | FEMALE |
2 | Adelie | Torgersen | 47.092630 | 21.298139 | 193.409859 | 5405.979574 | FEMALE |
3 | Chinstrap | Biscoe | 54.154047 | 15.847898 | 192.380197 | 6358.633310 | MALE |
4 | Adelie | Biscoe | 44.494324 | 16.248922 | 188.783114 | 5486.318067 | FEMALE |
5 | Gentoo | Biscoe | 45.828992 | 16.119691 | 209.792230 | 6048.840851 | FEMALE |
6 | Chinstrap | Biscoe | 45.586326 | 17.164572 | 196.064838 | 6345.135689 | FEMALE |
7 | Gentoo | Biscoe | 52.858690 | 18.386937 | 190.213206 | 5387.799114 | MALE |
8 | Chinstrap | Biscoe | 65.000000 | 17.667682 | 203.667950 | 7000.000000 | FEMALE |
9 | Chinstrap | Biscoe | 65.000000 | 19.445511 | 195.644815 | 5514.548540 | FEMALE |
Step 5: See archives of queries
She now wants to review all the queries that she ran on the real data. This is possible because an archive of all queries is kept in a secure database. With a single function call, she can see her queries, the budget they spent and the associated responses.
[34]:
previous_queries = client.get_previous_queries()
Let’s check the last query:
[36]:
last_query = previous_queries[-1]
last_query
[36]:
{'user_name': 'Dr. Antartica',
'dataset_name': 'PENGUIN',
'dp_librairy': 'smartnoise_synth',
'client_input': {'dataset_name': 'PENGUIN',
'synth_name': 'dpgan',
'epsilon': 1.0,
'delta': None,
'select_cols': [],
'synth_params': {},
'nullable': True,
'constraints': '',
'return_model': False,
'condition': 'body_mass_g > 5000',
'nb_samples': 10},
'response': {'requested_by': 'Dr. Antartica',
'query_response': species island bill_length_mm bill_depth_mm flipper_length_mm \
0 Adelie Torgersen 43.332719 16.996525 197.196980
1 Gentoo Biscoe 53.666000 19.943369 184.627506
2 Adelie Torgersen 47.092630 21.298139 193.409859
3 Chinstrap Biscoe 54.154047 15.847898 192.380197
4 Adelie Biscoe 44.494324 16.248922 188.783114
5 Gentoo Biscoe 45.828992 16.119691 209.792230
6 Chinstrap Biscoe 45.586326 17.164572 196.064838
7 Gentoo Biscoe 52.858690 18.386937 190.213206
8 Chinstrap Biscoe 65.000000 17.667682 203.667950
9 Chinstrap Biscoe 65.000000 19.445511 195.644815
body_mass_g sex
0 6542.874247 FEMALE
1 5256.182522 FEMALE
2 5405.979574 FEMALE
3 6358.633310 MALE
4 5486.318067 FEMALE
5 6048.840851 FEMALE
6 6345.135689 FEMALE
7 5387.799114 MALE
8 7000.000000 FEMALE
9 5514.548540 FEMALE ,
'spent_epsilon': 1.0,
'spent_delta': 0.00015673368198174188},
'timestamp': 1724943669.4523578}
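Each archived entry carries the budget it spent under response, so the archive can be used to tally the total spent so far. The sketch below sums the spent_epsilon and spent_delta fields over a list of entries shaped like the one above. Simply adding the values is an assumption about how the budget is accounted for; example_queries is illustrative, with the first entry being hypothetical and the second using the values shown above.

```python
import math


def total_spent(queries):
    """Sum the spent epsilon and delta over archived queries."""
    epsilon = math.fsum(q["response"]["spent_epsilon"] for q in queries)
    delta = math.fsum(q["response"]["spent_delta"] for q in queries)
    return epsilon, delta


# Illustrative archive: the first entry is hypothetical,
# the second uses the spent values from the output above.
example_queries = [
    {"response": {"spent_epsilon": 1.0, "spent_delta": 0.0001}},
    {"response": {"spent_epsilon": 1.0, "spent_delta": 0.00015673368198174188}},
]

spent_epsilon, spent_delta = total_spent(example_queries)
print(spent_epsilon, spent_delta)
```

With the real client, the same function could be applied to the list returned by get_previous_queries().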