Secure Data Disclosure: Client side
This notebook showcases how a researcher can use the Secure Data Disclosure system. It explains the different functionalities provided by the lomas_client library to interact with the secure server.
The secure data are never visible to researchers: they can only access differentially private responses via queries to the server.
Each user has access to one or multiple projects and, for each dataset, a limited privacy budget \((\epsilon, \delta)\).
🐧🐧🐧 In this notebook the researcher is a penguin researcher named Dr. Antartica. She aims to do groundbreaking research on the dimensions of various penguins.
Therefore, the powerful queen Icebergina 👑 had the data collected. But in order to get the penguins to agree to participate, she promised them that no one would be able to look at the data and that no one would be able to guess the bill width of any specific penguin (which is very sensitive information) from the data. Nobody! Not even the researchers. The queen hence stored the data on the Secure Data Disclosure Server and only gave a small budget to Dr. Antartica.
This is not a problem for Dr. Antartica, as she does not need to see the data to compute statistics, thanks to the Secure Data Disclosure client library lomas_client. 🐧🐧🐧
Step 1: Install the library
To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the library lomas_client in her local development environment.
It can be installed via the pip command:
[1]:
import sys
import os

# Add the parent directory to the import path (useful when running from the repository)
sys.path.append(os.path.abspath(os.path.join('..')))
# !pip install lomas_client
[2]:
from lomas_client.client import Client
import numpy as np
Step 2: Initialise the client
Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing the responses in the local environment, enabling seamless interaction with the server.
To create the client, Dr. Antartica needs to give it a few parameters:
- url: the root application endpoint of the remote secure server.
- user_name: her name as registered in the database (Dr. Antartica).
- dataset_name: the name of the dataset that she wants to query (PENGUIN).
She will only be able to query the real dataset if the queen Icebergina has previously created an account for her in the database, given her access to the PENGUIN dataset, and granted her some epsilon and delta credit. (As is done in the Secure Data Disclosure Notebook: Server side.)
[3]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)
And that’s it for the preparation. She is now ready to use the various functionalities offered by lomas_client.
Step 3: Understand the functionalities of the library
Getting dataset metadata
Dr. Antartica has never seen the data. As a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she just needs to call the get_dataset_metadata() function of the client. As this is public information, it does not cost any budget.
This function returns the metadata in the SmartnoiseSQL dictionary format, which contains, among other things, information about all the available columns, their types and their bounds (see the Smartnoise page for more details).
[4]:
metadata = client.get_dataset_metadata()
metadata
[4]:
{'max_ids': 1,
'row_privacy': True,
'censor_dims': False,
'columns': {'species': {'type': 'string',
'cardinality': 3,
'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
'island': {'type': 'string',
'cardinality': 3,
'categories': ['Torgersen', 'Biscoe', 'Dream']},
'bill_length_mm': {'type': 'float', 'lower': 30.0, 'upper': 65.0},
'bill_depth_mm': {'type': 'float', 'lower': 13.0, 'upper': 23.0},
'flipper_length_mm': {'type': 'float', 'lower': 150.0, 'upper': 250.0},
'body_mass_g': {'type': 'float', 'lower': 2000.0, 'upper': 7000.0},
'sex': {'type': 'string',
'cardinality': 2,
'categories': ['MALE', 'FEMALE']}},
'rows': 344}
Based on this, Dr. Antartica knows that there are 7 columns, 3 of string type (species, island, sex) and 4 of float type (bill length, bill depth, flipper length and body mass), with their associated bounds. She also knows from the field max_ids: 1 that each penguin appears at most once in the dataset, and from the field row_privacy: True that each row represents a single penguin.
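Since the metadata is a plain Python dictionary, she can also explore it programmatically. For instance, a quick sketch that prints the public bounds of the float columns:
# Print the public bounds of every float column in the metadata
for name, info in metadata["columns"].items():
    if info["type"] == "float":
        print(f"{name}: [{info['lower']}, {info['upper']}]")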
Get a dummy dataset
Now that she has seen and understood the metadata, she wants to get an even better understanding of the dataset (while still not being able to see it). A solution to get an idea of what the dataset looks like is to create a dummy dataset.
Based on the public metadata of the dataset, a random dataframe can be created. By default, it has 100 rows and the seed is set to 42 to ensure reproducibility, but these two parameters can be changed to obtain different dummy datasets. Getting a dummy dataset does not affect the budget, as there is no differential privacy involved: it is not a synthetic dataset, and everything that could be learned from it is already present in the public metadata.
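To see why no budget is involved, here is a minimal sketch of how such a dummy dataset could be built from the public metadata alone (illustrative only; make_dummy_dataset is a hypothetical helper, not the library’s actual implementation):
import pandas as pd

def make_dummy_dataset(metadata, nb_rows=100, seed=42):
    # Illustrative only: sample each column independently from its public metadata,
    # uniformly within bounds for floats and over the categories for strings.
    rng = np.random.default_rng(seed)
    columns = {}
    for name, info in metadata["columns"].items():
        if info["type"] == "float":
            columns[name] = rng.uniform(info["lower"], info["upper"], size=nb_rows)
        else:
            columns[name] = rng.choice(info["categories"], size=nb_rows)
    return pd.DataFrame(columns)
Since every value is sampled from public information only, nothing about the real penguins can leak from such a dataframe.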
Dr. Antartica first creates a dummy dataset with the default options.
[5]:
df_dummy = client.get_dummy_dataset()
print(df_dummy.shape)
df_dummy.head()
(100, 7)
[5]:
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Chinstrap | Torgersen | 43.108904 | 13.314292 | 214.203165 | 2258.408606 | FEMALE |
| 1 | Adelie | Dream | 63.275001 | 19.364104 | 158.413996 | 4656.773158 | FEMALE |
| 2 | Adelie | Dream | 55.619788 | 16.143560 | 166.162871 | 4703.175608 | FEMALE |
| 3 | Adelie | Biscoe | 50.953047 | 18.085707 | 239.855419 | 5187.149507 | MALE |
| 4 | Gentoo | Torgersen | 35.460652 | 22.075665 | 210.642906 | 5630.456669 | MALE |
However, she would prefer to have a dataset with 200 rows and chooses a seed of 0, hence:
[6]:
NB_ROWS = 200
SEED = 0
[7]:
df_dummy = client.get_dummy_dataset(nb_rows = NB_ROWS, seed = SEED)
print(df_dummy.shape)
df_dummy.head()
(200, 7)
[7]:
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Gentoo | Biscoe | 49.208473 | 16.117959 | 190.125950 | 2873.291927 | FEMALE |
| 1 | Gentoo | Torgersen | 55.031628 | 19.963435 | 242.929142 | 3639.940005 | FEMALE |
| 2 | Chinstrap | Torgersen | 51.096718 | 16.777518 | 159.961493 | 5401.743330 | MALE |
| 3 | Adelie | Biscoe | 49.070911 | 14.796037 | 244.530153 | 2316.038092 | MALE |
| 4 | Chinstrap | Biscoe | 44.827918 | 13.246787 | 236.948853 | 5036.246870 | FEMALE |
Query dummy dataset
Now that she has an idea of what the data looks like, she wants to start querying the real dataset for her research. Before doing so, other tools are at her disposal to reduce the risk of errors and avoid spending budget on irrelevant queries. Of course, these have no impact on the budget either.
It is possible to set the flag dummy=True in the various queries to perform the query on the dummy dataset instead of the real dataset, and thereby ensure that the queries do what is expected of them.
Therefore, Dr. Antartica computes the result on the dummy dataframe that she created locally and on the same dummy dataframe on the server (via a query), and compares them to ensure that the query is well defined and works within the server.
She tests with an example: the average bill length on the dataframe.
[8]:
# On the local dummy dataframe
result_local_dummy = round(df_dummy['bill_length_mm'].mean(), 5)
result_local_dummy
[8]:
47.51532
As the query on the server goes through the same workflow for dummy and real data, she still has to set values for the theoretical budget to spend on the dummy query. Of course, this theoretical budget will NOT affect her real budget, as this is dummy data.
It is recommended to use very high budget parameters here, so that little noise is added and the difference between the exact local result and the slightly noisy server result stays small.
Also, make sure to use the same number of rows and seed, so that both dummy datasets are identical.
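To see why large budget values mean little noise, here is an illustrative Laplace mechanism (the server’s actual mechanisms and sensitivities may differ): the noise scale is proportional to \(1/\epsilon\), so a huge \(\epsilon\) makes the added noise negligible.
rng = np.random.default_rng(42)
sensitivity = 1.0  # assumed sensitivity, for illustration only
for eps in [0.1, 1.0, 100.0]:
    # Laplace noise with scale = sensitivity / epsilon shrinks as epsilon grows
    noise = rng.laplace(scale=sensitivity / eps, size=3)
    print(f"epsilon={eps}: noise ~ {np.round(noise, 4)}")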
[9]:
# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"
[10]:
# On the remote server dummy dataframe
res = client.smartnoise_sql_query(
query = QUERY,
epsilon = 100.0, # make sure to select high values of epsilon and delta to have small differences
delta = 2.0, # make sure to select high values of epsilon and delta to have small differences
dummy = True,
nb_rows = NB_ROWS,
seed = SEED
)
res_server_dummy = res['query_response']["avg_bill_length_mm"][0]
res_server_dummy
[10]:
47.51229381350249
She then checks that the local dummy response and the server dummy response are close enough (any difference being due only to the small noise addition).
[11]:
np.testing.assert_almost_equal(
result_local_dummy,
res_server_dummy,
decimal=2,
err_msg="Responses differ; either try with a bigger budget or the query is not doing what is intended."
)
As you can see, result_local_dummy and res_server_dummy are close. We can accept that the small difference is due to the little noise added, which follows from the large values of \(\epsilon\) and \(\delta\).
Get current budget
It is the first time that Dr. Antartica connects to the server, and she wants to know how much budget the queen has assigned to her. Therefore, she calls the function get_initial_budget().
[12]:
client.get_initial_budget()
[12]:
{'initial_epsilon': 10.0, 'initial_delta': 0.005}
She sees that she has 10.0 epsilon and 0.005 delta at her disposal.
Then she checks her total spent budget with get_total_spent_budget(). As she has only queried the metadata and dummy dataframes so far, this should still be 0.
[13]:
client.get_total_spent_budget()
[13]:
{'total_spent_epsilon': 2.714285714286655, 'total_spent_delta': 0.0}
It is also useful to know the remaining budget. Therefore, she calls the function get_remaining_budget(). It simply subtracts the total spent budget from the initial budget.
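She could verify this arithmetic herself from the dictionaries returned by the two functions above (a quick sketch using the keys shown in their outputs):
initial = client.get_initial_budget()
spent = client.get_total_spent_budget()
# Remaining budget = initial budget - total spent budget
print(initial["initial_epsilon"] - spent["total_spent_epsilon"])
print(initial["initial_delta"] - spent["total_spent_delta"])
The printed values should match what get_remaining_budget() returns below.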
[14]:
client.get_remaining_budget()
[14]:
{'remaining_epsilon': 7.285714285713345, 'remaining_delta': 0.005}
The remaining budget is indeed the initial budget minus what has already been spent.
Estimate cost of a query
Another safeguard is the functionality to estimate the cost of a query. As in OpenDP and SmartnoiseSQL, the budget actually used by a query might be slightly different from what the user requests. The estimate_smartnoise_sql_cost() function returns the estimated real cost of any query.
Again, of course, this does not impact the user’s budget.
Dr. Antartica checks what computing the average bill length would really cost her if she runs the query with a given epsilon and delta.
[15]:
client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = 1.0,
delta = 1e-4
)
[15]:
{'epsilon_cost': 2.0, 'delta_cost': 4.999999999999449e-05}
So this query would actually cost her 2.0 epsilon and about 5e-05 delta. As she does not want to spend too much budget here, she tries other budget values.
[16]:
client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = 0.2,
delta = 1e-5
)
[16]:
{'epsilon_cost': 0.4, 'delta_cost': 5.000000000032756e-06}
This query would actually cost her 0.4 epsilon and a tiny delta of about 5e-06. She decides that this is good enough.
[17]:
EPSILON = 0.2
DELTA = 1e-5
Query real dataset
Now that all the safeguard functions have been tested, Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. By default, the flag dummy is False, so setting it is optional. She uses the values of epsilon and delta that she selected just before.
Careful: this command DOES spend the user’s budget, and the remaining budget is updated after every query.
[18]:
client.get_remaining_budget()
[18]:
{'remaining_epsilon': 7.285714285713345, 'remaining_delta': 0.005}
[19]:
avg_bill_length_response = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
dummy = False
)
[20]:
avg_bill_length = avg_bill_length_response['query_response']['avg_bill_length_mm'].iloc[0]
print(f"Average bill length: {np.round(avg_bill_length, 2)}mm.")
Average bill length: 44.18mm.
After each query on the real dataset, the budget information is also returned to the researcher. It is possible to check the remaining budget again afterwards:
[21]:
client.get_remaining_budget()
[21]:
{'remaining_epsilon': 6.885714285713345,
'remaining_delta': 0.004994999999999967}
As can be seen in get_total_spent_budget(), it is the budget estimated with estimate_smartnoise_sql_cost() that was spent.
[22]:
client.get_total_spent_budget()
[22]:
{'total_spent_epsilon': 3.114285714286655,
'total_spent_delta': 5.000000000032756e-06}
Dr. Antartica now has a differentially private estimate of the average bill length of all penguins, and is confident to use the library for the rest of her analyses.
Step 4: Penguin statistics
Confidence intervals for bill length over the whole population
She first wants a better idea of the distribution of bill length across all species. She already has the mean from Step 3, so she just needs to compute the standard deviation and the number of penguins in the dataset. As this is just an exploration step, she uses small budget values.
[23]:
QUERY = "SELECT COUNT(bill_length_mm) AS nb_penguin, STD(bill_length_mm) AS std_bill_length_mm FROM df"
She again first verifies that her query works on the dummy dataset:
[24]:
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = 100.0,
delta = 10.0,
dummy = True
)
dummy_res['query_response']
[24]:
| | nb_penguin | std_bill_length_mm |
|---|---|---|
| 0 | 100 | 10.332225 |
The syntax of the query works, now she checks the budget:
[25]:
client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = 0.5,
delta = 1e-5
)
[25]:
{'epsilon_cost': 1.5, 'delta_cost': 5.000000000032756e-06}
That is a bit too much; she tries smaller values:
[26]:
client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = 0.25,
delta = 1e-5
)
[26]:
{'epsilon_cost': 0.75, 'delta_cost': 5.000000000032756e-06}
That’s fine, she is ready to query:
[27]:
response = client.smartnoise_sql_query(query = QUERY, epsilon = 0.25, delta = 1e-5)
response = response['query_response']
response
[27]:
| | nb_penguin | std_bill_length_mm |
|---|---|---|
| 0 | 343 | 13.064982 |
[28]:
nb_penguin = response['nb_penguin'].iloc[0]
print(f"Number of penguins: {nb_penguin}.")
std_bill_length = response['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")
Number of penguins: 343.
Standard deviation of bill length: 13.06.
She can now do all the postprocessing that she wants with the returned data without adding any privacy risk.
[29]:
# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")
Standard error of bill length: 0.71.
[30]:
# Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")
The 95% confidence interval of the bill length of all penguins is [42.8, 45.57].
Hypothesis testing
So, Dr. Antartica now has the opportunity to study the various penguin dimensions, and will run a hypothesis-testing analysis to discover whether bill length differs between penguin species.
She will do a two-tailed two-sample \(t\)-test.
The null hypothesis \(H_0\) is that bill length does not differ between species.
The alternative hypothesis \(H_a\) is that bill length does differ between species.
She sets the significance level at 0.05. For a two-tailed test on samples of this size, that corresponds to a critical value of roughly 1.96 for the \(t\) statistic.
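For reference, the statistic she will compute for each pair of species is \( t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \), where the textbook pooled standard deviation is \( s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \) and \(\bar{x}_i\), \(s_i\), \(n_i\) are the average, standard deviation and count returned by the query for species \(i\). (The helper function defined further below approximates \(s_p\) with a weighted average of the standard deviations rather than of the variances.)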
[31]:
# Two-tailed critical value at significance level 0.05 (large-sample approximation)
CRITICAL_VALUE = 1.96
[32]:
QUERY = "SELECT \
species AS species, \
COUNT(bill_length_mm) AS nb_penguin, \
AVG(bill_length_mm) AS avg_bill_length_mm, \
STD(bill_length_mm) AS std_bill_length_mm \
FROM df GROUP BY species"
She estimates how much budget it would really cost:
[33]:
client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = 1, delta = 1e-4)
[33]:
{'epsilon_cost': 3.0, 'delta_cost': 4.999999999999449e-05}
The real cost appears to be 3 times the epsilon that she requests. That is a lot, but she first tries the query on the dummy dataset to verify that everything works properly.
[34]:
dummy_res = client.smartnoise_sql_query(query = QUERY, epsilon = 1, delta = 1.0, dummy = True)
dummy_res
[34]:
{'query_response': species nb_penguin avg_bill_length_mm std_bill_length_mm
0 Adelie 39 45.659944 10.695675
1 Chinstrap 33 45.690454 14.067739
2 Gentoo 31 38.472887 14.542186}
With too little budget, a query may fail or return NaN values in its output; in that case, more budget must be spent for the query to succeed.
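A defensive check before spending real budget could look like this (a sketch; query_response is the pandas DataFrame returned by the client, as used throughout this notebook):
# Illustrative: flag the dummy response if any value is missing
if dummy_res["query_response"].isna().any().any():
    print("Some values are NaN: increase epsilon/delta and retry the dummy query.")
else:
    print("Dummy response looks complete.")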
[35]:
client.get_remaining_budget()
[35]:
{'remaining_epsilon': 6.135714285713345,
'remaining_delta': 0.004989999999999935}
Given the 3x multiplier, her remaining epsilon of about 6.14 allows her to request at most around 6.14/3 ≈ 2.05. She goes with 7.4/4 = 1.85, which should cost 5.55 epsilon and leave her a small reserve. Let’s check:
[42]:
client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = 7.4/4, delta = 1e-4)
[42]:
{'epsilon_cost': 5.550000000000001, 'delta_cost': 4.999999999999449e-05}
[43]:
dummy_res = client.smartnoise_sql_query(query = QUERY, epsilon = 7.4/4, delta = 1e-4, dummy = True)
dummy_res
[43]:
{'query_response': species nb_penguin avg_bill_length_mm std_bill_length_mm
0 Adelie 37 48.755816 3.634415
1 Chinstrap 33 46.912863 4.552931
2 Gentoo 29 41.803438 17.566451}
If the dummy query errors, she might need to re-run it a few times until it works; the budget is not affected by dummy queries anyway.
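A simple retry loop could automate this; a sketch reusing the same dummy query (safe, since dummy queries never touch the real budget):
# Illustrative: retry the dummy query a few times until it succeeds
for attempt in range(5):
    try:
        dummy_res = client.smartnoise_sql_query(query=QUERY, epsilon=7.4/4, delta=1e-4, dummy=True)
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed ({e}), retrying...")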
[44]:
flipper_length_response = client.smartnoise_sql_query(query = QUERY, epsilon = 7.4/4, delta = 1e-4)
And now she should have almost no remaining budget:
[45]:
client.get_remaining_budget()
[45]:
{'remaining_epsilon': 0.5857142857133439,
'remaining_delta': 0.00493999999999994}
But she can do her post-processing and hypothesis analysis.
[46]:
df_flipper = flipper_length_response['query_response']
df_flipper
[46]:
| | species | nb_penguin | avg_bill_length_mm | std_bill_length_mm |
|---|---|---|---|---|
| 0 | Adelie | 150 | 38.649887 | 3.997587 |
| 1 | Chinstrap | 67 | 49.285002 | 5.859511 |
| 2 | Gentoo | 122 | 47.557167 | 4.643492 |
And finally the \(t\)-test:
[47]:
def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    # Pooled standard deviation, approximated by a weighted average of the two stds
    pooled_std = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1)) / (nb_1 + nb_2 - 2)
    # Two-sample t statistic with the pooled standard deviation
    return (avg_1 - avg_2) / np.sqrt(pooled_std**2 * (1 / nb_1 + 1 / nb_2))
[48]:
nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]
[49]:
t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)
print(f"T test between specie 0 and specie 1: {np.round(t_01, 2)}. Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between specie 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between specie 1 and specie 2: {np.round(t_12, 2)}. Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")
T test between specie 0 and specie 1: -15.84. Reject null hypothesis: True.
T test between specie 0 and specie 2: -17.04. Reject null hypothesis: True.
T test between specie 1 and specie 2: 2.24. Reject null hypothesis: True.
All \(t\) statistics exceed the critical value of 1.96 in absolute value, so she can reject the null hypothesis for every pair of species: average bill length differs between species.
She finally computes the confidence intervals of the bill length for each species.
[50]:
ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper
[50]:
| | species | nb_penguin | avg_bill_length_mm | std_bill_length_mm | standard_error | ci_95_lower_bound | ci_95_upper_bound |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | 150 | 38.649887 | 3.997587 | 0.326402 | 38.010140 | 39.289634 |
| 1 | Chinstrap | 67 | 49.285002 | 5.859511 | 0.715853 | 47.881930 | 50.688074 |
| 2 | Gentoo | 122 | 47.557167 | 4.643492 | 0.420402 | 46.733179 | 48.381155 |
She can now go and present her findings to queen Icebergina.