Lomas Client Side: Using Smartnoise-SQL
This notebook showcases how a researcher can use the lomas platform with Smartnoise-SQL. It explains the different functionalities provided by the lomas-client library to interact with the lomas server.
The secure data are never visible to researchers: they can only access differentially private responses via queries to the server.
Each user has access to one or multiple projects and, for each dataset, a limited privacy budget \((\epsilon, \delta)\).
In this notebook the researcher is a penguin researcher named Dr. Antartica. She aims to do groundbreaking research on various penguin data.
Step 1: Install the library
To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the lomas-client library in her local development environment.
It can be installed via the pip command:
[1]:
# !pip install lomas_client
Or using a local version of the client:
[2]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
[3]:
from lomas_client.client import Client
import numpy as np
Step 2: Initialise the client
Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment, enabling seamless interaction with the server.
To create the client, Dr. Antartica needs to give it a few parameters:
- url: the root application endpoint to the remote secure server
- user_name: her name as registered in the database (Dr. Antartica)
- dataset_name: the name of the dataset that she wants to query (PENGUIN)
She will only be able to query the real dataset if the administrator has previously created an account for her in the database, given her access to the PENGUIN dataset, and allocated her some \((\epsilon, \delta)\) privacy loss budget.
[4]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)
And that’s it for the preparation. She is now ready to use the various functionalities offered by lomas-client.
Step 3: Getting dataset metadata
[5]:
metadata = client.get_dataset_metadata()
metadata
[5]:
{'max_ids': 1,
'row_privacy': True,
'censor_dims': False,
'columns': {'species': {'type': 'string',
'cardinality': 3,
'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
'island': {'type': 'string',
'cardinality': 3,
'categories': ['Torgersen', 'Biscoe', 'Dream']},
'bill_length_mm': {'type': 'float', 'lower': 30.0, 'upper': 65.0},
'bill_depth_mm': {'type': 'float', 'lower': 13.0, 'upper': 23.0},
'flipper_length_mm': {'type': 'float', 'lower': 150.0, 'upper': 250.0},
'body_mass_g': {'type': 'float', 'lower': 2000.0, 'upper': 7000.0},
'sex': {'type': 'string',
'cardinality': 2,
'categories': ['MALE', 'FEMALE']}},
'rows': 344}
[6]:
nb_penguin = metadata['rows']
print(f"Number of penguins: {nb_penguin}.")
Number of penguins: 344.
[7]:
columns = metadata["columns"].keys()
columns
[7]:
dict_keys(['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])
Step 4: Average bill length with Smartnoise-SQL
Query dummy dataset
Now that she has an idea of what the data looks like, she wants to start querying the real dataset for her research. Before that, other tools are at her disposal to reduce the risk of errors and avoid spending budget on irrelevant queries. Of course, using them does not impact the budget.
It is possible to set the flag dummy=True in the various queries to perform the query on a dummy dataset instead of the real one, and thus check that a query does what is expected of it.
[8]:
# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"
[9]:
EPSILON = 0.5
DELTA = 1e-5
[10]:
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
dummy = True,
)
[11]:
avg_bl_dummy = np.round(dummy_res['query_response']["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
[11]:
'Average bill length on dummy: 46.68mm.'
Estimate cost of a query
Dr. Antartica checks how much budget computing the average bill length will really cost her if she runs the query with a given epsilon and delta.
[12]:
cost = client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
)
[13]:
f'This query would actually cost her {cost["epsilon_cost"]} epsilon and {cost["delta_cost"]} delta.'
[13]:
'This query would actually cost her 1.0 epsilon and 5.000000000032756e-06 delta.'
This is actually twice as much as what she initially put in. In the background, Smartnoise-SQL decomposes the DP query into multiple sub-queries, and the budget given as input is spent on each of them. Here, the average is a sum divided by a count, hence EPSILON is spent once for the sum and once more for the count (see the NOTE below for tips and explanation).
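A quick sanity check of the arithmetic (illustrative only, not an extra API call):

# AVG is rewritten as a noisy SUM divided by a noisy COUNT,
# and the input epsilon is spent once per sub-query:
expected_epsilon_cost = 2 * EPSILON  # 2 * 0.5 = 1.0, matching cost["epsilon_cost"]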
Override DP mechanism
She wants to use other DP mechanisms for this query. She can change them via the mechanisms argument. See the Smartnoise-SQL documentation here for overriding mechanisms.
[14]:
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"},
dummy = True,
)
[15]:
avg_bl_dummy = np.round(dummy_res['query_response']["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
[15]:
'Average bill length on dummy: 50.83mm.'
[16]:
cost = client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"}
)
cost
[16]:
{'epsilon_cost': 1.0, 'delta_cost': 1.4999949999983109e-05}
Query real dataset
Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. The dummy flag is False by default, so setting it is optional. She uses the values of epsilon and delta that she selected just before.
Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query.
[17]:
avg_bill_length_response = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"},
dummy = False
)
[18]:
avg_bill_length = avg_bill_length_response['query_response']['avg_bill_length_mm'].iloc[0]
print(f"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.")
Average bill length on private data: 45.19mm.
After each query on the real dataset, the budget information is also returned to the researcher. It is possible to check the remaining budget again afterwards:
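For instance (a quick sketch; the exact values returned depend on the budget already spent):

# Check the budget left after the real query above
client.get_remaining_budget()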
Postprocess
It is also possible to use the postprocess argument from Smartnoise-SQL (see its documentation here) by specifying it in the query.
[19]:
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
postprocess = True,
dummy = True,
)
dummy_res
[19]:
{'query_response': avg_bill_length_mm
0 46.850983}
[20]:
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
postprocess = False,
dummy = True,
)
dummy_res
[20]:
{'query_response': res_0 res_1
0 4659.909203 96.041455}
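With postprocess=False, the raw sub-query results are returned instead of the final statistic. Assuming res_0 is the noisy sum and res_1 the noisy count (the column order in the run above), the average can be reassembled by hand:

raw = dummy_res["query_response"]
manual_avg = raw["res_0"].iloc[0] / raw["res_1"].iloc[0]  # sum / count
print(f"Manually post-processed average: {np.round(manual_avg, 2)}mm.")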
Step 5: Penguin statistics
Confidence intervals for bill length over the whole population
She first wants to get a better idea of the distribution of bill length across all species. She already has the number of penguins (= number of rows, as max_ids=1) from the metadata and the average bill length from step 4, so she just needs to compute the standard deviation. As this is just an exploration step, she uses small budget values.
[21]:
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
She again first verifies that her query works on the dummy dataset:
[22]:
dummy_res = client.smartnoise_sql_query(
query = QUERY,
epsilon = 0.5,
delta = 1e-5,
dummy = True
)
[23]:
dummy_std = np.round(dummy_res['query_response']['std_bill_length_mm'].iloc[0], 2)
f"The dummy standard variation is {dummy_std}."
[23]:
'The dummy standard deviation is 16.64.'
The syntax of the query works, now she checks the budget:
[24]:
cost = client.estimate_smartnoise_sql_cost(
query = QUERY,
epsilon = 0.5,
delta = 1e-5
)
[25]:
f'This query would actually cost her {cost["epsilon_cost"]} epsilon and {cost["delta_cost"]} delta.'
[25]:
'This query would actually cost her 1.5 epsilon and 5.000000000032756e-06 delta.'
This time it is three times the input budget, because the standard deviation needs the average, then a difference, and then a count again.
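The arithmetic can again be checked (illustrative only):

# STD decomposes into three sub-queries, each spending the input epsilon once:
expected_epsilon_cost = 3 * 0.5  # = 1.5, matching cost["epsilon_cost"]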
[26]:
response = client.smartnoise_sql_query(
query = QUERY,
epsilon = 0.5,
delta = 1e-5
)
[27]:
std_bill_length = response['query_response']['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")
Standard deviation of bill length: 8.83.
She can now do all the postprocessing that she wants with the returned data without increasing the privacy risk.
[28]:
# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")
Standard error of bill length: 0.48.
[29]:
# Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")
The 95% confidence interval of the bill length of all penguins is [44.25, 46.12].
Note on budget with Smartnoise-SQL (Advanced)
All of these queries cost the same budget in Smartnoise-SQL. The reason is that Smartnoise-SQL translates the input query into sub-queries, answers each sub-query with the input budget, and then assembles the results. For the first ‘standard deviation’ query, it requires a count and an average before the standard deviation itself can be computed. Hence, to save budget, it is better to make one general query directly and retrieve all the ‘sub-answers’ at once.
[30]:
epsilon = 1.0
delta = 1e-5
[31]:
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
cost = client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[31]:
{'epsilon_cost': 3.0, 'delta_cost': 5.000000000032756e-06}
[32]:
QUERY = "SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[32]:
{'epsilon_cost': 3.0, 'delta_cost': 5.000000000032756e-06}
[33]:
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[33]:
{'epsilon_cost': 3.0, 'delta_cost': 5.000000000032756e-06}
A way to know the sub-queries of a query is to use the following Smartnoise-SQL code:
[34]:
# Convert metadata to Smartnoise-SQL compliant metadata
metadata = dict(metadata)
metadata.update(metadata["columns"])
del metadata["columns"]
snsql_metadata = {"": {"": {"df": metadata}}}
[35]:
# Write the query to inspect
QUERY = "SELECT STD(bill_length_mm) as std_bl FROM df"
#QUERY = "SELECT COUNT(*) as nb_row FROM df"
[36]:
from snsql.sql.private_rewriter import Rewriter
rewriter = Rewriter(snsql_metadata)
rewriter.options.row_privacy = metadata["row_privacy"]
rewriter.options.max_contrib = metadata["max_ids"]
dp_query = rewriter.query(QUERY)
dp_query
[36]:
<snsql._ast.ast.Query at 0x70772e4e5310>
The original DP query is represented as one query:
[37]:
dp_query._named_symbols
[37]:
{'std_bl': <snsql._ast.tokens.Symbol at 0x70772e4f5a90>}
But its subquery has 4 named symbols inside, including 2 aliases for the 2 SQL sub-queries:
- ‘keycount’ for ‘count_bill_length_mm’
- ‘sum_alias_0xxxx’ for ‘sum_bill_length_mm’
[38]:
subquery = dp_query.source.relations[0].primary.query
syms = subquery._named_symbols
syms
[38]:
{'keycount': <snsql._ast.tokens.Symbol at 0x70772f81b410>,
'sum_alias_0xde09': <snsql._ast.tokens.Symbol at 0x70772e4e7990>,
'count_bill_length_mm': <snsql._ast.tokens.Symbol at 0x70772e4e7a10>,
'sum_bill_length_mm': <snsql._ast.tokens.Symbol at 0x70772e4779d0>}
This last query with group_by will cost the same because max_ids=1 (a penguin appears in the dataset at most once), so the group_by is applied on disjoint partitions of the population.
[39]:
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species"
cost = client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[39]:
{'epsilon_cost': 3.0, 'delta_cost': 5.000000000032756e-06}
NOTE: in the current code of Smartnoise-SQL, there is no odometer, meaning all queries are independent. If someone first queries the private dataset for a count, then a second time for the average, and then for the standard deviation, the total cost adds up: 1x for the count, 2x for the average and 3x for the standard deviation, i.e. 6x the input epsilon in total. That is why it is better to do everything in one query.
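To make the accounting concrete, a small sketch of the arithmetic described above (plain Python, not an API call):

EPS = 1.0
# Three separate queries, each paying for its own sub-queries:
sequential_cost = 1 * EPS + 2 * EPS + 3 * EPS  # count, then average, then std -> 6.0
# One combined query, where the shared sub-queries are computed once:
combined_cost = 3 * EPS  # matches the 3.0 epsilon_cost estimated above
print(sequential_cost, combined_cost)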
Hypothesis testing
So, Dr. Antartica now has the opportunity to study the various penguin dimensions and will run a hypothesis test to discover whether bill length differs between penguin species.
She will do a two-tailed two-sample \(t\)-test.
The null hypothesis \(H_0\): bill length does not differ between species.
The alternative hypothesis \(H_a\): bill length does differ between species.
She sets the significance level at 0.05.
[40]:
ALPHA = 0.05
# Two-tailed critical value for alpha = 0.05 (normal approximation, large samples)
CRITICAL_VALUE = 1.96
[41]:
QUERY = "SELECT \
species AS species, \
COUNT(bill_length_mm) AS nb_penguin, \
AVG(bill_length_mm) AS avg_bill_length_mm, \
STD(bill_length_mm) AS std_bill_length_mm \
FROM df GROUP BY species"
She checks the remaining budget:
[42]:
client.get_remaining_budget()
[42]:
{'remaining_epsilon': 7.5, 'remaining_delta': 0.004980000049999984}
She estimates how much budget it would really cost:
[43]:
client.estimate_smartnoise_sql_cost(query = QUERY, epsilon = 1.0, delta = 1e-4)
[43]:
{'epsilon_cost': 3.0, 'delta_cost': 4.999999999999449e-05}
The real cost is 3 times the epsilon that she set. It is a lot, but she first tries on the dummy dataset to verify that everything works properly.
[44]:
dummy_res = client.smartnoise_sql_query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
dummy_res
[44]:
{'query_response': species nb_penguin avg_bill_length_mm std_bill_length_mm
0 Adelie 25 59.830486 27.444144
1 Chinstrap 47 13.527649 28.501660
2 Gentoo 28 42.375624 33.967001}
She did not give enough budget for the query to return precise results: the counts, averages and standard deviations are heavily noised. She has to spend more budget for the query to be useful.
[45]:
dummy_res = client.smartnoise_sql_query(query = QUERY, epsilon = 7.5/3, delta = 1e-4, dummy = True)
dummy_res
[45]:
{'query_response': species nb_penguin avg_bill_length_mm std_bill_length_mm
0 Adelie 36 49.021514 3.944748
1 Chinstrap 31 49.048848 3.801831
2 Gentoo 30 41.176308 7.628134}
If it errors, she might need to re-run the query a few times until it works; the budget is not affected by dummy queries anyway.
[46]:
flipper_length_response = client.smartnoise_sql_query(query = QUERY, epsilon = 7.5/3, delta = 1e-4)
And now she should have no epsilon budget remaining.
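She can verify it with a quick check (a sketch; the exact remaining delta depends on the queries run so far):

# The epsilon budget should now be exhausted
client.get_remaining_budget()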
She can still do her post-processing and hypothesis analysis without spending any more budget.
[47]:
df_flipper = flipper_length_response['query_response']
df_flipper
[47]:
| | species | nb_penguin | avg_bill_length_mm | std_bill_length_mm |
|---|---|---|---|---|
| 0 | Adelie | 151 | 38.362705 | 5.465330 |
| 1 | Chinstrap | 67 | 48.867188 | 3.828321 |
| 2 | Gentoo | 123 | 47.257728 | 5.387484 |
And finally the \(t\)-test:
[48]:
def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    # Pool the two standard deviations, weighted by their degrees of freedom
    standard_error = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)
    # Two-sample t statistic under the equal-variance assumption
    return (avg_1 - avg_2)/np.sqrt(standard_error**2*(1/nb_1 + 1/nb_2))
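Note that this pools the two standard deviations linearly. For comparison, a textbook pooled-variance version would look like this (a sketch; t_test_pooled is not part of the analysis above and gives only slightly different values):

def t_test_pooled(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    # Pooled variance: weighted average of the squared deviations
    pooled_var = ((nb_1 - 1) * std_1**2 + (nb_2 - 1) * std_2**2) / (nb_1 + nb_2 - 2)
    return (avg_1 - avg_2) / np.sqrt(pooled_var * (1 / nb_1 + 1 / nb_2))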
[49]:
nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]
[50]:
t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)
print(f"T test between species 0 and specie 1: {np.round(t_01, 2)}. Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between species 1 and specie 2: {np.round(t_12, 2)}. Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")
T test between species 0 and species 1: -14.41. Reject null hypothesis: True.
T test between species 0 and species 2: -13.49. Reject null hypothesis: True.
T test between species 1 and species 2: 2.19. Reject null hypothesis: True.
All \(|t|\) statistics are above the critical value of 1.96, so she can reject the null hypothesis for every pair of species.
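She could also turn the statistics into two-sided p-values, as in this sketch (assuming scipy is available in her environment):

from scipy import stats

# Two-sided p-value for the first comparison, using pooled degrees of freedom
dof = nb_0 + nb_1 - 2
p_01 = 2 * stats.t.sf(abs(t_01), dof)
print(f"p-value (species 0 vs species 1): {p_01:.2e}")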
She finally computes the confidence intervals for the bill length of each species:
[51]:
ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper
[51]:
| | species | nb_penguin | avg_bill_length_mm | std_bill_length_mm | standard_error | ci_95_lower_bound | ci_95_upper_bound |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | 151 | 38.362705 | 5.465330 | 0.444762 | 37.490971 | 39.234439 |
| 1 | Chinstrap | 67 | 48.867188 | 3.828321 | 0.467704 | 47.950489 | 49.783888 |
| 2 | Gentoo | 123 | 47.257728 | 5.387484 | 0.485773 | 46.305613 | 48.209843 |