Using Smartnoise-SQL¶
This notebook showcases how researcher could use lomas platform with Smartnoise-SQL. It explains the different functionnalities provided by the lomas-client client library to interact with lomas server.
The secure data are never visible by researchers. They can only access to differentially private responses via queries to the server.
Each user has access to one or multiple projects and for each dataset has a limited budget $\epsilon$, $\delta$.
In this notebook the researcher is a penguin researcher named Dr. Antarctica. She aims to do a grounbdbreaking research on various penguins data.
Step 1: Install the library¶
To interact with the secure server on which the data is stored, Dr.Antartica first needs to install the library lomas-client on her local developping environment.
It can be installed via the pip command:
# !pip install lomas_client
Or using a local version of the client
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
from lomas_client.client import Client
from csvw_eo.csvw_to_smartnoise_sql import csvw_to_smartnoise_sql
import numpy as np
Step 2: Initialise the client¶
Once the library is installed, a Client object must be created. It is responsible for sending sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server.
The client needs a few parameters to be created. Usually, these would be set in the environment by the system administrator (queen Icebergina) and be transparent to lomas users. In this instance, the following code snippet sets a few of these parameters that are specific to this notebook.
She will only be able to query on the real dataset if the administrator has previously made her an account in the database, given her access to the PENGUIN dataset and has given her some $\epsilon$, $\delta$ privacy loss budget.
# The following would usually be set in the environment by a system administrator
# and be tranparent to lomas users.
# Uncomment them if you are running against a Kubernetes deployment.
# They have already been set for you if you are running locally within a devenv or the Jupyter lab set up by Docker compose.
import os
# os.environ["LOMAS_CLIENT_APP_URL"] = "https://lomas.example.com:443"
# os.environ["LOMAS_CLIENT_OIDC_DISCOVERY_URL"] = "https://dex.example.com:443/.well-known/openid-configuration"
# os.environ["LOMAS_CLIENT_TELEMETRY__ENABLED"] = "false"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_ENDPOINT"] = "http://otel.example.com:445"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_INSECURE"] = "true"
# os.environ["LOMAS_CLIENT_TELEMETRY__SERVICE_ID"] = "my-app-client"
# os.environ["LOMAS_CLIENT_REALM"] = "lomas"
# We set these ones because they are specific to this notebook.
os.environ["LOMAS_CLIENT_USER_NAME"] = "dr.antartica@example.com"
os.environ["LOMAS_CLIENT_USER_PASSWORD"] = "dr.antartica"
os.environ["LOMAS_CLIENT_DATASET_NAME"] = "PENGUIN"
# Note that all client settings can also be passed as keyword arguments to the Client constructor.
# eg. client = Client(user_name = "Dr.Antartica") takes precedence over setting the "LOMAS_CLIENT_USER_NAME"
# environment variable.
client = Client()
[09:18:56] WARNING OIDC IdP or Lomas service configured without TLS -> using insecure transport http_client.py:35
And that's it for the preparation. She is now ready to use the various functionnalities offered by lomas-client.
Step 3: Getting dataset metadata¶
metadata = client.get_dataset_metadata()
metadata
{'@context': ['http://www.w3.org/ns/csvw', 'csvw-eo-context.jsonld'],
'@type': 'Table',
'privacyUnit': None,
'maxContributions': 1,
'maxLength': 344,
'publicLength': 344,
'tableSchema': {'columns': [{'@type': 'Column',
'name': 'species',
'datatype': <DataTypes.STRING: 'string'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'keys': ['Adelie', 'Chinstrap', 'Gentoo'],
'maxNumPartitions': 3},
{'@type': 'Column',
'name': 'island',
'datatype': <DataTypes.STRING: 'string'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'keys': ['Torgersen', 'Biscoe', 'Dream'],
'maxNumPartitions': 3},
{'@type': 'Column',
'name': 'bill_length_mm',
'datatype': <DataTypes.FLOAT: 'float'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'minimum': 30.0,
'maximum': 65.0},
{'@type': 'Column',
'name': 'bill_depth_mm',
'datatype': <DataTypes.FLOAT: 'float'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'minimum': 13.0,
'maximum': 23.0},
{'@type': 'Column',
'name': 'flipper_length_mm',
'datatype': <DataTypes.FLOAT: 'float'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'minimum': 150.0,
'maximum': 250.0},
{'@type': 'Column',
'name': 'body_mass_g',
'datatype': <DataTypes.FLOAT: 'float'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'minimum': 2000.0,
'maximum': 7000.0},
{'@type': 'Column',
'name': 'sex',
'datatype': <DataTypes.STRING: 'string'>,
'required': None,
'privacyId': None,
'nullableProportion': 0.0,
'keys': ['MALE', 'FEMALE'],
'maxNumPartitions': 2}]}}
nb_penguin = metadata['publicLength']
print(f"Number of penguins: {nb_penguin}.")
Number of penguins: 344.
Step 4: Average bill length with Smartnoise-SQL¶
Query dummy dataset¶
Now that she has an idea of what the data looks like, she wants to start querying the real dataset to for her research. However, before this, other tools are at her disposal to reduce potential error risks and avoid spending budget on irrelevant queries. Of course, this does not have any impact on the budget.
It is possible to specify the flag dummy=True in the various queries to perform the query on the dummy dataset instead of the real dataset and ensure that the queries are doing what is expected of them.
# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"
EPSILON = 0.5
DELTA = 1e-5
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
dummy = True,
)
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
'Average bill length on dummy: 44.16mm.'
Estimate cost of a query¶
Dr. Antartica checks the budget that computing the average bill length will really cost her if she asks the query with an epsilon and a delta.
cost = client.smartnoise_sql.cost(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
)
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
'This query would actually cost her 1.5 epsilon and 1.4999949999983109e-05 delta.'
This is actually twice as much as what she initially put in. In the background, Smartnoise-SQL decomposes the DP query in multiple other queries and the budget given as input is spent on each of these sub-queries. Here for the average, we need a sum divided by a count, hence EPSILON is spent once for the sum and then once more for the count. (see NOTE below for tips and explanation).
Overide DP mechanism¶
She wants to use another DP-mechanism for this query. She can change it via the mechanism argument. See Smartnoise-SQL documentation here for overriding mechanisms.
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"},
dummy = True,
)
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
'Average bill length on dummy: 50.17mm.'
cost = client.smartnoise_sql.cost(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"}
)
cost
CostResponse(epsilon=1.5, delta=2.49998000004803e-05)
Query real dataset¶
Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. The dummy flag is False by default, so setting it is optional. She uses the values of epsilon and delta that she selected just before.
Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query.
avg_bill_length_response = client.smartnoise_sql.query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
mechanisms = {"count": "gaussian", "sum_float": "laplace"},
dummy = False
)
avg_bill_length = avg_bill_length_response.result.df['avg_bill_length_mm'].iloc[0]
print(f"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.")
Average bill length on private data: 42.36mm.
After each query on the real dataset, the budget informations are also returned to the researcher. It is possible possible to check the remaining budget again afterwards:
Postprocess¶
It is also possible to use the 'postprocess' argument from Smartnoise-SQL see its documentation here by specifying it in the query.
dummy_res = client.smartnoise_sql.query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
postprocess = True,
dummy = True,
)
dummy_res
QueryResponse(epsilon=1.5, delta=1.4999949999983109e-05, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df= avg_bill_length_mm 0 38.707424))
dummy_res = client.smartnoise_sql.query(
query = QUERY,
epsilon = EPSILON,
delta = DELTA,
postprocess = False,
dummy = True,
)
dummy_res
QueryResponse(epsilon=1.5, delta=1.4999949999983109e-05, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df= res_0 res_1 res_2 0 93.989285 4920.846654 101.370471))
Step 4: Penguin statistics¶
Confidence intervals for flipper length over the whole population¶
She is first interested to have a better idea of the distribution of bill length of all species. She already has the number of penguins (=number of rows as max_ids=1) from the metadata and the average bill length from step 3, so she just needs to compute the standard deviation. As it is just an exploration step, she uses very little budget values.
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
She again first verifies that her query works on the dummy dataset:
dummy_res = client.smartnoise_sql.query(
query = QUERY,
epsilon = 0.5,
delta = 1e-5,
dummy = True
)
dummy_std = np.round(dummy_res.result.df['std_bill_length_mm'].iloc[0], 2)
f"The dummy standard variation is {dummy_std}."
'The dummy standard variation is 9.36.'
The syntax of the query works, now she checks the budget:
cost = client.smartnoise_sql.cost(
query = QUERY,
epsilon = 0.5,
delta = 1e-5
)
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
'This query would actually cost her 2.0 epsilon and 1.4999949999983109e-05 delta.'
This time it is three times the budget because the standard deviation needs the average, then a difference and a count again.
response = client.smartnoise_sql.query(
query = QUERY,
epsilon = 0.5,
delta = 1e-5
)
std_bill_length = response.result.df['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")
Standard deviation of bill length: 3.02.
She can now do all the postprocessing that she wants with the returned data without increasing the privacy risk.
# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")
Standard error of bill length: 0.16.
# Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")
The 95% confidence interval of the bill length of all penguins is [42.04, 42.67].
Note on budget with Smartnoise-SQL (Advanced)¶
All of these queries will cost the same budget in Smartnoise-SQL. The reason is that the smartnoise-sql translates the input query in sub queries, finds the answer for each sub query for the budget in input and then assembles the results. For the first 'standard deviation' query, it requires a count, an average, and only then the computation for the standard deviation. Hence, to save budget it is better to make a general query directly and retrieve all the 'sub-answers'.
epsilon = 1.0
delta = 1e-5
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)
QUERY = "SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)
A way to know the sub-queries of a query is to use the following Smartnoise-SQL code:
# Convert metadata to Smartnoise-SQL compliant metadata
snsql_metadata = csvw_to_smartnoise_sql(metadata)
snsql_metadata
{'': {'': {'df': {'max_ids': 1,
'rows': 344,
'species': {'name': 'species', 'type': 'string', 'nullable': True},
'island': {'name': 'island', 'type': 'string', 'nullable': True},
'bill_length_mm': {'name': 'bill_length_mm',
'type': 'float',
'nullable': True,
'lower': 30.0,
'upper': 65.0},
'bill_depth_mm': {'name': 'bill_depth_mm',
'type': 'float',
'nullable': True,
'lower': 13.0,
'upper': 23.0},
'flipper_length_mm': {'name': 'flipper_length_mm',
'type': 'float',
'nullable': True,
'lower': 150.0,
'upper': 250.0},
'body_mass_g': {'name': 'body_mass_g',
'type': 'float',
'nullable': True,
'lower': 2000.0,
'upper': 7000.0},
'sex': {'name': 'sex', 'type': 'string', 'nullable': True},
'row_privacy': True}}}}
# Write the query to inspect
QUERY = "SELECT STD(bill_length_mm) as std_bl FROM df"
#QUERY = "SELECT COUNT(*) as nb_row FROM df"
from snsql.sql.private_rewriter import Rewriter
rewriter = Rewriter(snsql_metadata)
rewriter.options.row_privacy = snsql_metadata[""][""]["df"]["row_privacy"]
rewriter.options.max_contrib = snsql_metadata[""][""]["df"]["max_ids"]
dp_query = rewriter.query(QUERY)
dp_query
<snsql._ast.ast.Query at 0x7c6c94ab1250>
The original dp query is represented as one query:
dp_query._named_symbols
{'std_bl': <snsql._ast.tokens.Symbol at 0x7c6c94aa8550>}
But has 4 named symbols inside: 2 alias for the 2 SQL subqueries
- 'keycount' for 'count_bill_length_mm',
- 'sum_alias_0xxxx' for 'sum_bill_length_mm'
subquery = dp_query.source.relations[0].primary.query
syms = subquery._named_symbols
syms
{'keycount': <snsql._ast.tokens.Symbol at 0x7c6ca85993d0>,
'sum_alias_0xf3d7': <snsql._ast.tokens.Symbol at 0x7c6ca87c2990>,
'count_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7c6ca87c2d00>,
'sum_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7c6c94aa8370>}
This last query with group_by will cost the same because max_ids=1 (a penguin appears in the dataset at most once) and so the group_by is applied on different partitions of the population.
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)
NOTE: in the current code of Smartnoise-SQL, there is no odometer. Meaning all queries are independant. If someone first queries the private dataset for a count, then a second time for the average and then for the standard deviation then the total cost will be added: 3 count + 2 average + 1 std. That's why it is better to do everything in one query.
Hypothesis testing¶
So, Dr. Antartica has now the opportunity to study the various penguins dimensions and will do an hypotheses testing analysis to discover if flipper length differ between the penguins species.
She will do a two-tailed two-sample $t$ test.
- The null hypothese $H_0$ is flipper length does not differ between species.
- The alternative hypothesis $H_a$ is flipper length does differ between species.
She set the level of significance at 0.05.
CRITICAL_VALUE = 0.05
QUERY = "SELECT \
species AS species, \
COUNT(bill_length_mm) AS nb_penguin, \
AVG(bill_length_mm) AS avg_bill_length_mm, \
STD(bill_length_mm) AS std_bill_length_mm \
FROM df GROUP BY species"
She checks the remaining budget:
client.get_remaining_budget()
RemainingBudgetResponse(remaining_epsilon=46.5, remaining_delta=0.004960000249999537)
She estimates how much budget it would really cost:
client.smartnoise_sql.cost(query = QUERY, epsilon = 1.0, delta = 1e-4)
CostResponse(epsilon=4.0, delta=0.00014999500000001387)
The real cost seems to be 3 times the epsilon that she sets. It is a lot but she tries on the dummy dataset to verify all is working properly.
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
dummy_res
--------------------------------------------------------------------------- ExternalLibraryException Traceback (most recent call last) Cell In[43], line 1 ----> 1 dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True) 2 dummy_res File ~/Desktop/lomas/client/lomas_client/libraries/smartnoise_sql.py:119, in SmartnoiseSQLClient.query(self, query, epsilon, delta, mechanisms, postprocess, dummy, nb_rows, seed) 116 body = request_model.model_validate(body_dict) 117 res = self.http_client.post(endpoint, body) --> 119 return validate_model_response(self.http_client, res, QueryResponse) File ~/Desktop/lomas/client/lomas_client/utils.py:69, in validate_model_response(client, response, response_model) 67 if job.status == "failed": 68 assert job.error is not None, f"job {job_uid} failed without error !" ---> 69 raise_error_from_model(job.error) 71 return response_model.model_validate(job.result) File ~/Desktop/lomas/core/lomas_core/error_handler.py:150, in raise_error_from_model(error_model) 148 raise InvalidQueryException(error_model.message) 149 case ExternalLibraryExceptionModel(): --> 150 raise ExternalLibraryException(error_model.library, error_model.message) 151 case UnauthorizedAccessExceptionModel(): 152 raise UnauthorizedAccessException(error_model.message) ExternalLibraryException: (<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, 'SQL Reader generated empty results. Epsilon: 0.1 and Delta: 1e-08 are too small to generate output.')
She did not give enough budget for the query to work. This is why there are 'NANs' in the output. She has to spend more budget for the query to work.
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4, dummy = True)
dummy_res
QueryResponse(epsilon=20.0, delta=0.00014999500000001387, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df= species nb_penguin avg_bill_length_mm std_bill_length_mm 0 Adelie 26 48.965521 9.278141 1 Chinstrap 32 47.942201 10.567298 2 Gentoo 40 46.568333 10.015278))
If it errors, she might need to re-run the query a few times until it works. The budget is not affected by dummy queries anyway.
from lomas_core.error_handler import InvalidQueryException
flipper_length_response = None
for i in range(0, 10):
try:
flipper_length_response = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4)
break
except InvalidQueryException:
print(f"Attempt number {i + 1} failed.")
continue
if flipper_length_response is None:
print("10 unsuccessfull query attempts.")
And now she should not have any remaining budget:
But she can do her post-processing and hypothesis analysis.
df_flipper = flipper_length_response.result.df
df_flipper
| species | nb_penguin | avg_bill_length_mm | std_bill_length_mm | |
|---|---|---|---|---|
| 0 | Adelie | 150 | 38.841433 | 3.187632 |
| 1 | Chinstrap | 68 | 48.653061 | 5.506999 |
| 2 | Gentoo | 123 | 47.730977 | 1.975289 |
And finally the $t$-test:
def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
standard_error = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)
return (avg_1 - avg_2)/np.sqrt(standard_error**2*(1/nb_1 + 1 /nb_2))
nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]
t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)
print(f"T test between species 0 and specie 1: {np.round(t_01, 2)}. Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between species 1 and specie 2: {np.round(t_12, 2)}. Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")
T test between species 0 and specie 1: -17.18. Reject null hypothesis: True. T test between species 0 and specie 2: -27.66. Reject null hypothesis: True. T test between species 1 and specie 2: 1.89. Reject null hypothesis: True.
All t-tests are above the critical value of 0.5. She can reject the null hypothesis.
She finally computes the confidence intervals for the flipper length of each species
ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper
| species | nb_penguin | avg_bill_length_mm | std_bill_length_mm | standard_error | ci_95_lower_bound | ci_95_upper_bound | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | 150 | 38.841433 | 3.187632 | 0.260269 | 38.331305 | 39.351560 |
| 1 | Chinstrap | 68 | 48.653061 | 5.506999 | 0.667822 | 47.344131 | 49.961992 |
| 2 | Gentoo | 123 | 47.730977 | 1.975289 | 0.178106 | 47.381890 | 48.080065 |