Lomas Client Side: Using Smartnoise-SQL

This notebook showcases how a researcher can use the lomas platform with Smartnoise-SQL. It explains the different functionalities provided by the lomas-client library to interact with the lomas server.

The secure data is never visible to researchers: they can only access differentially private responses via queries to the server.

Each user has access to one or multiple projects and, for each dataset, a limited privacy budget \(\epsilon\), \(\delta\).

In this notebook, the researcher is a penguin specialist named Dr. Antartica. She aims to do groundbreaking research on various penguin data.

Step 1: Install the library

To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the lomas-client library in her local development environment.

It can be installed via the pip command:

[1]:
# !pip install lomas_client

Or using a local version of the client

[2]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
[3]:
from lomas_client.client import Client
import numpy as np

Step 2: Initialise the client

Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment, enabling seamless interaction with the server.

To create the client, Dr. Antartica needs to give it a few parameters:

  • url: the root application endpoint of the remote secure server.

  • user_name: her name as registered in the database (Dr. Antartica).

  • dataset_name: the name of the dataset that she wants to query (PENGUIN).

She will only be able to query the real dataset if the administrator has previously created an account for her in the database, given her access to the PENGUIN dataset, and allocated her some \(\epsilon\), \(\delta\) privacy loss budget.

[4]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)

And that’s it for the preparation. She is now ready to use the various functionalities offered by lomas-client.
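Before running any query, she can also check the total \(\epsilon\), \(\delta\) budget allocated to her with get_remaining_budget (used again later in this notebook). A quick sanity check, assuming the administrator set up her account as described above:

[ ]:
# Total budget available to Dr. Antartica on PENGUIN
# (checking the budget does not spend any of it)
client.get_remaining_budget()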

Step 3: Getting dataset metadata

[5]:
metadata = client.get_dataset_metadata()
metadata
[5]:
{'max_ids': 1,
 'rows': 344,
 'row_privacy': True,
 'censor_dims': False,
 'columns': {'species': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
  'island': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Torgersen', 'Biscoe', 'Dream']},
  'bill_length_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 30.0,
   'upper': 65.0},
  'bill_depth_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 13.0,
   'upper': 23.0},
  'flipper_length_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 150.0,
   'upper': 250.0},
  'body_mass_g': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 2000.0,
   'upper': 7000.0},
  'sex': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 2,
   'categories': ['MALE', 'FEMALE']}}}
[6]:
nb_penguin = metadata['rows']
print(f"Number of penguins: {nb_penguin}.")
Number of penguins: 344.
[7]:
columns = metadata["columns"].keys()
columns
[7]:
dict_keys(['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])
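The metadata also carries the information needed to interpret results later on: types and categories for the string columns, bounds for the numeric ones. A small loop gives a compact overview (a convenience sketch using only the metadata dictionary above):

[ ]:
# Summarise each column: its type plus categories or numeric bounds
for name, col in metadata["columns"].items():
    if col["type"] == "string":
        print(f"{name}: {col['type']}, categories: {col['categories']}")
    else:
        print(f"{name}: {col['type']}, bounds: [{col['lower']}, {col['upper']}]")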

Step 4: Average bill length with Smartnoise-SQL

Query dummy dataset

Now that she has an idea of what the data looks like, she wants to start querying the real dataset for her research. Before doing so, however, other tools are at her disposal to reduce the risk of errors and avoid spending budget on irrelevant queries. Of course, these do not have any impact on the budget.

It is possible to specify the flag dummy=True in the various queries to perform the query on the dummy dataset instead of the real dataset and ensure that the queries are doing what is expected of them.

[8]:
# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"
[9]:
EPSILON = 0.5
DELTA = 1e-5
[10]:
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    dummy = True,
)
[11]:
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
[11]:
'Average bill length on dummy: 45.27mm.'

Estimate cost of a query

Dr. Antartica checks how much budget computing the average bill length would actually cost her if she ran the query with a given epsilon and delta.

[12]:
cost = client.smartnoise_sql.cost(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
)
[13]:
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
[13]:
'This query would actually cost her 1.0 epsilon and 5.000000000032756e-06 delta.'

This is actually twice as much as what she initially put in. In the background, Smartnoise-SQL decomposes the DP query into multiple sub-queries, and the budget given as input is spent on each of them. Here, the average is a sum divided by a count, so EPSILON is spent once for the sum and once more for the count (see the NOTE below for tips and explanation).
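Put differently, \(\epsilon_{total} = \epsilon_{sum} + \epsilon_{count} = 2\epsilon\). This can be double-checked with the cost estimator, which never spends budget (a minimal sketch reusing the cost call from above):

[ ]:
# AVG = noisy SUM / noisy COUNT: the input EPSILON is spent once per sub-query
cost = client.smartnoise_sql.cost(query=QUERY, epsilon=EPSILON, delta=DELTA)
assert cost.epsilon == 2 * EPSILON  # 1.0 == 2 * 0.5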

Override DP mechanism

She wants to use another DP mechanism for this query. She can change it via the mechanisms argument. See the Smartnoise-SQL documentation here for overriding mechanisms.

[14]:
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = True,
)
[15]:
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
[15]:
'Average bill length on dummy: 41.37mm.'
[16]:
cost = client.smartnoise_sql.cost(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"}
)
cost
[16]:
CostResponse(epsilon=1.0, delta=1.4999949999983109e-05)

Query real dataset

Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. The dummy flag is False by default, so setting it is optional. She uses the values of epsilon and delta that she selected just before.

Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query.

[17]:
avg_bill_length_response = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = False
)
[18]:
avg_bill_length = avg_bill_length_response.result.df['avg_bill_length_mm'].iloc[0]
print(f"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.")
Average bill length on private data: 43.5mm.

After each query on the real dataset, the budget information is also returned to the researcher. It is possible to check the remaining budget again afterwards:
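For instance (an unexecuted sketch; the same call appears again further below):

[ ]:
# Check the budget remaining after the real query above
client.get_remaining_budget()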

Postprocess

It is also possible to use the ‘postprocess’ argument from Smartnoise-SQL (see its documentation here) by specifying it in the query.

[19]:
dummy_res = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = True,
    dummy = True,
)
dummy_res
[19]:
QueryResponse(epsilon=1.0, delta=5.000000000032756e-06, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df=   avg_bill_length_mm
0           50.016594))
[20]:
dummy_res = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = False,
    dummy = True,
)
dummy_res
[20]:
QueryResponse(epsilon=1.0, delta=5.000000000032756e-06, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df=         res_0       res_1
0  4304.177184  101.480531))

Step 5: Penguin statistics

Confidence intervals for bill length over the whole population

She is first interested in getting a better idea of the distribution of bill length across all species. She already has the number of penguins (= the number of rows, since max_ids=1) from the metadata in step 3 and the average bill length from step 4, so she just needs to compute the standard deviation. As this is just an exploration step, she uses very small budget values.

[21]:
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"

She again first verifies that her query works on the dummy dataset:

[22]:
dummy_res = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = 0.5,
    delta = 1e-5,
    dummy = True
)
[23]:
dummy_std = np.round(dummy_res.result.df['std_bill_length_mm'].iloc[0], 2)
f"The dummy standard variation is {dummy_std}."
[23]:
'The dummy standard deviation is 10.09.'

The query syntax works; now she checks its cost:

[24]:
cost = client.smartnoise_sql.cost(
    query = QUERY,
    epsilon = 0.5,
    delta = 1e-5
)
[25]:
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
[25]:
'This query would actually cost her 1.5 epsilon and 5.000000000032756e-06 delta.'

This time it is three times the budget because the standard deviation decomposes into three sub-queries: a count, a sum (for the average), and a sum of squared values.

[26]:
response = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = 0.5,
    delta = 1e-5
)
[27]:
std_bill_length = response.result.df['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")
Standard deviation of bill length: 6.82.

She can now do all the postprocessing that she wants with the returned data without increasing the privacy risk.

[28]:
# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")
Standard error of bill length: 0.37.
[29]:
# Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")
The 95% confidence interval of the bill length of all penguins is [42.78, 44.22].
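Since this post-processing will be reused for each species later on, it can be wrapped in a small helper (a hypothetical convenience function, not part of lomas-client):

[ ]:
def confidence_interval(avg, std, n, zscore=1.96):
    # Standard error from the (DP) standard deviation and row count
    standard_error = std / np.sqrt(n)
    return avg - zscore * standard_error, avg + zscore * standard_error

# Reproduces the interval computed above
confidence_interval(avg_bill_length, std_bill_length, nb_penguin)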

Note on budget with Smartnoise-SQL (Advanced)

All of the following queries cost the same budget in Smartnoise-SQL. The reason is that Smartnoise-SQL translates the input query into sub-queries, answers each sub-query with the input budget, and then assembles the results. The first ‘standard deviation’ query alone already requires a count and a sum (for the average) before the standard deviation can be computed. Hence, to save budget, it is better to make one general query directly and retrieve all the ‘sub-answers’ at once.

[30]:
epsilon = 1.0
delta = 1e-5
[31]:
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[31]:
CostResponse(epsilon=3.0, delta=5.000000000032756e-06)
[32]:
QUERY = "SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[32]:
CostResponse(epsilon=3.0, delta=5.000000000032756e-06)
[33]:
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[33]:
CostResponse(epsilon=3.0, delta=5.000000000032756e-06)

A way to inspect the sub-queries of a query is to use the following Smartnoise-SQL code:

[34]:
# Convert metadata to Smartnoise-SQL compliant metadata
metadata = dict(metadata)
metadata.update(metadata["columns"])
del metadata["columns"]
snsql_metadata = {"": {"": {"df": metadata}}}
[35]:
# Write the query to inspect
QUERY = "SELECT STD(bill_length_mm) as std_bl FROM df"
#QUERY = "SELECT COUNT(*) as nb_row FROM df"
[36]:
from snsql.sql.private_rewriter import Rewriter
rewriter = Rewriter(snsql_metadata)
rewriter.options.row_privacy = metadata["row_privacy"]
rewriter.options.max_contrib = metadata["max_ids"]
dp_query = rewriter.query(QUERY)
dp_query
[36]:
<snsql._ast.ast.Query at 0x7cbd404b92e0>

The original DP query is represented as a single query:

[37]:
dp_query._named_symbols
[37]:
{'std_bl': <snsql._ast.tokens.Symbol at 0x7cbd404ba0f0>}

But the underlying sub-query has 4 named symbols inside, including 2 aliases for the 2 SQL sub-aggregations:

  • ‘keycount’ for ‘count_bill_length_mm’

  • ‘sum_alias_0xxxx’ for ‘sum_bill_length_mm’

[38]:
subquery = dp_query.source.relations[0].primary.query
syms = subquery._named_symbols
syms
[38]:
{'keycount': <snsql._ast.tokens.Symbol at 0x7cbd404afe90>,
 'sum_alias_0xd1a8': <snsql._ast.tokens.Symbol at 0x7cbd40698560>,
 'count_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7cbd40632ae0>,
 'sum_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7cbd40605b50>}
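To see what SQL is actually executed against the data, the rewritten sub-query can be rendered back to text (a sketch, assuming the snsql AST objects stringify to SQL as in the Smartnoise-SQL examples):

[ ]:
# Render the rewritten sub-query back to SQL text
print(str(subquery))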

This last query with a GROUP BY will cost the same because max_ids=1 (a penguin appears in the dataset at most once), so the GROUP BY is applied on disjoint partitions of the population, each individual contributing to at most one partition.

[39]:
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
[39]:
CostResponse(epsilon=3.0, delta=5.000000000032756e-06)

NOTE: in the current code of Smartnoise-SQL, there is no odometer, meaning all queries are accounted for independently. If someone queries the private dataset first for a count, then a second time for the average, and then for the standard deviation, the total cost adds up: the count is paid for 3 times, the sum twice, and the squared sum once, i.e. 6 sub-queries instead of 3. That’s why it is better to do everything in one query.
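The effect can be illustrated with the cost estimator alone, so no budget is spent (a sketch; the expected total assumes a plain count costs a single epsilon):

[ ]:
# Asking separately: expected 1 (count) + 2 (avg) + 3 (std) = 6 epsilon,
# versus 3 epsilon for the single combined query above
separate_queries = [
    "SELECT COUNT(bill_length_mm) AS count_bl FROM df",
    "SELECT AVG(bill_length_mm) AS avg_bl FROM df",
    "SELECT STD(bill_length_mm) AS std_bl FROM df",
]
total_epsilon = sum(
    client.smartnoise_sql.cost(query=q, epsilon=epsilon, delta=delta).epsilon
    for q in separate_queries
)
print(f"Total epsilon if asked separately: {total_epsilon}")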

Hypothesis testing

Dr. Antartica now has the opportunity to study the various penguin dimensions and will do a hypothesis testing analysis to discover whether bill length differs between penguin species.

She will do a two-tailed two-sample \(t\) test.

  • The null hypothesis \(H_0\): bill length does not differ between species.

  • The alternative hypothesis \(H_a\): bill length does differ between species.

She sets the significance level at 0.05.
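For reference, the two-sample \(t\) statistic with pooled standard deviation \(s_p\) is

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}, \]

where \(n_i\), \(\bar{x}_i\) and \(s_i\) are the DP count, average and standard deviation of each group. The helper implemented below pools the standard deviations directly, a slight simplification of \(s_p\).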

[40]:
CRITICAL_VALUE = 0.05
[41]:
QUERY = "SELECT \
        species AS species, \
        COUNT(bill_length_mm) AS nb_penguin,  \
        AVG(bill_length_mm) AS avg_bill_length_mm, \
        STD(bill_length_mm) AS std_bill_length_mm \
        FROM df GROUP BY species"

She checks the remaining budget:

[42]:
client.get_remaining_budget()
[42]:
{'remaining_epsilon': 7.5, 'remaining_delta': 0.004980000049999984}

She estimates how much budget it would really cost:

[43]:
client.smartnoise_sql.cost(query = QUERY, epsilon = 1.0, delta = 1e-4)
[43]:
CostResponse(epsilon=3.0, delta=4.999999999999449e-05)

The real cost is 3 times the epsilon that she sets. It is a lot, but she first tries the query on the dummy dataset to verify that everything works properly.

[44]:
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
dummy_res
[44]:
QueryResponse(epsilon=0.30000000000000004, delta=4.999999969612645e-09, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df=     species  nb_penguin  avg_bill_length_mm  std_bill_length_mm
0     Adelie          29           52.132857            5.262345
1  Chinstrap          32           47.583090           47.323825
2     Gentoo          96           19.885250           23.327595))

She did not give enough budget for the query to return usable results: with so little epsilon, the noise overwhelms the true values (note the wildly varying counts and standard deviations in the output). She has to spend more budget for the query to be meaningful.

[45]:
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 7.5/3, delta = 1e-4, dummy = True)
dummy_res
[45]:
QueryResponse(epsilon=7.5, delta=4.999999999999449e-05, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df=     species  nb_penguin  avg_bill_length_mm  std_bill_length_mm
0     Adelie          27           47.188038           16.634112
1  Chinstrap          33           47.524834            5.514985
2     Gentoo          39           48.392045            6.535751))

If the query errors, she might need to re-run it a few times until it succeeds. Dummy queries do not affect the budget anyway.

[46]:
flipper_length_response = client.smartnoise_sql.query(query = QUERY, epsilon = 7.5/3, delta = 1e-4)

And now she should not have any remaining budget (she can verify this with get_remaining_budget() as before).

But she can still do her post-processing and hypothesis analysis.

[47]:
df_flipper = flipper_length_response.result.df
df_flipper
[47]:
     species  nb_penguin  avg_bill_length_mm  std_bill_length_mm
0     Adelie         151           38.370608            5.411184
1  Chinstrap          67           49.328525            3.933794
2     Gentoo         122           46.562629           10.145133

And finally the \(t\)-test:

[48]:
def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    # Pooled dispersion of the two groups (note: this pools the standard
    # deviations directly, a simplification of the usual pooled variance)
    pooled_std = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)
    # Two-sample t statistic
    return (avg_1 - avg_2)/np.sqrt(pooled_std**2*(1/nb_1 + 1/nb_2))
[49]:
nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]
[50]:
t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)

print(f"T test between species 0 and specie 1: {np.round(t_01, 2)}.  Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between species 1 and specie 2: {np.round(t_12, 2)}.   Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")
T test between species 0 and specie 1: -15.05.  Reject null hypothesis: True.
T test between species 0 and specie 2: -8.94. Reject null hypothesis: True.
T test between species 1 and specie 2: 2.29.   Reject null hypothesis: True.

All t-statistics are above the critical value of 0.05 in absolute value. She can reject the null hypothesis in each case.
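Strictly speaking, the rejection threshold of a two-tailed test at significance level 0.05 is a quantile of the \(t\) distribution (about 1.97 here), not 0.05 itself; all three statistics clear this stricter bar as well. A sketch, assuming scipy is available in her environment:

[ ]:
from scipy import stats

# Two-tailed critical value at alpha = 0.05 for the closest pair of species
alpha = 0.05
dof = nb_1 + nb_2 - 2  # degrees of freedom, Chinstrap vs Gentoo
t_crit = stats.t.ppf(1 - alpha / 2, dof)
print(f"Critical value: {np.round(t_crit, 2)}. Reject H0: {abs(t_12) > t_crit}.")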

She finally computes the confidence intervals for the bill length of each species:

[51]:
ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper
[51]:
     species  nb_penguin  avg_bill_length_mm  std_bill_length_mm  standard_error  ci_95_lower_bound  ci_95_upper_bound
0     Adelie         151           38.370608            5.411184        0.440356          37.507511          39.233706
1  Chinstrap          67           49.328525            3.933794        0.480589          48.386570          50.270481
2     Gentoo         122           46.562629           10.145133        0.918497          44.762374          48.362883