Using Smartnoise-SQL¶

This notebook showcases how researcher could use lomas platform with Smartnoise-SQL. It explains the different functionnalities provided by the lomas-client client library to interact with lomas server.

The secure data are never visible by researchers. They can only access to differentially private responses via queries to the server.

Each user has access to one or multiple projects and for each dataset has a limited budget $\epsilon$, $\delta$.

In this notebook the researcher is a penguin researcher named Dr. Antarctica. She aims to do a grounbdbreaking research on various penguins data.

Step 1: Install the library¶

To interact with the secure server on which the data is stored, Dr.Antartica first needs to install the library lomas-client on her local developping environment.

It can be installed via the pip command:

Copied!

# !pip install lomas_client
# !pip install lomas_client

Or using a local version of the client

Copied!

import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))

Copied!

from lomas_client.client import Client
from csvw_eo.csvw_to_smartnoise_sql import csvw_to_smartnoise_sql
import numpy as np
from lomas_client.client import Client
from csvw_eo.csvw_to_smartnoise_sql import csvw_to_smartnoise_sql
import numpy as np

Step 2: Initialise the client¶

Once the library is installed, a Client object must be created. It is responsible for sending sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server.

The client needs a few parameters to be created. Usually, these would be set in the environment by the system administrator (queen Icebergina) and be transparent to lomas users. In this instance, the following code snippet sets a few of these parameters that are specific to this notebook.

She will only be able to query on the real dataset if the administrator has previously made her an account in the database, given her access to the PENGUIN dataset and has given her some $\epsilon$, $\delta$ privacy loss budget.

Copied!





# The following would usually be set in the environment by a system administrator
# and be tranparent to lomas users.
# Uncomment them if you are running against a Kubernetes deployment.
# They have already been set for you if you are running locally within a devenv or the Jupyter lab set up by Docker compose.

import os
# os.environ["LOMAS_CLIENT_APP_URL"] = "https://lomas.example.com:443"
# os.environ["LOMAS_CLIENT_OIDC_DISCOVERY_URL"] = "https://dex.example.com:443/.well-known/openid-configuration"
# os.environ["LOMAS_CLIENT_TELEMETRY__ENABLED"] = "false"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_ENDPOINT"] = "http://otel.example.com:445"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_INSECURE"] = "true"
# os.environ["LOMAS_CLIENT_TELEMETRY__SERVICE_ID"] = "my-app-client"
# os.environ["LOMAS_CLIENT_REALM"] = "lomas"

# We set these ones because they are specific to this notebook.

os.environ["LOMAS_CLIENT_USER_NAME"] = "dr.antartica@example.com"
os.environ["LOMAS_CLIENT_USER_PASSWORD"] = "dr.antartica"
os.environ["LOMAS_CLIENT_DATASET_NAME"] = "PENGUIN"

# Note that all client settings can also be passed as keyword arguments to the Client constructor.
# eg. client = Client(user_name = "Dr.Antartica") takes precedence over setting the "LOMAS_CLIENT_USER_NAME"
# environment variable.
# The following would usually be set in the environment by a system administrator
# and be tranparent to lomas users.
# Uncomment them if you are running against a Kubernetes deployment.
# They have already been set for you if you are running locally within a devenv or the Jupyter lab set up by Docker compose.

import os
# os.environ["LOMAS_CLIENT_APP_URL"] = "https://lomas.example.com:443"
# os.environ["LOMAS_CLIENT_OIDC_DISCOVERY_URL"] = "https://dex.example.com:443/.well-known/openid-configuration"
# os.environ["LOMAS_CLIENT_TELEMETRY__ENABLED"] = "false"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_ENDPOINT"] = "http://otel.example.com:445"
# os.environ["LOMAS_CLIENT_TELEMETRY__COLLECTOR_INSECURE"] = "true"
# os.environ["LOMAS_CLIENT_TELEMETRY__SERVICE_ID"] = "my-app-client"
# os.environ["LOMAS_CLIENT_REALM"] = "lomas"

# We set these ones because they are specific to this notebook.

os.environ["LOMAS_CLIENT_USER_NAME"] = "dr.antartica@example.com"
os.environ["LOMAS_CLIENT_USER_PASSWORD"] = "dr.antartica"
os.environ["LOMAS_CLIENT_DATASET_NAME"] = "PENGUIN"

# Note that all client settings can also be passed as keyword arguments to the Client constructor.
# eg. client = Client(user_name = "Dr.Antartica") takes precedence over setting the "LOMAS_CLIENT_USER_NAME"
# environment variable.

Copied!

client = Client()
client = Client()

[09:18:56] WARNING  OIDC IdP or Lomas service configured without TLS -> using insecure transport  http_client.py:35

And that's it for the preparation. She is now ready to use the various functionnalities offered by lomas-client.

Step 3: Getting dataset metadata¶

Copied!

metadata = client.get_dataset_metadata()
metadata
metadata = client.get_dataset_metadata()
metadata

{'@context': ['http://www.w3.org/ns/csvw', 'csvw-eo-context.jsonld'],
 '@type': 'Table',
 'privacyUnit': None,
 'maxContributions': 1,
 'maxLength': 344,
 'publicLength': 344,
 'tableSchema': {'columns': [{'@type': 'Column',
    'name': 'species',
    'datatype': <DataTypes.STRING: 'string'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'keys': ['Adelie', 'Chinstrap', 'Gentoo'],
    'maxNumPartitions': 3},
   {'@type': 'Column',
    'name': 'island',
    'datatype': <DataTypes.STRING: 'string'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'keys': ['Torgersen', 'Biscoe', 'Dream'],
    'maxNumPartitions': 3},
   {'@type': 'Column',
    'name': 'bill_length_mm',
    'datatype': <DataTypes.FLOAT: 'float'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'minimum': 30.0,
    'maximum': 65.0},
   {'@type': 'Column',
    'name': 'bill_depth_mm',
    'datatype': <DataTypes.FLOAT: 'float'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'minimum': 13.0,
    'maximum': 23.0},
   {'@type': 'Column',
    'name': 'flipper_length_mm',
    'datatype': <DataTypes.FLOAT: 'float'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'minimum': 150.0,
    'maximum': 250.0},
   {'@type': 'Column',
    'name': 'body_mass_g',
    'datatype': <DataTypes.FLOAT: 'float'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'minimum': 2000.0,
    'maximum': 7000.0},
   {'@type': 'Column',
    'name': 'sex',
    'datatype': <DataTypes.STRING: 'string'>,
    'required': None,
    'privacyId': None,
    'nullableProportion': 0.0,
    'keys': ['MALE', 'FEMALE'],
    'maxNumPartitions': 2}]}}

Copied!

nb_penguin = metadata['publicLength']
print(f"Number of penguins: {nb_penguin}.")
nb_penguin = metadata['publicLength']
print(f"Number of penguins: {nb_penguin}.")

Number of penguins: 344.

Step 4: Average bill length with Smartnoise-SQL¶

Query dummy dataset¶

Now that she has an idea of what the data looks like, she wants to start querying the real dataset to for her research. However, before this, other tools are at her disposal to reduce potential error risks and avoid spending budget on irrelevant queries. Of course, this does not have any impact on the budget.

It is possible to specify the flag dummy=True in the various queries to perform the query on the dummy dataset instead of the real dataset and ensure that the queries are doing what is expected of them.

Copied!

# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"
# Average bill length in mm
QUERY = "SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df"

Copied!

EPSILON = 0.5
DELTA = 1e-5
EPSILON = 0.5
DELTA = 1e-5

Copied!





# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    dummy = True,
)
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    dummy = True,
)

Copied!

avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."

'Average bill length on dummy: 44.16mm.'

Estimate cost of a query¶

Dr. Antartica checks the budget that computing the average bill length will really cost her if she asks the query with an epsilon and a delta.

Copied!





cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = EPSILON, 
    delta = DELTA,
)
cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = EPSILON, 
    delta = DELTA,
)

Copied!

f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'

'This query would actually cost her 1.5 epsilon and 1.4999949999983109e-05 delta.'

This is actually twice as much as what she initially put in. In the background, Smartnoise-SQL decomposes the DP query in multiple other queries and the budget given as input is spent on each of these sub-queries. Here for the average, we need a sum divided by a count, hence EPSILON is spent once for the sum and then once more for the count. (see NOTE below for tips and explanation).

Overide DP mechanism¶

She wants to use another DP-mechanism for this query. She can change it via the mechanism argument. See Smartnoise-SQL documentation here for overriding mechanisms.

Copied!





# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = True,
)
# On the remote server dummy dataframe
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = True,
)

Copied!

avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."
avg_bl_dummy = np.round(dummy_res.result.df["avg_bill_length_mm"][0], 2)
f"Average bill length on dummy: {avg_bl_dummy}mm."

'Average bill length on dummy: 50.17mm.'

Copied!





cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = EPSILON, 
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"}
)
cost
cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = EPSILON, 
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"}
)
cost

CostResponse(epsilon=1.5, delta=2.49998000004803e-05)

Query real dataset¶

Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. The dummy flag is False by default, so setting it is optional. She uses the values of epsilon and delta that she selected just before.

Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query.

Copied!





avg_bill_length_response = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON, 
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = False
)
avg_bill_length_response = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON, 
    delta = DELTA,
    mechanisms = {"count": "gaussian", "sum_float": "laplace"},
    dummy = False
)

Copied!

avg_bill_length = avg_bill_length_response.result.df['avg_bill_length_mm'].iloc[0]
print(f"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.")
avg_bill_length = avg_bill_length_response.result.df['avg_bill_length_mm'].iloc[0]
print(f"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.")

Average bill length on private data: 42.36mm.

After each query on the real dataset, the budget informations are also returned to the researcher. It is possible possible to check the remaining budget again afterwards:

Postprocess¶

It is also possible to use the 'postprocess' argument from Smartnoise-SQL see its documentation here by specifying it in the query.

Copied!





dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = True,
    dummy = True,
)
dummy_res
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = True,
    dummy = True,
)
dummy_res

QueryResponse(epsilon=1.5, delta=1.4999949999983109e-05, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df=   avg_bill_length_mm
0           38.707424))

Copied!





dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = False,
    dummy = True,
)
dummy_res
dummy_res = client.smartnoise_sql.query(
    query = QUERY,  
    epsilon = EPSILON,
    delta = DELTA,
    postprocess = False,
    dummy = True,
)
dummy_res

QueryResponse(epsilon=1.5, delta=1.4999949999983109e-05, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df=       res_0        res_1       res_2
0  93.989285  4920.846654  101.370471))

Step 4: Penguin statistics¶

Confidence intervals for flipper length over the whole population¶

She is first interested to have a better idea of the distribution of bill length of all species. She already has the number of penguins (=number of rows as max_ids=1) from the metadata and the average bill length from step 3, so she just needs to compute the standard deviation. As it is just an exploration step, she uses very little budget values.

Copied!

QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"

She again first verifies that her query works on the dummy dataset:

Copied!





dummy_res = client.smartnoise_sql.query(
    query = QUERY, 
    epsilon = 0.5, 
    delta = 1e-5, 
    dummy = True
)
dummy_res = client.smartnoise_sql.query(
    query = QUERY, 
    epsilon = 0.5, 
    delta = 1e-5, 
    dummy = True
)

Copied!

dummy_std = np.round(dummy_res.result.df['std_bill_length_mm'].iloc[0], 2)
f"The dummy standard variation is {dummy_std}."
dummy_std = np.round(dummy_res.result.df['std_bill_length_mm'].iloc[0], 2)
f"The dummy standard variation is {dummy_std}."

'The dummy standard variation is 9.36.'

The syntax of the query works, now she checks the budget:

Copied!





cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = 0.5, 
    delta = 1e-5
)
cost = client.smartnoise_sql.cost(
    query = QUERY, 
    epsilon = 0.5, 
    delta = 1e-5
)

Copied!

f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'
f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'

'This query would actually cost her 2.0 epsilon and 1.4999949999983109e-05 delta.'

This time it is three times the budget because the standard deviation needs the average, then a difference and a count again.

Copied!





response = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = 0.5,
    delta = 1e-5
)
response = client.smartnoise_sql.query(
    query = QUERY,
    epsilon = 0.5,
    delta = 1e-5
)

Copied!

std_bill_length = response.result.df['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")
std_bill_length = response.result.df['std_bill_length_mm'].iloc[0]
print(f"Standard deviation of bill length: {np.round(std_bill_length, 2)}.")

Standard deviation of bill length: 3.02.

She can now do all the postprocessing that she wants with the returned data without increasing the privacy risk.

Copied!

# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")
# Get standard error
standard_error = std_bill_length/np.sqrt(nb_penguin)
print(f"Standard error of bill length: {np.round(standard_error, 2)}.")

Standard error of bill length: 0.16.

Copied!





 # Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")
 # Compute the 95% confidence interval
ZSCORE = 1.96
lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error
print(f"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].")

The 95% confidence interval of the bill length of all penguins is [42.04, 42.67].

Note on budget with Smartnoise-SQL (Advanced)¶

All of these queries will cost the same budget in Smartnoise-SQL. The reason is that the smartnoise-sql translates the input query in sub queries, finds the answer for each sub query for the budget in input and then assembles the results. For the first 'standard deviation' query, it requires a count, an average, and only then the computation for the standard deviation. Hence, to save budget it is better to make a general query directly and retrieve all the 'sub-answers'.

Copied!

epsilon = 1.0
delta = 1e-5
epsilon = 1.0
delta = 1e-5

Copied!

QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
QUERY = "SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost

CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)

Copied!

QUERY = "SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
QUERY = "SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost

CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)

Copied!

QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost

CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)

A way to know the sub-queries of a query is to use the following Smartnoise-SQL code:

Copied!

# Convert metadata to Smartnoise-SQL compliant metadata
snsql_metadata = csvw_to_smartnoise_sql(metadata)
snsql_metadata
# Convert metadata to Smartnoise-SQL compliant metadata
snsql_metadata = csvw_to_smartnoise_sql(metadata)
snsql_metadata

{'': {'': {'df': {'max_ids': 1,
    'rows': 344,
    'species': {'name': 'species', 'type': 'string', 'nullable': True},
    'island': {'name': 'island', 'type': 'string', 'nullable': True},
    'bill_length_mm': {'name': 'bill_length_mm',
     'type': 'float',
     'nullable': True,
     'lower': 30.0,
     'upper': 65.0},
    'bill_depth_mm': {'name': 'bill_depth_mm',
     'type': 'float',
     'nullable': True,
     'lower': 13.0,
     'upper': 23.0},
    'flipper_length_mm': {'name': 'flipper_length_mm',
     'type': 'float',
     'nullable': True,
     'lower': 150.0,
     'upper': 250.0},
    'body_mass_g': {'name': 'body_mass_g',
     'type': 'float',
     'nullable': True,
     'lower': 2000.0,
     'upper': 7000.0},
    'sex': {'name': 'sex', 'type': 'string', 'nullable': True},
    'row_privacy': True}}}}

Copied!

# Write the query to inspect
QUERY = "SELECT STD(bill_length_mm) as std_bl FROM df"
#QUERY = "SELECT COUNT(*) as nb_row FROM df"
# Write the query to inspect
QUERY = "SELECT STD(bill_length_mm) as std_bl FROM df"
#QUERY = "SELECT COUNT(*) as nb_row FROM df"

Copied!





from snsql.sql.private_rewriter import Rewriter
rewriter = Rewriter(snsql_metadata)
rewriter.options.row_privacy = snsql_metadata[""][""]["df"]["row_privacy"]
rewriter.options.max_contrib = snsql_metadata[""][""]["df"]["max_ids"]
dp_query = rewriter.query(QUERY)
dp_query
from snsql.sql.private_rewriter import Rewriter
rewriter = Rewriter(snsql_metadata)
rewriter.options.row_privacy = snsql_metadata[""][""]["df"]["row_privacy"]
rewriter.options.max_contrib = snsql_metadata[""][""]["df"]["max_ids"]
dp_query = rewriter.query(QUERY)
dp_query

<snsql._ast.ast.Query at 0x7c6c94ab1250>

The original dp query is represented as one query:

Copied!

dp_query._named_symbols
dp_query._named_symbols

{'std_bl': <snsql._ast.tokens.Symbol at 0x7c6c94aa8550>}

But has 4 named symbols inside: 2 alias for the 2 SQL subqueries

'keycount' for 'count_bill_length_mm',
'sum_alias_0xxxx' for 'sum_bill_length_mm'

Copied!

subquery = dp_query.source.relations[0].primary.query
syms = subquery._named_symbols
syms
subquery = dp_query.source.relations[0].primary.query
syms = subquery._named_symbols
syms

{'keycount': <snsql._ast.tokens.Symbol at 0x7c6ca85993d0>,
 'sum_alias_0xf3d7': <snsql._ast.tokens.Symbol at 0x7c6ca87c2990>,
 'count_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7c6ca87c2d00>,
 'sum_bill_length_mm': <snsql._ast.tokens.Symbol at 0x7c6c94aa8370>}

This last query with group_by will cost the same because max_ids=1 (a penguin appears in the dataset at most once) and so the group_by is applied on different partitions of the population.

Copied!

QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost
QUERY = "SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species"
cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)
cost

CostResponse(epsilon=4.0, delta=1.4999949999983109e-05)

NOTE: in the current code of Smartnoise-SQL, there is no odometer. Meaning all queries are independant. If someone first queries the private dataset for a count, then a second time for the average and then for the standard deviation then the total cost will be added: 3 count + 2 average + 1 std. That's why it is better to do everything in one query.

Hypothesis testing¶

So, Dr. Antartica has now the opportunity to study the various penguins dimensions and will do an hypotheses testing analysis to discover if flipper length differ between the penguins species.

She will do a two-tailed two-sample $t$ test.

The null hypothese $H_0$ is flipper length does not differ between species.
The alternative hypothesis $H_a$ is flipper length does differ between species.

She set the level of significance at 0.05.

Copied!

CRITICAL_VALUE = 0.05
CRITICAL_VALUE = 0.05

Copied!





QUERY = "SELECT \
        species AS species, \
        COUNT(bill_length_mm) AS nb_penguin,  \
        AVG(bill_length_mm) AS avg_bill_length_mm, \
        STD(bill_length_mm) AS std_bill_length_mm \
        FROM df GROUP BY species"
QUERY = "SELECT \
        species AS species, \
        COUNT(bill_length_mm) AS nb_penguin,  \
        AVG(bill_length_mm) AS avg_bill_length_mm, \
        STD(bill_length_mm) AS std_bill_length_mm \
        FROM df GROUP BY species"

She checks the remaining budget:

Copied!

client.get_remaining_budget()
client.get_remaining_budget()

RemainingBudgetResponse(remaining_epsilon=46.5, remaining_delta=0.004960000249999537)

She estimates how much budget it would really cost:

Copied!

client.smartnoise_sql.cost(query = QUERY, epsilon = 1.0, delta = 1e-4)
client.smartnoise_sql.cost(query = QUERY, epsilon = 1.0, delta = 1e-4)

CostResponse(epsilon=4.0, delta=0.00014999500000001387)

The real cost seems to be 3 times the epsilon that she sets. It is a lot but she tries on the dummy dataset to verify all is working properly.

raises-exception

Copied!

dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
dummy_res
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
dummy_res

---------------------------------------------------------------------------
ExternalLibraryException                  Traceback (most recent call last)
Cell In[43], line 1
----> 1 dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)
      2 dummy_res

File ~/Desktop/lomas/client/lomas_client/libraries/smartnoise_sql.py:119, in SmartnoiseSQLClient.query(self, query, epsilon, delta, mechanisms, postprocess, dummy, nb_rows, seed)
    116 body = request_model.model_validate(body_dict)
    117 res = self.http_client.post(endpoint, body)
--> 119 return validate_model_response(self.http_client, res, QueryResponse)

File ~/Desktop/lomas/client/lomas_client/utils.py:69, in validate_model_response(client, response, response_model)
     67 if job.status == "failed":
     68     assert job.error is not None, f"job {job_uid} failed without error !"
---> 69     raise_error_from_model(job.error)
     71 return response_model.model_validate(job.result)

File ~/Desktop/lomas/core/lomas_core/error_handler.py:150, in raise_error_from_model(error_model)
    148     raise InvalidQueryException(error_model.message)
    149 case ExternalLibraryExceptionModel():
--> 150     raise ExternalLibraryException(error_model.library, error_model.message)
    151 case UnauthorizedAccessExceptionModel():
    152     raise UnauthorizedAccessException(error_model.message)

ExternalLibraryException: (<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, 'SQL Reader generated empty results. Epsilon: 0.1 and Delta: 1e-08 are too small to generate output.')

She did not give enough budget for the query to work. This is why there are 'NANs' in the output. She has to spend more budget for the query to work.

Copied!

dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4, dummy = True)

dummy_res
dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4, dummy = True)

dummy_res

QueryResponse(epsilon=20.0, delta=0.00014999500000001387, requested_by='Dr.Antartica', result=SmartnoiseSQLQueryResult(type=<DPLibraries.SMARTNOISE_SQL: 'smartnoise_sql'>, df=     species  nb_penguin  avg_bill_length_mm  std_bill_length_mm
0     Adelie          26           48.965521            9.278141
1  Chinstrap          32           47.942201           10.567298
2     Gentoo          40           46.568333           10.015278))

If it errors, she might need to re-run the query a few times until it works. The budget is not affected by dummy queries anyway.

Copied!





from lomas_core.error_handler import InvalidQueryException

flipper_length_response = None

for i in range(0, 10):
    try:
        flipper_length_response = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4)
        break
    except InvalidQueryException:
        print(f"Attempt number {i + 1} failed.")
        continue

if flipper_length_response is None:
    print("10 unsuccessfull query attempts.")
from lomas_core.error_handler import InvalidQueryException

flipper_length_response = None

for i in range(0, 10):
    try:
        flipper_length_response = client.smartnoise_sql.query(query = QUERY, epsilon = 5, delta = 1e-4)
        break
    except InvalidQueryException:
        print(f"Attempt number {i + 1} failed.")
        continue

if flipper_length_response is None:
    print("10 unsuccessfull query attempts.")

And now she should not have any remaining budget:

But she can do her post-processing and hypothesis analysis.

Copied!

df_flipper = flipper_length_response.result.df
df_flipper
df_flipper = flipper_length_response.result.df
df_flipper

	species	nb_penguin	avg_bill_length_mm	std_bill_length_mm
0	Adelie	150	38.841433	3.187632
1	Chinstrap	68	48.653061	5.506999
2	Gentoo	123	47.730977	1.975289

And finally the $t$-test:

Copied!

def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    standard_error = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)
    return (avg_1 - avg_2)/np.sqrt(standard_error**2*(1/nb_1 + 1 /nb_2))
def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):
    standard_error = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)
    return (avg_1 - avg_2)/np.sqrt(standard_error**2*(1/nb_1 + 1 /nb_2))

Copied!

nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]
nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]
nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]
nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]

Copied!





t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)

print(f"T test between species 0 and specie 1: {np.round(t_01, 2)}.  Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between species 1 and specie 2: {np.round(t_12, 2)}.   Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")
t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)
t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)
t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)

print(f"T test between species 0 and specie 1: {np.round(t_01, 2)}.  Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.")
print(f"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.")
print(f"T test between species 1 and specie 2: {np.round(t_12, 2)}.   Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.")

T test between species 0 and specie 1: -17.18.  Reject null hypothesis: True.
T test between species 0 and specie 2: -27.66. Reject null hypothesis: True.
T test between species 1 and specie 2: 1.89.   Reject null hypothesis: True.

All t-tests are above the critical value of 0.5. She can reject the null hypothesis.

She finally computes the confidence intervals for the flipper length of each species

Copied!





ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper
ZSCORE = 1.96
df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])
df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']
df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']
df_flipper

	species	nb_penguin	avg_bill_length_mm	std_bill_length_mm	standard_error	ci_95_lower_bound	ci_95_upper_bound
0	Adelie	150	38.841433	3.187632	0.260269	38.331305	39.351560
1	Chinstrap	68	48.653061	5.506999	0.667822	47.344131	49.961992
2	Gentoo	123	47.730977	1.975289	0.178106	47.381890	48.080065