{ "cells": [ { "cell_type": "markdown", "id": "3f18d338", "metadata": {}, "source": [ "# Lomas Client Side: Using Smartnoise-SQL" ] }, { "cell_type": "markdown", "id": "1582a2ae", "metadata": {}, "source": [ "This notebook showcases how researcher could use lomas platform with Smartnoise-SQL. It explains the different functionnalities provided by the `lomas-client` client library to interact with lomas server.\n", "\n", "The secure data are never visible by researchers. They can only access to differentially private responses via queries to the server.\n", "\n", "Each user has access to one or multiple projects and for each dataset has a limited budget $\\epsilon$, $\\delta$." ] }, { "cell_type": "markdown", "id": "5b73135c", "metadata": {}, "source": [ "In this notebook the researcher is a penguin researcher named Dr. Antarctica. She aims to do a grounbdbreaking research on various penguins data." ] }, { "cell_type": "markdown", "id": "01ae30d2", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, Dr.Antartica first needs to install the library `lomas-client` on her local developping environment. \n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": 1, "id": "6f5d749c-0f39-4f78-8157-528bc39764b2", "metadata": {}, "outputs": [], "source": [ "# !pip install lomas_client" ] }, { "cell_type": "markdown", "id": "53cf3204-18a8-423c-9de2-c2966fdf84fb", "metadata": {}, "source": [ "Or using a local version of the client" ] }, { "cell_type": "code", "execution_count": 2, "id": "98b4013c-ea93-4e4d-8885-15aac0039c12", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "sys.path.append(os.path.abspath(os.path.join('..')))" ] }, { "cell_type": "code", "execution_count": 3, "id": "9d96dcd7", "metadata": {}, "outputs": [], "source": [ "from lomas_client.client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "9c63718b", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a Client object must be created. It is responsible for sending sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server. \n", "\n", "To create the client, Dr. Antartica needs to give it a few parameters:\n", "- a url: the root application endpoint to the remote secure server.\n", "- user_name: her name as registered in the database (Dr. Alice Antartica)\n", "- dataset_name: the name of the dataset that she wants to query (PENGUIN)\n", "\n", "She will only be able to query on the real dataset if the administrator has previously made her an account in the database, given her access to the PENGUIN dataset and has given her some $\\epsilon$, $\\delta$ privacy loss budget." ] }, { "cell_type": "code", "execution_count": 4, "id": "941991f7", "metadata": {}, "outputs": [], "source": [ "APP_URL = \"http://lomas_server\"\n", "USER_NAME = \"Dr. Antartica\"\n", "DATASET_NAME = \"PENGUIN\"\n", "client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)" ] }, { "cell_type": "markdown", "id": "0ec400c8", "metadata": {}, "source": [ "And that's it for the preparation. She is now ready to use the various functionnalities offered by `lomas-client`." ] }, { "cell_type": "markdown", "id": "9b9a5f13", "metadata": {}, "source": [ "## Step 3: Getting dataset metadata" ] }, { "cell_type": "code", "execution_count": 5, "id": "d15cbe39", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 344,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metadata = client.get_dataset_metadata()\n", "metadata" ] }, { "cell_type": "code", "execution_count": 6, "id": "ba329ffc-3eaa-4fdd-b526-c1b59c71ed3f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of penguins: 344.\n" ] } ], "source": [ "nb_penguin = metadata['rows']\n", "print(f\"Number of penguins: {nb_penguin}.\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "90e3edb2-54b1-476f-b362-a83e20084a74", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = metadata[\"columns\"].keys()\n", "columns" ] }, { "cell_type": "markdown", "id": "5bf5b471-1495-4046-bec1-ddf96c98642f", "metadata": {}, "source": [ "## Step 4: Average bill length with Smartnoise-SQL" ] }, { "cell_type": "markdown", "id": "69dac96e", "metadata": {}, "source": [ "### Query dummy dataset\n", "\n", "Now that she has an idea of what the data looks like, she wants to start querying the real dataset to for her research. However, before this, other tools are at her disposal to reduce potential error risks and avoid spending budget on irrelevant queries. Of course, this does not have any impact on the budget.\n", "\n", "It is possible to specify the flag `dummy=True` in the various queries to perform the query on the dummy dataset instead of the real dataset and ensure that the queries are doing what is expected of them. " ] }, { "cell_type": "code", "execution_count": 8, "id": "3946425d", "metadata": {}, "outputs": [], "source": [ "# Average bill length in mm\n", "QUERY = \"SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df\"" ] }, { "cell_type": "code", "execution_count": 9, "id": "99494f15-727d-4d03-a099-5cfe5a0c8a27", "metadata": {}, "outputs": [], "source": [ "EPSILON = 0.5\n", "DELTA = 1e-5" ] }, { "cell_type": "code", "execution_count": 10, "id": "90cf2a6d", "metadata": {}, "outputs": [], "source": [ "# On the remote server dummy dataframe\n", "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON,\n", " delta = DELTA,\n", " dummy = True,\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "f3a736f7-be77-4214-8f77-6abc7db34793", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Average bill length on dummy: 45.27mm.'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg_bl_dummy = np.round(dummy_res.result.df[\"avg_bill_length_mm\"][0], 2)\n", "f\"Average bill length on dummy: {avg_bl_dummy}mm.\"" ] }, { "cell_type": "markdown", "id": "b746374c", "metadata": {}, "source": [ "### Estimate cost of a query\n", "Dr. Antartica checks the budget that computing the average bill length will really cost her if she asks the query with an `epsilon` and a `delta`." ] }, { "cell_type": "code", "execution_count": 12, "id": "133020c6", "metadata": {}, "outputs": [], "source": [ "cost = client.smartnoise_sql.cost(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "id": "ff19802d-cb39-48ee-9874-340a4bf2cc31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This query would actually cost her 1.0 epsilon and 5.000000000032756e-06 delta.'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'" ] }, { "cell_type": "markdown", "id": "c255d210-7ba1-4152-8a30-97c7289dd361", "metadata": {}, "source": [ "This is actually twice as much as what she initially put in. In the background, Smartnoise-SQL decomposes the DP query in multiple other queries and the budget given as input is spent on each of these sub-queries. Here for the average, we need a sum divided by a count, hence `EPSILON` is spent once for the sum and then once more for the count. (see NOTE below for tips and explanation)." ] }, { "cell_type": "markdown", "id": "4ec31515-39fe-426b-8339-fc2ac9c1e09e", "metadata": {}, "source": [ "### Overide DP mechanism" ] }, { "cell_type": "markdown", "id": "24b060d0-3c6f-4d35-824f-347ec5103723", "metadata": {}, "source": [ "She wants to use another DP-mechanism for this query. She can change it via the `mechanism` argument. See Smartnoise-SQL documentation [here for overriding mechanisms](https://docs.smartnoise.org/sql/advanced.html#overriding-mechanisms)." ] }, { "cell_type": "code", "execution_count": 14, "id": "1f726ce8-2e3d-462a-bbd8-598198935bc9", "metadata": {}, "outputs": [], "source": [ "# On the remote server dummy dataframe\n", "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON,\n", " delta = DELTA,\n", " mechanisms = {\"count\": \"gaussian\", \"sum_float\": \"laplace\"},\n", " dummy = True,\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "id": "46e064f0-f1e2-49af-8f14-fde44f981813", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Average bill length on dummy: 41.37mm.'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "avg_bl_dummy = np.round(dummy_res.result.df[\"avg_bill_length_mm\"][0], 2)\n", "f\"Average bill length on dummy: {avg_bl_dummy}mm.\"" ] }, { "cell_type": "code", "execution_count": 16, "id": "7e20014d-ad82-4a2d-88d9-ec981150e7db", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.0, delta=1.4999949999983109e-05)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost = client.smartnoise_sql.cost(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA,\n", " mechanisms = {\"count\": \"gaussian\", \"sum_float\": \"laplace\"}\n", ")\n", "cost" ] }, { "cell_type": "markdown", "id": "e5379edf", "metadata": {}, "source": [ "### Query real dataset\n", "Dr. Antartica is ready to query the real dataset and get a differentially private response for the average bill length. The `dummy` flag is False by default, so setting it is optional. She uses the values of `epsilon` and `delta` that she selected just before.\n", "\n", "Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query." ] }, { "cell_type": "code", "execution_count": 17, "id": "69767fac", "metadata": {}, "outputs": [], "source": [ "avg_bill_length_response = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA,\n", " mechanisms = {\"count\": \"gaussian\", \"sum_float\": \"laplace\"},\n", " dummy = False\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "id": "6dbbdf93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average bill length on private data: 43.5mm.\n" ] } ], "source": [ "avg_bill_length = avg_bill_length_response.result.df['avg_bill_length_mm'].iloc[0]\n", "print(f\"Average bill length on private data: {np.round(avg_bill_length, 2)}mm.\")" ] }, { "cell_type": "markdown", "id": "b2767e65", "metadata": {}, "source": [ "After each query on the real dataset, the budget informations are also returned to the researcher. It is possible possible to check the remaining budget again afterwards:" ] }, { "cell_type": "markdown", "id": "1472b825-bcea-458f-930e-41ff0f5d5f93", "metadata": {}, "source": [ "### Postprocess " ] }, { "cell_type": "markdown", "id": "ab34449e-7456-4e5e-b5bb-4231204c4d7e", "metadata": {}, "source": [ "It is also possible to use the 'postprocess' argument from Smartnoise-SQL [see its documentation here](https://docs.smartnoise.org/sql/advanced.html#postprocess) by specifying it in the query." ] }, { "cell_type": "code", "execution_count": 19, "id": "50c38d09-32ea-4269-9ca7-eacfd1d9ad96", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "QueryResponse(epsilon=1.0, delta=5.000000000032756e-06, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df= avg_bill_length_mm\n", "0 50.016594))" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON,\n", " delta = DELTA,\n", " postprocess = True,\n", " dummy = True,\n", ")\n", "dummy_res" ] }, { "cell_type": "code", "execution_count": 20, "id": "df6f2526-612e-4f00-b15a-c0433573e652", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "QueryResponse(epsilon=1.0, delta=5.000000000032756e-06, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df= res_0 res_1\n", "0 4304.177184 101.480531))" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON,\n", " delta = DELTA,\n", " postprocess = False,\n", " dummy = True,\n", ")\n", "dummy_res" ] }, { "cell_type": "markdown", "id": "04929993", "metadata": {}, "source": [ "## Step 4: Penguin statistics" ] }, { "cell_type": "markdown", "id": "bbbca191", "metadata": {}, "source": [ "### Confidence intervals for flipper length over the whole population" ] }, { "cell_type": "markdown", "id": "9d41bd58", "metadata": {}, "source": [ "She is first interested to have a better idea of the distribution of bill length of all species. She already has the number of penguins (=number of rows as `max_ids=1`) from the metadata and the average bill length from step 3, so she just needs to compute the standard deviation. As it is just an exploration step, she uses very little budget values." ] }, { "cell_type": "code", "execution_count": 21, "id": "04b376ef", "metadata": {}, "outputs": [], "source": [ "QUERY = \"SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df\"" ] }, { "cell_type": "markdown", "id": "0b041c81", "metadata": {}, "source": [ "She again first verifies that her query works on the dummy dataset:" ] }, { "cell_type": "code", "execution_count": 22, "id": "5aa9c304", "metadata": {}, "outputs": [], "source": [ "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = 0.5, \n", " delta = 1e-5, \n", " dummy = True\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "id": "49e4ba47-adf3-471b-a35b-c44346ed12a8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The dummy standard variation is 10.09.'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_std = np.round(dummy_res.result.df['std_bill_length_mm'].iloc[0], 2)\n", "f\"The dummy standard variation is {dummy_std}.\"" ] }, { "cell_type": "markdown", "id": "74f68994", "metadata": {}, "source": [ "The syntax of the query works, now she checks the budget:" ] }, { "cell_type": "code", "execution_count": 24, "id": "a8fa2c49", "metadata": {}, "outputs": [], "source": [ "cost = client.smartnoise_sql.cost(\n", " query = QUERY, \n", " epsilon = 0.5, \n", " delta = 1e-5\n", ")" ] }, { "cell_type": "code", "execution_count": 25, "id": "b3aa05ca-3243-4415-a8ec-fb5ad47d244d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This query would actually cost her 1.5 epsilon and 5.000000000032756e-06 delta.'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f'This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.'" ] }, { "cell_type": "markdown", "id": "884f0337-a960-460e-8797-84ddd77974a3", "metadata": {}, "source": [ "This time it is three times the budget because the standard deviation needs the average, then a difference and a count again. " ] }, { "cell_type": "code", "execution_count": 26, "id": "534979fb", "metadata": {}, "outputs": [], "source": [ "response = client.smartnoise_sql.query(\n", " query = QUERY,\n", " epsilon = 0.5,\n", " delta = 1e-5\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "id": "674332e7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard deviation of bill length: 6.82.\n" ] } ], "source": [ "std_bill_length = response.result.df['std_bill_length_mm'].iloc[0]\n", "print(f\"Standard deviation of bill length: {np.round(std_bill_length, 2)}.\")" ] }, { "cell_type": "markdown", "id": "367081be-1159-45d8-9129-88fba20fb697", "metadata": {}, "source": [ "She can now do all the postprocessing that she wants with the returned data without increasing the privacy risk. " ] }, { "cell_type": "code", "execution_count": 28, "id": "f72b19d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard error of bill length: 0.37.\n" ] } ], "source": [ "# Get standard error\n", "standard_error = std_bill_length/np.sqrt(nb_penguin)\n", "print(f\"Standard error of bill length: {np.round(standard_error, 2)}.\")" ] }, { "cell_type": "code", "execution_count": 29, "id": "62630a03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The 95% confidence interval of the bill length of all penguins is [42.78, 44.22].\n" ] } ], "source": [ " # Compute the 95% confidence interval\n", "ZSCORE = 1.96\n", "lower_bound, upper_bound = avg_bill_length - ZSCORE*standard_error, avg_bill_length + ZSCORE*standard_error\n", "print(f\"The 95% confidence interval of the bill length of all penguins is [{np.round(lower_bound, 2)}, {np.round(upper_bound, 2)}].\")" ] }, { "cell_type": "markdown", "id": "26d04824-ff41-4d25-8a4e-1506a416dd0b", "metadata": {}, "source": [ "## Note on budget with Smartnoise-SQL (Advanced)" ] }, { "cell_type": "markdown", "id": "c9aa0b56-bda3-405e-9f33-ae7135dbfeba", "metadata": {}, "source": [ "All of these queries will cost the same budget in Smartnoise-SQL. The reason is that the smartnoise-sql translates the input query in sub queries, finds the answer for each sub query for the budget in input and then assembles the results. For the first 'standard deviation' query, it requires a count, an average, and only then the computation for the standard deviation. Hence, to save budget it is better to make a general query directly and retrieve all the 'sub-answers'." ] }, { "cell_type": "code", "execution_count": 30, "id": "611df7d2-86eb-4710-a6eb-a3de214ece37", "metadata": {}, "outputs": [], "source": [ "epsilon = 1.0\n", "delta = 1e-5" ] }, { "cell_type": "code", "execution_count": 31, "id": "32b76d26-edce-4cf9-bab9-bf1ea936d288", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=3.0, delta=5.000000000032756e-06)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QUERY = \"SELECT STD(bill_length_mm) AS std_bill_length_mm FROM df\"\n", "cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)\n", "cost" ] }, { "cell_type": "code", "execution_count": 32, "id": "f84411ed-dab5-4acc-ab49-bfec9ebc3530", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=3.0, delta=5.000000000032756e-06)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QUERY = \"SELECT AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df\"\n", "cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)\n", "cost" ] }, { "cell_type": "code", "execution_count": 33, "id": "2454db71-4074-46dd-a863-c690c0160c51", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=3.0, delta=5.000000000032756e-06)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QUERY = \"SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df\"\n", "cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)\n", "cost" ] }, { "cell_type": "markdown", "id": "73bd85ca-eed0-488f-807e-6f03f99898cb", "metadata": {}, "source": [ "A way to know the sub-queries of a query is to use the following Smartnoise-SQL code:" ] }, { "cell_type": "code", "execution_count": 34, "id": "5b51cf35-68db-4b11-acbe-8df15b826d10", "metadata": {}, "outputs": [], "source": [ "# Convert metadata to Smartnoise-SQL compliant metadata\n", "metadata = dict(metadata)\n", "metadata.update(metadata[\"columns\"])\n", "del metadata[\"columns\"]\n", "snsql_metadata = {\"\": {\"\": {\"df\": metadata}}}" ] }, { "cell_type": "code", "execution_count": 35, "id": "7ab04c8f-8d79-4871-bc16-1f0368fbd403", "metadata": {}, "outputs": [], "source": [ "# Write the query to inspect\n", "QUERY = \"SELECT STD(bill_length_mm) as std_bl FROM df\"\n", "#QUERY = \"SELECT COUNT(*) as nb_row FROM df\"" ] }, { "cell_type": "code", "execution_count": 36, "id": "a78d7d86-ab95-4521-b84d-49ac795316c3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from snsql.sql.private_rewriter import Rewriter\n", "rewriter = Rewriter(snsql_metadata)\n", "rewriter.options.row_privacy = metadata[\"row_privacy\"]\n", "rewriter.options.max_contrib = metadata[\"max_ids\"]\n", "dp_query = rewriter.query(QUERY)\n", "dp_query" ] }, { "cell_type": "markdown", "id": "2df6bf8c-d06e-4b5c-9509-b2ba01fef581", "metadata": {}, "source": [ "The original dp query is represented as one query:" ] }, { "cell_type": "code", "execution_count": 37, "id": "251b773a-864c-4852-ae89-1472ac768975", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'std_bl': }" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp_query._named_symbols" ] }, { "cell_type": "markdown", "id": "c5830777-95fc-432e-b4c5-6bd59aac514f", "metadata": {}, "source": [ "But has 4 named symbols inside: 2 alias for the 2 SQL subqueries \n", "- 'keycount' for 'count_bill_length_mm',\n", "- 'sum_alias_0xxxx' for 'sum_bill_length_mm'" ] }, { "cell_type": "code", "execution_count": 38, "id": "f4ac4261-e870-4f07-8264-9a2041a35abc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'keycount': ,\n", " 'sum_alias_0xd1a8': ,\n", " 'count_bill_length_mm': ,\n", " 'sum_bill_length_mm': }" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subquery = dp_query.source.relations[0].primary.query\n", "syms = subquery._named_symbols\n", "syms" ] }, { "cell_type": "markdown", "id": "cc07d8c4-153f-4ad3-a977-a971b94d75aa", "metadata": {}, "source": [ "This last query with `group_by` will cost the same because `max_ids=1` (a penguin appears in the dataset at most once) and so the `group_by` is applied on different partitions of the population." ] }, { "cell_type": "code", "execution_count": 39, "id": "5b69f3f2-07dd-48b8-9cd5-64eee53331f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=3.0, delta=5.000000000032756e-06)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "QUERY = \"SELECT COUNT(bill_length_mm) AS count_bl, AVG(bill_length_mm) AS avg_bl, STD(bill_length_mm) as std_bl FROM df GROUP BY species\"\n", "cost = client.smartnoise_sql.cost(query = QUERY, epsilon = epsilon, delta = delta)\n", "cost" ] }, { "cell_type": "markdown", "id": "e20c4673-2c7b-44d5-bd7f-be88d6432a70", "metadata": {}, "source": [ "NOTE: in the current code of Smartnoise-SQL, there is no odometer. Meaning all queries are independant. If someone first queries the private dataset for a count, then a second time for the average and then for the standard deviation then the total cost will be added: 3 count + 2 average + 1 std. That's why it is better to do everything in one query." ] }, { "cell_type": "markdown", "id": "b5ee7ad2", "metadata": {}, "source": [ "### Hypothesis testing" ] }, { "cell_type": "markdown", "id": "0f1a3a8d", "metadata": {}, "source": [ "So, Dr. Antartica has now the opportunity to study the various penguins dimensions and will do an hypotheses testing analysis to discover if flipper length differ between the penguins species.\n", "\n", "She will do a two-tailed two-sample $t$ test.\n", "\n", "- The null hypothese $H_0$ is flipper length does not differ between species.\n", "- The alternative hypothesis $H_a$ is flipper length does differ between species.\n", "\n", "She set the level of significance at 0.05." ] }, { "cell_type": "code", "execution_count": 40, "id": "7d9ae766-4c0d-4dc5-9c9a-5f7eb99718f9", "metadata": {}, "outputs": [], "source": [ "CRITICAL_VALUE = 0.05" ] }, { "cell_type": "code", "execution_count": 41, "id": "5006201d", "metadata": {}, "outputs": [], "source": [ "QUERY = \"SELECT \\\n", " species AS species, \\\n", " COUNT(bill_length_mm) AS nb_penguin, \\\n", " AVG(bill_length_mm) AS avg_bill_length_mm, \\\n", " STD(bill_length_mm) AS std_bill_length_mm \\\n", " FROM df GROUP BY species\"" ] }, { "cell_type": "markdown", "id": "37ce4596-7843-48dd-86cb-fb34b227db0e", "metadata": {}, "source": [ "She checks the remaining budget:" ] }, { "cell_type": "code", "execution_count": 42, "id": "814883fa-a45a-43f2-852d-d5380beff8c0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'remaining_epsilon': 7.5, 'remaining_delta': 0.004980000049999984}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "markdown", "id": "e725eb3f-d12f-4f62-8f57-06fb00639f91", "metadata": {}, "source": [ "She estimates how much budget it would really cost:" ] }, { "cell_type": "code", "execution_count": 43, "id": "0255550b-7fd2-4244-a8eb-da809ddc6a5b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=3.0, delta=4.999999999999449e-05)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.smartnoise_sql.cost(query = QUERY, epsilon = 1.0, delta = 1e-4)" ] }, { "cell_type": "markdown", "id": "56bf804f-d877-48cb-b405-709b30cda3d1", "metadata": {}, "source": [ "The real cost seems to be 3 times the epsilon that she sets. It is a lot but she tries on the dummy dataset to verify all is working properly." ] }, { "cell_type": "code", "execution_count": 44, "id": "80d9933b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "QueryResponse(epsilon=0.30000000000000004, delta=4.999999969612645e-09, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df= species nb_penguin avg_bill_length_mm std_bill_length_mm\n", "0 Adelie 29 52.132857 5.262345\n", "1 Chinstrap 32 47.583090 47.323825\n", "2 Gentoo 96 19.885250 23.327595))" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 0.1, delta = 1e-8, dummy = True)\n", "dummy_res" ] }, { "cell_type": "markdown", "id": "5691680f-8716-4a99-999a-a2bd5ef6a679", "metadata": {}, "source": [ "She did not give enough budget for the query to work. This is why there are 'NANs' in the output. She has to spend more budget for the query to work." ] }, { "cell_type": "code", "execution_count": 45, "id": "0e07fde9-9430-4a12-8337-0503ac162c26", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "QueryResponse(epsilon=7.5, delta=4.999999999999449e-05, requested_by='Dr. Antartica', result=SmartnoiseSQLQueryResult(res_type='sn_sql', df= species nb_penguin avg_bill_length_mm std_bill_length_mm\n", "0 Adelie 27 47.188038 16.634112\n", "1 Chinstrap 33 47.524834 5.514985\n", "2 Gentoo 39 48.392045 6.535751))" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_res = client.smartnoise_sql.query(query = QUERY, epsilon = 7.5/3, delta = 1e-4, dummy = True)\n", "dummy_res" ] }, { "cell_type": "markdown", "id": "e8dadd17", "metadata": {}, "source": [ "If it errors, she might need to re-run the query a few times until it works. The budget is not affected by dummy queries anyway." ] }, { "cell_type": "code", "execution_count": 46, "id": "59f2d665", "metadata": {}, "outputs": [], "source": [ "flipper_length_response = client.smartnoise_sql.query(query = QUERY, epsilon = 7.5/3, delta = 1e-4)" ] }, { "cell_type": "markdown", "id": "78d23e66-2c05-4e35-bba9-7b92710b872a", "metadata": {}, "source": [ "And now she should not have any remaining budget:" ] }, { "cell_type": "markdown", "id": "cb96f406-d409-4531-ac86-05f1c9296705", "metadata": {}, "source": [ "But she can do her post-processing and hypothesis analysis." ] }, { "cell_type": "code", "execution_count": 47, "id": "748f125f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesnb_penguinavg_bill_length_mmstd_bill_length_mm
0Adelie15138.3706085.411184
1Chinstrap6749.3285253.933794
2Gentoo12246.56262910.145133
\n", "
" ], "text/plain": [ " species nb_penguin avg_bill_length_mm std_bill_length_mm\n", "0 Adelie 151 38.370608 5.411184\n", "1 Chinstrap 67 49.328525 3.933794\n", "2 Gentoo 122 46.562629 10.145133" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_flipper = flipper_length_response.result.df\n", "df_flipper" ] }, { "cell_type": "markdown", "id": "099129cf", "metadata": {}, "source": [ "And finally the $t$-test:" ] }, { "cell_type": "code", "execution_count": 48, "id": "0a7d7d4d", "metadata": {}, "outputs": [], "source": [ "def t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2):\n", " standard_error = (std_1 * (nb_1 - 1) + std_2 * (nb_2 - 1))/(nb_1 + nb_2 - 2)\n", " return (avg_1 - avg_2)/np.sqrt(standard_error**2*(1/nb_1 + 1 /nb_2))" ] }, { "cell_type": "code", "execution_count": 49, "id": "bc3ee48a", "metadata": {}, "outputs": [], "source": [ "nb_0, avg_0, std_0 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[0]\n", "nb_1, avg_1, std_1 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[1]\n", "nb_2, avg_2, std_2 = df_flipper[['nb_penguin', 'avg_bill_length_mm', 'std_bill_length_mm']].iloc[2]" ] }, { "cell_type": "code", "execution_count": 50, "id": "1717f9ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T test between species 0 and specie 1: -15.05. Reject null hypothesis: True.\n", "T test between species 0 and specie 2: -8.94. Reject null hypothesis: True.\n", "T test between species 1 and specie 2: 2.29. Reject null hypothesis: True.\n" ] } ], "source": [ "t_01 = t_test(avg_0, avg_1, std_0, std_1, nb_0, nb_1)\n", "t_02 = t_test(avg_0, avg_2, std_0, std_2, nb_0, nb_2)\n", "t_12 = t_test(avg_1, avg_2, std_1, std_2, nb_1, nb_2)\n", "\n", "print(f\"T test between species 0 and specie 1: {np.round(t_01, 2)}. Reject null hypothesis: {abs(t_01) > CRITICAL_VALUE}.\")\n", "print(f\"T test between species 0 and specie 2: {np.round(t_02, 2)}. Reject null hypothesis: {abs(t_02) > CRITICAL_VALUE}.\")\n", "print(f\"T test between species 1 and specie 2: {np.round(t_12, 2)}. Reject null hypothesis: {abs(t_12) > CRITICAL_VALUE}.\")" ] }, { "cell_type": "markdown", "id": "a557b754-ae7a-490a-8c4b-3928de7101d1", "metadata": {}, "source": [ "All t-tests are above the critical value of 0.5. She can reject the null hypothesis." ] }, { "cell_type": "markdown", "id": "b4df475b", "metadata": {}, "source": [ "She finally computes the confidence intervals for the flipper length of each species" ] }, { "cell_type": "code", "execution_count": 51, "id": "9289bc26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesnb_penguinavg_bill_length_mmstd_bill_length_mmstandard_errorci_95_lower_boundci_95_upper_bound
0Adelie15138.3706085.4111840.44035637.50751139.233706
1Chinstrap6749.3285253.9337940.48058948.38657050.270481
2Gentoo12246.56262910.1451330.91849744.76237448.362883
\n", "
" ], "text/plain": [ " species nb_penguin avg_bill_length_mm std_bill_length_mm \\\n", "0 Adelie 151 38.370608 5.411184 \n", "1 Chinstrap 67 49.328525 3.933794 \n", "2 Gentoo 122 46.562629 10.145133 \n", "\n", " standard_error ci_95_lower_bound ci_95_upper_bound \n", "0 0.440356 37.507511 39.233706 \n", "1 0.480589 48.386570 50.270481 \n", "2 0.918497 44.762374 48.362883 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ZSCORE = 1.96\n", "df_flipper['standard_error'] = df_flipper['std_bill_length_mm']/np.sqrt(df_flipper['nb_penguin'])\n", "df_flipper['ci_95_lower_bound'] = df_flipper['avg_bill_length_mm'] - ZSCORE * df_flipper['standard_error']\n", "df_flipper['ci_95_upper_bound'] = df_flipper['avg_bill_length_mm'] + ZSCORE * df_flipper['standard_error']\n", "df_flipper" ] }, { "cell_type": "code", "execution_count": null, "id": "cce8dde4-0c9e-43d4-b6ce-890ac7c8efd2", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "a630fd2d-86fb-4b8d-afb3-c6d1c31f3ac4", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }