{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# S3 example" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, one first needs to install the library `lomas-client` on her local developping environment. \n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "#!pip install lomas-client" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "from lomas_client.client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a Client object must be created. It is responsible for sending sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server. \n", "\n", "The client needs a few parameters to be created. Usually, these would be set in the environment by the system administrator (queen Icebergina) and be transparent to lomas users. In this instance, the following code snippet sets a few of these parameters that are specific to this notebook. " ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "# The following would usually be set in the environment by a system administrator\n", "# and be tranparent to lomas users. We reset these ones because they are specific to this notebook.\n", "\n", "# Note that all client settings can also be passed as keyword arguments to the Client constructor.\n", "# eg. client = Client(client_id = \"Dr.Antartica\") takes precedence over setting the \"LOMAS_CLIENT_CLIENT_ID\"\n", "# environment variable.\n", "\n", "import os\n", "\n", "USER_NAME = \"Jack\"\n", "os.environ[\"LOMAS_CLIENT_CLIENT_ID\"] = USER_NAME\n", "os.environ[\"LOMAS_CLIENT_CLIENT_SECRET\"] = USER_NAME.lower()\n", "os.environ[\"LOMAS_CLIENT_DATASET_NAME\"] = \"TITANIC\"" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "client = Client()" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Step 3: Understand the functionnalities of the library" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "### Getting dataset metadata" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 887,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'Pclass': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 1,\n", " 'upper': 3},\n", " 'Name': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['male', 'female']},\n", " 'Age': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 0.1,\n", " 'upper': 100.0},\n", " 'SibSp': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 0,\n", " 'upper': 10},\n", " 'Parch': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 0,\n", " 'upper': 10},\n", " 'Ticket': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Fare': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 0.0,\n", " 'upper': 1000.0},\n", " 'Cabin': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Embarked': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['C', 'Q', 'S']},\n", " 'Survived': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'boolean'}}}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic_metadata = client.get_dataset_metadata()\n", "titanic_metadata" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "### Get a dummy dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "NB_ROWS = 200\n", "SEED = 0" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(200, 11)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PclassNameSexAgeSibSpParchTicketFareCabinEmbarkedSurvived
03ofemale89.690443662858.435326USTrue
12Dmale58.37367300Z620.908898aCTrue
22ufemale4.11780024h193.917948GSTrue
31omale71.17753497a687.914521ZQTrue
413male56.9456834101758.999002WSTrue
\n", "
" ], "text/plain": [ " Pclass Name Sex Age SibSp Parch Ticket Fare Cabin \\\n", "0 3 o female 89.690443 6 6 2 858.435326 U \n", "1 2 D male 58.373673 0 0 Z 620.908898 a \n", "2 2 u female 4.117800 2 4 h 193.917948 G \n", "3 1 o male 71.177534 9 7 a 687.914521 Z \n", "4 1 3 male 56.945683 4 10 1 758.999002 W \n", "\n", " Embarked Survived \n", "0 S True \n", "1 C True \n", "2 S True \n", "3 Q True \n", "4 S True " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dummy = client.get_dummy_dataset(\n", " nb_rows = NB_ROWS, \n", " seed = SEED\n", ")\n", "\n", "print(df_dummy.shape)\n", "df_dummy.head()" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "### Query on dummy dataset" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "#### Average and number of rows with smartnoise-sql library on remote dummy" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "# Average Age\n", "QUERY = \"SELECT COUNT(*) AS nb_passengers, \\\n", " AVG(Age) AS avg_age \\\n", " FROM df\"" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# On the remote server dummy dataframe\n", "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = 100.0, # make sure to select high values of epsilon and delta to have small differences\n", " delta = 2.0, # make sure to select high values of epsilon and delta to have small differences\n", " dummy = True, \n", " nb_rows = NB_ROWS,\n", " seed = SEED\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average age in remote dummy: 51.71 years old.\n", "Number of rows in remote dummy: 199.\n" ] } ], "source": [ "print(f\"Average age in remote dummy: {np.round(dummy_res.result.df['avg_age'][0], 2)} years old.\")\n", "print(f\"Number of rows in remote dummy: {np.round(dummy_res.result.df['nb_passengers'][0], 2)}.\")" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "### Get current budget" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "InitialBudgetResponse(initial_epsilon=45.0, initial_delta=0.2)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_initial_budget()" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpentBudgetResponse(total_spent_epsilon=0.0, total_spent_delta=0.0)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_total_spent_budget()" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "It will also be useful to know what the remaining budget is. Therefore, we call the function `get_remaining_budget`. It just substarcts the total spent budget from the initial budget." ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=45.0, remaining_delta=0.2)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "As expected, for now the remaining budget is equal to the inital budget." ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "### Estimate cost of a query\n", "Another safeguard is the functionnality to estimate the cost of a query. As in OpenDP and SmartnoiseSQL, the budget that will by used by a query might be slightly different than what is asked by the user. The `estimate cost` function returns the estimated real cost of any query.\n", "\n", "Again, of course, this will not impact the user's budget." ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "EPSILON = 0.5\n", "DELTA = 1e-4" ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.5, delta=0.00014999500000001387)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.smartnoise_sql.cost(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA\n", ")" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "### Query on real private dataset with smartnoise-sql." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=45.0, remaining_delta=0.2)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "response = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA,\n", " dummy = False # Optionnal\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of passengers in real data: 887.\n", "Average age in real data: 29.47.\n" ] } ], "source": [ "nb_passengers = response.result.df['nb_passengers'].iloc[0]\n", "print(f\"Number of passengers in real data: {nb_passengers}.\")\n", "\n", "avg_age = np.round(response.result.df['avg_age'].iloc[0], 2)\n", "print(f\"Average age in real data: {avg_age}.\")" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "After each query on the real dataset, the budget informations are also returned to the researcher. It is possible possible to check the remaining budget again afterwards:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=43.5, remaining_delta=0.199850005)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "As can be seen in `get_total_spent_budget()`, it is the budget estimated with `estimate_smartnoise_sql_cost()` that was spent." ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpentBudgetResponse(total_spent_epsilon=1.5, total_spent_delta=0.00014999500000001387)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_total_spent_budget()" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "## Step 4: Titanic statistics with opendp" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "import opendp as dp\n", "import opendp.transformations as trans\n", "import opendp.measurements as meas" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "### Confidence intervals for age over the whole population" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 887,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'Pclass': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 1,\n", " 'upper': 3},\n", " 'Name': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['male', 'female']},\n", " 'Age': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 0.1,\n", " 'upper': 100.0},\n", " 'SibSp': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 0,\n", " 'upper': 10},\n", " 'Parch': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'int',\n", " 'precision': 32,\n", " 'lower': 0,\n", " 'upper': 10},\n", " 'Ticket': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Fare': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 0.0,\n", " 'upper': 1000.0},\n", " 'Cabin': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string'},\n", " 'Embarked': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['C', 'Q', 'S']},\n", " 'Survived': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'boolean'}}}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic_metadata" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [], "source": [ "columns = [\"PassengerId\", \"Pclass\", \"Name\", \"Sex\", \"Age\", \"SibSp\", \"Parch\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "40", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.1, 100.0)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_min = titanic_metadata['columns']['Age']['lower']\n", "age_max = titanic_metadata['columns']['Age']['upper']\n", "age_min, age_max" ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "age_transformation_pipeline = (\n", " trans.make_split_dataframe(separator=\",\", col_names=columns) >>\n", " trans.make_select_column(key=\"Age\", TOA=str) >>\n", " trans.then_cast_default(TOA=float) >>\n", " trans.then_clamp(bounds=(age_min, age_max)) >>\n", " trans.then_resize(size=nb_passengers.tolist(), constant=avg_age) >>\n", " trans.then_variance()\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "42", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "TypeError", "evalue": "Opendp_pipeline must either of type Measurement or LazyFrame, found ", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mTypeError\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[26]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Expect to fail !!!\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[43mclient\u001b[49m\u001b[43m.\u001b[49m\u001b[43mopendp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mopendp_pipeline\u001b[49m\u001b[43m \u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[43mage_transformation_pipeline\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[32m 5\u001b[39m \u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/opendp.py:141\u001b[39m, in \u001b[36mOpenDPClient.query\u001b[39m\u001b[34m(self, opendp_pipeline, fixed_delta, mechanism, dummy, nb_rows, seed)\u001b[39m\n\u001b[32m 104\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mquery\u001b[39m(\n\u001b[32m 105\u001b[39m \u001b[38;5;28mself\u001b[39m,\n\u001b[32m 106\u001b[39m opendp_pipeline: dp.Measurement | pl.LazyFrame,\n\u001b[32m (...)\u001b[39m\u001b[32m 111\u001b[39m seed: \u001b[38;5;28mint\u001b[39m = DUMMY_SEED,\n\u001b[32m 112\u001b[39m ) -> QueryResponse | \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 113\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"This function executes an OpenDP query.\u001b[39;00m\n\u001b[32m 114\u001b[39m \n\u001b[32m 115\u001b[39m \u001b[33;03m Args:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 139\u001b[39m \u001b[33;03m containing the deserialized pipeline result.\u001b[39;00m\n\u001b[32m 140\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m141\u001b[39m body_json = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_get_opendp_request_body\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 142\u001b[39m \u001b[43m \u001b[49m\u001b[43mopendp_pipeline\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 143\u001b[39m \u001b[43m \u001b[49m\u001b[43mfixed_delta\u001b[49m\u001b[43m=\u001b[49m\u001b[43mfixed_delta\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 144\u001b[39m \u001b[43m \u001b[49m\u001b[43mmechanism\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmechanism\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 145\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 147\u001b[39m request_model: \u001b[38;5;28mtype\u001b[39m[OpenDPRequestModel]\n\u001b[32m 148\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m dummy:\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/opendp.py:61\u001b[39m, in \u001b[36mOpenDPClient._get_opendp_request_body\u001b[39m\u001b[34m(self, opendp_pipeline, fixed_delta, mechanism)\u001b[39m\n\u001b[32m 59\u001b[39m body_json[\u001b[33m\"\u001b[39m\u001b[33mpipeline_type\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[33m\"\u001b[39m\u001b[33mpolars\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 60\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m61\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\n\u001b[32m 62\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mOpendp_pipeline must either of type Measurement\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 63\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m or LazyFrame, found \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mtype\u001b[39m(opendp_pipeline)\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m\n\u001b[32m 64\u001b[39m )\n\u001b[32m 66\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m body_json\n", "\u001b[31mTypeError\u001b[39m: Opendp_pipeline must either of type Measurement or LazyFrame, found " ] } ], "source": [ "# Expect to fail !!!\n", "client.opendp.query(\n", " opendp_pipeline = age_transformation_pipeline,\n", " dummy=True\n", ")" ] }, { "cell_type": "markdown", "id": "43", "metadata": {}, "source": [ "This is because the server will only allow measurement pipeline with differentially private results. We add Laplacian noise to the pipeline and should be able to instantiate the pipeline." ] }, { "cell_type": "code", "execution_count": null, "id": "44", "metadata": {}, "outputs": [], "source": [ "var_age_transformation_pipeline = (\n", " age_transformation_pipeline >>\n", " meas.then_laplace(scale=5.0)\n", ")" ] }, { "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "Now that there is a measurement, one is able to apply the pipeline on the dummy dataset of the server." ] }, { "cell_type": "code", "execution_count": null, "id": "46", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy result for variance: 56.99\n" ] } ], "source": [ "dummy_var_res = client.opendp.query(\n", " opendp_pipeline = var_age_transformation_pipeline, \n", " dummy=True\n", ")\n", "print(f\"Dummy result for variance: {np.round(dummy_var_res.result.value, 2)}\")" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "With opendp, the function `estimate_opendp_cost` is particularly useful to estimate the used `epsilon` and `delta` based on the `scale` value." ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=2.2502841037292076, delta=0.0)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost_res = client.opendp.cost(\n", " opendp_pipeline = var_age_transformation_pipeline\n", ")\n", "cost_res" ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "One can now execute the query on the real dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "var_res = client.opendp.query(\n", " opendp_pipeline = var_age_transformation_pipeline, \n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "51", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of passengers: 887 (from previous smartnoise-sql query).\n", "Average age: 29.47 (from previous smartnoise-sql query).\n", "Variance of age: 194.603 (from opendp query).\n" ] } ], "source": [ "print(f\"Number of passengers: {nb_passengers} (from previous smartnoise-sql query).\")\n", "\n", "print(f\"Average age: {np.round(avg_age, 2)} (from previous smartnoise-sql query).\")\n", "\n", "var_age = var_res.result.value\n", "print(f\"Variance of age: {np.round(var_age, 3)} (from opendp query).\")" ] }, { "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard error of age: 0.47.\n" ] } ], "source": [ "# Get standard error\n", "standard_error = np.sqrt(var_age/nb_passengers)\n", "print(f\"Standard error of age: {np.round(standard_error, 2)}.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The 95% confidence interval of the age of all passengers is [28.55, 30.39].\n" ] } ], "source": [ " # Compute the 95% confidence interval\n", "ZSCORE = 1.96\n", "lower_bound = np.round(avg_age - ZSCORE*standard_error, 2)\n", "upper_bound = np.round(avg_age + ZSCORE*standard_error, 2)\n", "print(f\"The 95% confidence interval of the age of all passengers is [{lower_bound}, {upper_bound}].\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }