{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Lomas Client Side: Using Smartnoise-Synth" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "This notebook showcases how a researcher could use the Secure Data Disclosure system. It explains the different functionalities provided by the `lomas-client` client library to interact with the secure server.\n", "\n", "The secure data are never visible to researchers. They can only access differentially private responses via queries to the server.\n", "\n", "Each user has access to one or multiple projects and, for each dataset, has a limited budget expressed in $\\epsilon$ and $\\delta$ values." ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the library `lomas-client` in her local development environment. \n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "# !pip install lomas_client" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "Or using a local version of the client:" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "sys.path.append(os.path.abspath(os.path.join('..')))" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from lomas_client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server. 
\n", "\n", "The client needs a few parameters to be created. Usually, these would be set in the environment by the system administrator (Queen Icebergina) and be transparent to lomas users. In this instance, the following code snippet sets a few of these parameters that are specific to this notebook. \n", "\n", "She will only be able to query the real dataset if Queen Icebergina has previously created an account for her in the database, given her access to the PENGUIN dataset, and granted her some epsilon and delta budget." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "# The following would usually be set in the environment by a system administrator\n", "# and be transparent to lomas users.\n", "# Uncomment them if you are running against a Kubernetes deployment.\n", "# They have already been set for you if you are running locally within a devenv or the Jupyter lab set up by Docker compose.\n", "\n", "import os\n", "# os.environ[\"LOMAS_CLIENT_APP_URL\"] = \"https://lomas.example.com:443\"\n", "# os.environ[\"LOMAS_CLIENT_KEYCLOAK_URL\"] = \"https://keycloak.example.com:443\"\n", "# os.environ[\"LOMAS_CLIENT_TELEMETRY__ENABLED\"] = \"false\"\n", "# os.environ[\"LOMAS_CLIENT_TELEMETRY__COLLECTOR_ENDPOINT\"] = \"http://otel.example.com:445\"\n", "# os.environ[\"LOMAS_CLIENT_TELEMETRY__COLLECTOR_INSECURE\"] = \"true\"\n", "# os.environ[\"LOMAS_CLIENT_TELEMETRY__SERVICE_ID\"] = \"my-app-client\"\n", "# os.environ[\"LOMAS_CLIENT_REALM\"] = \"lomas\"\n", "\n", "# We set these ones because they are specific to this notebook.\n", "\n", "USER_NAME = \"Dr.Antartica\"\n", "os.environ[\"LOMAS_CLIENT_CLIENT_ID\"] = USER_NAME\n", "os.environ[\"LOMAS_CLIENT_CLIENT_SECRET\"] = USER_NAME.lower()\n", "os.environ[\"LOMAS_CLIENT_DATASET_NAME\"] = \"PENGUIN\"\n", "\n", "# Note that all client settings can also be passed as keyword arguments to the Client constructor.\n", "# e.g. 
client = Client(client_id = \"Dr.Antartica\") takes precedence over setting the \"LOMAS_CLIENT_CLIENT_ID\"\n", "# environment variable." ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "client = Client()" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "And that's it for the preparation. She is now ready to use the various functionalities offered by `lomas-client`." ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## Step 3: Metadata and dummy dataset" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "### Getting dataset metadata\n", "\n", "Dr. Antartica has never seen the data and, as a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she just needs to call the `get_dataset_metadata()` function of the client. As this is public information, this does not cost any budget.\n", "\n", "This function returns metadata in a format based on the [SmartnoiseSQL dictionary format](https://docs.smartnoise.org/sql/metadata.html#dictionary-format), which includes, among other things, information about all the available columns, their types and their bound values (see the Smartnoise page for more details). Any metadata required for Smartnoise-SQL is also required here, and additional information, such as the categories of a string-type column, can be added." 
] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 344,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 
'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin_metadata = client.get_dataset_metadata()\n", "penguin_metadata" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## Step 3: Create a Synthetic Dataset keeping all default parameters" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "We want to get a synthetic model to represent the private data.\n", "\n", "Therefore, we use a Smartnoise-Synth synthesizer." ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "Let's list the available options. Their respective parameters are described in the Smartnoise-Synth documentation [here](https://docs.smartnoise.org/synth/synthesizers/index.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['mwem', 'dpctgan', 'patectgan', 'mst', 'pacsynth', 'dpgan', 'pategan', 'aim']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from snsynth import Synthesizer\n", "Synthesizer.list_synthesizers()" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "### AIM: Adaptive Iterative Mechanism" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "We start by executing a query on the dummy dataset without specifying any special parameters for AIM (all optional parameters are kept at their defaults).\n", "AIM only works on categorical columns, so we select the \"species\" and \"island\" columns to create a synthetic dataset of these two columns." ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species island\n", "0 Chinstrap Biscoe\n", "1 Gentoo Biscoe\n", "2 Gentoo Dream\n", "3 Adelie Dream\n", "4 Gentoo Biscoe\n", ".. ... ...\n", "195 Gentoo Torgersen\n", "196 Adelie Biscoe\n", "197 Gentoo Biscoe\n", "198 Adelie Torgersen\n", "199 Gentoo Torgersen\n", "\n", "[200 rows x 2 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "The algorithm works and returned a synthetic dataset. We now estimate the cost of running this command:" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.0, delta=0.0001)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_cost = client.smartnoise_synth.cost(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", ")\n", "res_cost" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "Executing such a query on the private dataset would cost 1.0 epsilon and 0.0001 delta. Dr. Antartica decides to do it with now the flag `dummmy` to False and specifiying that the wants the aim synthesizer model in return (with `return_model = True`).\n", "\n", "NOTE: if she does not set the parameter `return_model = True`, then it is False by default and she will get a synthetic dataframe as response directly." 
] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=False,\n", " return_model = True\n", ")\n", "res.result.model" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "She can now get the model and sample results with it. She chooses to draw 10 samples." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species island\n", "0 Chinstrap Torgersen\n", "1 Adelie Torgersen\n", "2 Chinstrap Torgersen\n", "3 Adelie Dream\n", "4 Chinstrap Biscoe\n", "5 Gentoo Biscoe\n", "6 Gentoo Dream\n", "7 Gentoo Biscoe\n", "8 Chinstrap Biscoe\n", "9 Chinstrap Torgersen" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = res.result.model\n", "synth.sample(10)" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "She now wants to specify some specific parameters to the AIM model. Therefore, she needs to set some parameters in `synth_params` based on the Smartnoise-Synth documentation [here](https://docs.smartnoise.org/synth/synthesizers/aim.html#parameters). She decides that she wants to modify the `max_model_size` to 50 (the default was 80) and tries on the dummy." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=True,\n", " return_model = True,\n", " synth_params = {\"max_model_size\": 50}\n", ")\n", "res_dummy.result.model" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species island\n", "0 Adelie Biscoe\n", "1 Gentoo Torgersen\n", "2 Chinstrap Biscoe\n", "3 Chinstrap Torgersen\n", "4 Gentoo Dream" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = res_dummy.result.model\n", "synth.sample(5)" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Now that the workflow is understood for AIM, she wants to experiment with various synthesizer on the dummy." ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "### MWEM: Multiplicative Weights Exponential Mechanism " ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "She tries MWEM on all columns with all default parameters. As `return_model` is not specified she will directly receive a synthetic dataframe back. " ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Biscoe 56.25 20.5 155.0 \n", "1 Gentoo Biscoe 56.25 20.5 245.0 \n", "2 Adelie Dream 56.25 21.5 185.0 \n", "3 Gentoo Dream 52.75 13.5 245.0 \n", "4 Gentoo Dream 63.25 13.5 245.0 \n", "\n", " body_mass_g sex \n", "0 6250.0 MALE \n", "1 4750.0 MALE \n", "2 3250.0 FEMALE \n", "3 5250.0 FEMALE \n", "4 3750.0 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "She now specifies 3 columns and some parameters explained [here](https://docs.smartnoise.org/synth/synthesizers/mwem.html#snsynth.mwem.MWEMSynthesizer)." ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species island sex\n", "0 Chinstrap Dream FEMALE\n", "1 Gentoo Dream MALE\n", "2 Chinstrap Biscoe FEMALE\n", "3 Chinstrap Biscoe FEMALE\n", "4 Gentoo Dream MALE" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"island\", \"sex\"],\n", " synth_params = {\"measure_only\": False, \"max_retries_exp_mechanism\": 5},\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "Finally it MWEM, she wants to go more in depth and create her own data preparation pipeline. Therefore, she can use Smartnoise-Synth \"Data Transformers\" explained [here](https://docs.smartnoise.org/synth/transforms/index.html) and send her own constraints dictionnary for specific steps. This is more for advanced user.\n", "\n", "By default, if no constraints are specified, the server creates its automatically a data transformer based on selected columns, synthesizer and metadata.\n", "\n", "Here she wants to add a clamping transformation on the continuous columns before training the synthesizer. She add the bounds based on metadata." 
] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0})" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bl_bounds = penguin_metadata[\"columns\"][\"bill_length_mm\"]\n", "bd_bounds = penguin_metadata[\"columns\"][\"bill_depth_mm\"]\n", "bl_bounds, bd_bounds" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "from snsynth.transform import BinTransformer, ClampTransformer, ChainTransformer, LabelTransformer\n", "\n", "my_own_constraints = {\n", " \"bill_length_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " BinTransformer(bins = 20, lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " ]\n", " ),\n", " \"bill_depth_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bd_bounds[\"lower\"] + 2, upper = bd_bounds[\"upper\"] - 2),\n", " BinTransformer(bins=20, lower = bd_bounds[\"lower\"] + 2, upper = bd_bounds[\"upper\"] - 2),\n", " ]\n", " ),\n", " \"species\": LabelTransformer(nullable=True)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bill_length_mm bill_depth_mm species\n", "0 50.875 19.95 Chinstrap\n", "1 44.875 20.85 Gentoo\n", "2 41.875 16.05 Adelie\n", "3 47.125 21.00 Gentoo\n", "4 44.875 20.85 Gentoo" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"bill_length_mm\", \"bill_depth_mm\", \"species\"],\n", " constraints = my_own_constraints,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "Also a subset of constraints can be specified for certain columns and the server will automatically generate those for the missing columns." ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "my_own_constraints = {\n", " \"bill_length_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " BinTransformer(bins = 20, lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " ]\n", " )\n", "}" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "In this case, only the bill_length will be clamped." ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bill_length_mm bill_depth_mm species\n", "0 53.125 20.5 Chinstrap\n", "1 50.125 15.5 Adelie\n", "2 50.125 15.5 Adelie\n", "3 50.125 15.5 Adelie\n", "4 54.625 17.5 Gentoo" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"bill_length_mm\", \"bill_depth_mm\", \"species\"],\n", " constraints = my_own_constraints,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "### MST: Maximum Spanning Tree" ] }, { "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "She now experiments with MST. As the synthesizer is very needy in terms of computation, she selects a subset of column for it. See MST [here](https://docs.smartnoise.org/synth/synthesizers/mst.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "46", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species sex\n", "0 MALE\n", "1 Gentoo FEMALE\n", "2 FEMALE\n", "3 Adelie MALE\n", "4 Gentoo FEMALE" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"sex\"],\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "She can also specify a specific number of samples to get (if return_model is not True):" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " species sex\n", "0 FEMALE\n", "1 MALE\n", "2 Chinstrap MALE\n", "3 Adelie FEMALE" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"sex\"],\n", " nb_samples = 4,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "And a condition on these samples. For instance, here, she only wants female samples." ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex species\n", "0 Gentoo\n", "1 \n", "2 Adelie\n", "3 Gentoo" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"sex\", \"species\"],\n", " nb_samples = 4,\n", " condition = \"sex = FEMALE\",\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "## DPCTGAN: Differentially Private Conditional Tabular GAN" ] }, { "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "She now tries DPCTGAN. A first warning let her know that the random noise generation for this model is not cryptographically secure and if it is not ok for her, she can decode to stop using this synthesizer. Then she does not get a response but an error 422 with an explanation." ] }, { "cell_type": "code", "execution_count": null, "id": "53", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:48: UserWarning: Warning:dpctgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "ename": "ExternalLibraryException", "evalue": "(, 'Error fitting model: sample_rate=5.0 is not a valid value. Please provide a float between 0 and 1. 
Try decreasing batch_size in synth_params (default batch_size=500).')", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mExternalLibraryException\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[24]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m res_dummy = \u001b[43mclient\u001b[49m\u001b[43m.\u001b[49m\u001b[43msmartnoise_synth\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43msynth_name\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mdpctgan\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mepsilon\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 6\u001b[39m res_dummy\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/smartnoise_synth.py:195\u001b[39m, in \u001b[36mSmartnoiseSynthClient.query\u001b[39m\u001b[34m(self, synth_name, epsilon, delta, select_cols, synth_params, nullable, constraints, dummy, return_model, condition, nb_samples, nb_rows, seed)\u001b[39m\n\u001b[32m 192\u001b[39m body = request_model.model_validate(body_dict)\n\u001b[32m 193\u001b[39m res = \u001b[38;5;28mself\u001b[39m.http_client.post(endpoint, body, SMARTNOISE_SYNTH_READ_TIMEOUT)\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mvalidate_model_response\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mhttp_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[43mres\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mQueryResponse\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/utils.py:97\u001b[39m, in \u001b[36mvalidate_model_response\u001b[39m\u001b[34m(client, response, response_model)\u001b[39m\n\u001b[32m 95\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m job.status == \u001b[33m\"\u001b[39m\u001b[33mfailed\u001b[39m\u001b[33m\"\u001b[39m:\n\u001b[32m 96\u001b[39m \u001b[38;5;28;01massert\u001b[39;00m job.error \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m, \u001b[33m\"\u001b[39m\u001b[33mjob \u001b[39m\u001b[38;5;132;01m{job_uid}\u001b[39;00m\u001b[33m failed without error !\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m---> \u001b[39m\u001b[32m97\u001b[39m \u001b[43mraise_error_from_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mjob\u001b[49m\u001b[43m.\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 99\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m response_model.model_validate(job.result)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/core/lomas_core/error_handler.py:150\u001b[39m, in \u001b[36mraise_error_from_model\u001b[39m\u001b[34m(error_model)\u001b[39m\n\u001b[32m 148\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_model.message)\n\u001b[32m 149\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m ExternalLibraryExceptionModel():\n\u001b[32m--> \u001b[39m\u001b[32m150\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(error_model.library, error_model.message)\n\u001b[32m 151\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m UnauthorizedAccessExceptionModel():\n\u001b[32m 152\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_model.message)\n", "\u001b[31mExternalLibraryException\u001b[39m: (, 'Error fitting model: sample_rate=5.0 is not a valid value. 
Please provide a float between 0 and 1. Try decreasing batch_size in synth_params (default batch_size=500).')" ] } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"dpctgan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy" ] }, { "cell_type": "markdown", "id": "54", "metadata": {}, "source": [ "The default parameters of DPCTGAN do not work for the PENGUIN dataset: the default `batch_size=500` yields an invalid `sample_rate` of 5.0. Hence, as advised in the error message, she decreases `batch_size` (she also checks the documentation [here](https://docs.smartnoise.org/synth/synthesizers/dpctgan.html))." ] }, { "cell_type": "code", "execution_count": null, "id": "55", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0AdelieBiscoe45.10619016.716415231.2200164253.058255MALE
1ChinstrapTorgersen48.93280117.334574202.0852134730.876580MALE
2ChinstrapTorgersen45.39089415.489699198.9729544027.705349FEMALE
3ChinstrapDream56.00323916.340220210.3316593981.057748MALE
4AdelieTorgersen41.85495215.144781215.5355023810.137480FEMALE
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Biscoe 45.106190 16.716415 231.220016 \n", "1 Chinstrap Torgersen 48.932801 17.334574 202.085213 \n", "2 Chinstrap Torgersen 45.390894 15.489699 198.972954 \n", "3 Chinstrap Dream 56.003239 16.340220 210.331659 \n", "4 Adelie Torgersen 41.854952 15.144781 215.535502 \n", "\n", " body_mass_g sex \n", "0 4253.058255 MALE \n", "1 4730.876580 MALE \n", "2 4027.705349 FEMALE \n", "3 3981.057748 MALE \n", "4 3810.137480 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"dpctgan\",\n", " epsilon=1.0,\n", " synth_params = {\"batch_size\": 50},\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "## PATEGAN: Private Aggregation of Teacher Ensembles" ] }, { "cell_type": "markdown", "id": "57", "metadata": {}, "source": [ "Unfortunately, she is not able to train the `pategan` synthesizer on the PENGUIN dataset: the server reports that it is not reliable with this dataset. Hence, she must try another one." 
] }, { "cell_type": "code", "execution_count": null, "id": "58", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "ExternalLibraryException", "evalue": "(, 'pategan not reliable with this dataset.')", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mExternalLibraryException\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[26]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m res_dummy = \u001b[43mclient\u001b[49m\u001b[43m.\u001b[49m\u001b[43msmartnoise_synth\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43msynth_name\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mpategan\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mepsilon\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 6\u001b[39m res_dummy\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/smartnoise_synth.py:195\u001b[39m, in \u001b[36mSmartnoiseSynthClient.query\u001b[39m\u001b[34m(self, synth_name, epsilon, delta, select_cols, synth_params, nullable, constraints, dummy, return_model, condition, nb_samples, nb_rows, seed)\u001b[39m\n\u001b[32m 192\u001b[39m body = request_model.model_validate(body_dict)\n\u001b[32m 193\u001b[39m res = \u001b[38;5;28mself\u001b[39m.http_client.post(endpoint, body, SMARTNOISE_SYNTH_READ_TIMEOUT)\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mvalidate_model_response\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mhttp_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mres\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mQueryResponse\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/utils.py:97\u001b[39m, in \u001b[36mvalidate_model_response\u001b[39m\u001b[34m(client, response, response_model)\u001b[39m\n\u001b[32m 95\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m job.status == \u001b[33m\"\u001b[39m\u001b[33mfailed\u001b[39m\u001b[33m\"\u001b[39m:\n\u001b[32m 96\u001b[39m \u001b[38;5;28;01massert\u001b[39;00m job.error \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m, \u001b[33m\"\u001b[39m\u001b[33mjob \u001b[39m\u001b[38;5;132;01m{job_uid}\u001b[39;00m\u001b[33m failed without error !\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m---> \u001b[39m\u001b[32m97\u001b[39m \u001b[43mraise_error_from_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mjob\u001b[49m\u001b[43m.\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 99\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m response_model.model_validate(job.result)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/core/lomas_core/error_handler.py:150\u001b[39m, in \u001b[36mraise_error_from_model\u001b[39m\u001b[34m(error_model)\u001b[39m\n\u001b[32m 148\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_model.message)\n\u001b[32m 149\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m ExternalLibraryExceptionModel():\n\u001b[32m--> \u001b[39m\u001b[32m150\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(error_model.library, error_model.message)\n\u001b[32m 151\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m UnauthorizedAccessExceptionModel():\n\u001b[32m 152\u001b[39m 
\u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_model.message)\n", "\u001b[31mExternalLibraryException\u001b[39m: (, 'pategan not reliable with this dataset.')" ] } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"pategan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy" ] }, { "cell_type": "markdown", "id": "59", "metadata": {}, "source": [ "## PATECTGAN: Conditional tabular GAN using Private Aggregation of Teacher Ensembles" ] }, { "cell_type": "code", "execution_count": null, "id": "60", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0AdelieBiscoe37.57665516.970317206.3505634852.220871MALE
1ChinstrapBiscoe41.74362518.780999206.8318435129.978105MALE
2ChinstrapBiscoe47.64148718.473230227.5581693462.845579MALE
3GentooDream54.31441418.642316225.6579283326.226145FEMALE
4GentooTorgersen46.69429518.423236195.6390255145.398423FEMALE
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Biscoe 37.576655 16.970317 206.350563 \n", "1 Chinstrap Biscoe 41.743625 18.780999 206.831843 \n", "2 Chinstrap Biscoe 47.641487 18.473230 227.558169 \n", "3 Gentoo Dream 54.314414 18.642316 225.657928 \n", "4 Gentoo Torgersen 46.694295 18.423236 195.639025 \n", "\n", " body_mass_g sex \n", "0 4852.220871 MALE \n", "1 5129.978105 MALE \n", "2 3462.845579 MALE \n", "3 3326.226145 FEMALE \n", "4 5145.398423 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"patectgan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "61", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandbill_length_mmbody_mass_g
0Torgersen62.2825263478.341073
1Biscoe59.7208042531.271100
2Biscoe46.1836805444.812819
3Torgersen54.4612372595.776290
4Dream41.0822724234.085873
\n", "
" ], "text/plain": [ " island bill_length_mm body_mass_g\n", "0 Torgersen 62.282526 3478.341073\n", "1 Biscoe 59.720804 2531.271100\n", "2 Biscoe 46.183680 5444.812819\n", "3 Torgersen 54.461237 2595.776290\n", "4 Dream 41.082272 4234.085873" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"patectgan\",\n", " epsilon=1.0,\n", " select_cols = [\"island\", \"bill_length_mm\", \"body_mass_g\"],\n", " synth_params = {\n", " \"embedding_dim\": 256, \n", " \"generator_dim\": (128, 128), \n", " \"discriminator_dim\": (256, 256),\n", " \"generator_lr\": 0.0003, \n", " \"generator_decay\": 1e-05, \n", " \"discriminator_lr\": 0.0003, \n", " \"discriminator_decay\": 1e-05, \n", " \"batch_size\": 500\n", " },\n", " nb_samples = 100,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "62", "metadata": {}, "source": [ "## DPGAN: Differentially Private GAN" ] }, { "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "As with DPCTGAN, DPGAN raises a warning that its random generator for noise and shuffling is not cryptographically secure." ] }, { "cell_type": "code", "execution_count": null, "id": "64", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:48: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0GentooDream59.40809321.774501182.8584333574.388221FEMALE
1GentooBiscoe45.65373718.811784197.7557543584.595516FEMALE
2GentooDream46.93570922.695824184.0802924085.711025FEMALE
3GentooDream47.61337520.382118192.0399803633.892506FEMALE
4GentooTorgersen47.48634621.495789244.9852383500.759944FEMALE
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Dream 59.408093 21.774501 182.858433 \n", "1 Gentoo Biscoe 45.653737 18.811784 197.755754 \n", "2 Gentoo Dream 46.935709 22.695824 184.080292 \n", "3 Gentoo Dream 47.613375 20.382118 192.039980 \n", "4 Gentoo Torgersen 47.486346 21.495789 244.985238 \n", "\n", " body_mass_g sex \n", "0 3574.388221 FEMALE \n", "1 3584.595516 FEMALE \n", "2 4085.711025 FEMALE \n", "3 3633.892506 FEMALE \n", "4 3500.759944 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"dpgan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "65", "metadata": {}, "source": [ "One final time she samples with conditions:" ] }, { "cell_type": "code", "execution_count": null, "id": "66", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:48: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0AdelieBiscoe64.59228717.889545196.4303115547.378704FEMALE
1GentooTorgersen56.61077717.608110198.2951145344.676420MALE
2AdelieBiscoe47.45322317.926415246.2103756746.744037MALE
3ChinstrapBiscoe58.20697517.540024191.4670186017.837495MALE
4ChinstrapBiscoe47.60677721.512008188.2924216610.772133MALE
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Biscoe 64.592287 17.889545 196.430311 \n", "1 Gentoo Torgersen 56.610777 17.608110 198.295114 \n", "2 Adelie Biscoe 47.453223 17.926415 246.210375 \n", "3 Chinstrap Biscoe 58.206975 17.540024 191.467018 \n", "4 Chinstrap Biscoe 47.606777 21.512008 188.292421 \n", "\n", " body_mass_g sex \n", "0 5547.378704 FEMALE \n", "1 5344.676420 MALE \n", "2 6746.744037 MALE \n", "3 6017.837495 MALE \n", "4 6610.772133 MALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"dpgan\",\n", " epsilon=1.0,\n", " condition = \"body_mass_g > 5000\",\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "67", "metadata": {}, "source": [ "And now on the real dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "68", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:48: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0GentooBiscoe65.00000017.650500250.0000005846.368641FEMALE
1GentooBiscoe46.75503317.386022243.7623236292.863309FEMALE
2AdelieBiscoe65.00000019.964333234.8817476435.244948MALE
3GentooBiscoe65.00000016.515368229.1681625154.040873FEMALE
4ChinstrapBiscoe65.00000017.283090250.0000006809.538275MALE
5GentooBiscoe61.37303017.146575229.2272426436.501563FEMALE
6GentooTorgersen49.68081419.886045218.0706256159.562886MALE
7AdelieTorgersen52.84858517.673031203.9137797000.000000MALE
8GentooBiscoe46.31144423.000000241.7939995256.193101FEMALE
9GentooBiscoe55.13201317.231155233.9415436587.419331MALE
\n", "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Biscoe 65.000000 17.650500 250.000000 \n", "1 Gentoo Biscoe 46.755033 17.386022 243.762323 \n", "2 Adelie Biscoe 65.000000 19.964333 234.881747 \n", "3 Gentoo Biscoe 65.000000 16.515368 229.168162 \n", "4 Chinstrap Biscoe 65.000000 17.283090 250.000000 \n", "5 Gentoo Biscoe 61.373030 17.146575 229.227242 \n", "6 Gentoo Torgersen 49.680814 19.886045 218.070625 \n", "7 Adelie Torgersen 52.848585 17.673031 203.913779 \n", "8 Gentoo Biscoe 46.311444 23.000000 241.793999 \n", "9 Gentoo Biscoe 55.132013 17.231155 233.941543 \n", "\n", " body_mass_g sex \n", "0 5846.368641 FEMALE \n", "1 6292.863309 FEMALE \n", "2 6435.244948 MALE \n", "3 5154.040873 FEMALE \n", "4 6809.538275 MALE \n", "5 6436.501563 FEMALE \n", "6 6159.562886 MALE \n", "7 7000.000000 MALE \n", "8 5256.193101 FEMALE \n", "9 6587.419331 MALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"dpgan\",\n", " epsilon=1.0,\n", " condition = \"body_mass_g > 5000\",\n", " nb_samples = 10,\n", " dummy=False,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "## Step 6: See archives of queries" ] }, { "cell_type": "markdown", "id": "70", "metadata": {}, "source": [ "She now wants to review all the queries that she ran on the real data. This is possible because an archive of all queries is kept in a secure database. With a single function call she can see her queries, the budget spent and the associated responses." 
] }, { "cell_type": "code", "execution_count": null, "id": "71", "metadata": {}, "outputs": [], "source": [ "previous_queries = client.get_previous_queries()" ] }, { "cell_type": "markdown", "id": "72", "metadata": {}, "source": [ "Let's check the last query" ] }, { "cell_type": "code", "execution_count": null, "id": "73", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr.Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_library': 'smartnoise_synth',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'synth_name': 'dpgan',\n", " 'epsilon': 1.0,\n", " 'delta': None,\n", " 'select_cols': [],\n", " 'synth_params': {},\n", " 'nullable': True,\n", " 'constraints': '',\n", " 'return_model': False,\n", " 'condition': 'body_mass_g > 5000',\n", " 'nb_samples': 10},\n", " 'response': {'epsilon': 1.0,\n", " 'delta': 0.00015673368198174188,\n", " 'requested_by': 'Dr.Antartica',\n", " 'result': res_type \\\n", " index sn_synth_samples \n", " columns sn_synth_samples \n", " data sn_synth_samples \n", " index_names sn_synth_samples \n", " column_names sn_synth_samples \n", " \n", " df_samples \n", " index [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] \n", " columns [species, island, bill_length_mm, bill_depth_m... \n", " data [[Gentoo, Biscoe, 65.0, 17.650499559938908, 25... \n", " index_names [None] \n", " column_names [None] },\n", " 'timestamp': 1747224709.1297455}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "last_query = previous_queries[-1]\n", "last_query" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }