{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Lomas Client Side: Using Smartnoise-Synth" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "This notebook showcases how a researcher could use the Secure Data Disclosure system. It explains the different functionalities provided by the `lomas-client` client library to interact with the secure server.\n", "\n", "The secure data are never visible to researchers. They can only access differentially private responses via queries to the server.\n", "\n", "Each user has access to one or multiple projects and, for each dataset, has a limited budget expressed in $\\epsilon$ and $\\delta$ values." ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the library `lomas-client` in her local development environment. \n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "# !pip install lomas_client" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "Alternatively, a local version of the client can be used:" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "sys.path.append(os.path.abspath(os.path.join('..')))" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from lomas_client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a `Client` object must be created. It is responsible for sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server. 
\n", "\n", "The client needs a few parameters to be created. Usually, these would be set in the environment by the system administrator (Queen Icebergina) and be transparent to lomas users. In this instance, the following code snippet sets a few of these parameters that are specific to this notebook. \n", "\n", "She will only be able to query the real dataset if Queen Icebergina has previously created an account for her in the database, given her access to the PENGUIN dataset, and granted her some epsilon and delta budget." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "# The following would usually be set in the environment by a system administrator\n", "# and be transparent to lomas users. We set them here because they are specific to this notebook.\n", "\n", "# Note that all client settings can also be passed as keyword arguments to the Client constructor.\n", "# e.g. client = Client(client_id = \"Dr.Antartica\") takes precedence over setting the \"LOMAS_CLIENT_CLIENT_ID\"\n", "# environment variable.\n", "\n", "import os\n", "\n", "USER_NAME = \"Dr.Antartica\"\n", "os.environ[\"LOMAS_CLIENT_CLIENT_ID\"] = USER_NAME\n", "os.environ[\"LOMAS_CLIENT_CLIENT_SECRET\"] = USER_NAME.lower()\n", "os.environ[\"LOMAS_CLIENT_DATASET_NAME\"] = \"PENGUIN\"" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "client = Client()" ] }, { 
"cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "And that's it for the preparation. She is now ready to use the various functionalities offered by `lomas-client`." ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## Step 3: Metadata and dummy dataset" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "### Getting dataset metadata\n", "\n", "Dr. Antartica has never seen the data, so as a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she just needs to call the `get_dataset_metadata()` function of the client. As this is public information, it does not cost any budget.\n", "\n", "This function returns metadata in a format based on the [SmartnoiseSQL dictionary format](https://docs.smartnoise.org/sql/metadata.html#dictionary-format), which includes, among other things, information about all the available columns, their types and their bounds (see the Smartnoise page for more details). Any metadata that is required for Smartnoise-SQL is also required here, and additional information, such as the different categories of a string-type column, can be added." 
] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 344,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 
'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin_metadata = client.get_dataset_metadata()\n", "penguin_metadata" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## Step 4: Create a Synthetic Dataset keeping all default parameters" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "We want to get a synthetic model to represent the private data.\n", "\n", "To do so, we use one of the Smartnoise-Synth synthesizers." ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "Let's list the available options. Their respective parameters are described in the Smartnoise-Synth documentation [here](https://docs.smartnoise.org/synth/synthesizers/index.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['mwem', 'dpctgan', 'patectgan', 'mst', 'pacsynth', 'dpgan', 'pategan', 'aim']" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from snsynth import Synthesizer\n", "Synthesizer.list_synthesizers()" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "### AIM: Adaptive Iterative Mechanism" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "We start by executing a query on the dummy dataset without specifying any special parameters for AIM (all optional parameters are kept at their defaults).\n", "AIM only works on categorical columns, so we select the \"species\" and \"island\" columns to create a synthetic dataset of these two columns." ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>island</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Adelie</td><td>Dream</td></tr>\n",
" <tr><th>1</th><td>Chinstrap</td><td>Torgersen</td></tr>\n",
" <tr><th>2</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>3</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>4</th><td></td><td>Torgersen</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td></tr>\n",
" <tr><th>195</th><td>Adelie</td><td>Torgersen</td></tr>\n",
" <tr><th>196</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>197</th><td>Adelie</td><td>Torgersen</td></tr>\n",
" <tr><th>198</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>199</th><td>Gentoo</td><td>Dream</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>200 rows × 2 columns</p>\n",
"</div>" ], "text/plain": [ " species island\n", "0 Adelie Dream\n", "1 Chinstrap Torgersen\n", "2 Chinstrap Biscoe\n", "3 Chinstrap Biscoe\n", "4 Torgersen\n", ".. ... ...\n", "195 Adelie Torgersen\n", "196 Chinstrap Biscoe\n", "197 Adelie Torgersen\n", "198 Chinstrap Biscoe\n", "199 Gentoo Dream\n", "\n", "[200 rows x 2 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "The algorithm works and returns a synthetic dataset. We now estimate the cost of running this command:" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.0, delta=0.0001)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_cost = client.smartnoise_synth.cost(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", ")\n", "res_cost" ] }, { "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "Executing such a query on the private dataset would cost 1.0 epsilon and 0.0001 delta. Dr. Antartica decides to run it, now with the `dummy` flag set to False and specifying that she wants the AIM synthesizer model in return (with `return_model = True`).\n", "\n", "NOTE: `return_model` is False by default. If she does not set `return_model = True`, she directly gets a synthetic dataframe as the response." 
] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=False,\n", " return_model = True\n", ")\n", "res.result.model" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "She can now get the model and draw samples from it. She chooses to draw 10 samples." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>island</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Gentoo</td><td>Biscoe</td></tr>\n",
" <tr><th>1</th><td>Chinstrap</td><td>Torgersen</td></tr>\n",
" <tr><th>2</th><td>Gentoo</td><td>Torgersen</td></tr>\n",
" <tr><th>3</th><td>Chinstrap</td><td>Torgersen</td></tr>\n",
" <tr><th>4</th><td>Gentoo</td><td>Biscoe</td></tr>\n",
" <tr><th>5</th><td>Adelie</td><td>Dream</td></tr>\n",
" <tr><th>6</th><td>Adelie</td><td>Biscoe</td></tr>\n",
" <tr><th>7</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>8</th><td>Chinstrap</td><td>Dream</td></tr>\n",
" <tr><th>9</th><td>Gentoo</td><td>Torgersen</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " species island\n", "0 Gentoo Biscoe\n", "1 Chinstrap Torgersen\n", "2 Gentoo Torgersen\n", "3 Chinstrap Torgersen\n", "4 Gentoo Biscoe\n", "5 Adelie Dream\n", "6 Adelie Biscoe\n", "7 Chinstrap Biscoe\n", "8 Chinstrap Dream\n", "9 Gentoo Torgersen" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = res.result.model\n", "synth.sample(10)" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "She now wants to set some specific parameters for the AIM model. To do so, she passes them in `synth_params`, based on the Smartnoise-Synth documentation [here](https://docs.smartnoise.org/synth/synthesizers/aim.html#parameters). She decides to change `max_model_size` to 50 (the default is 80) and tries it on the dummy dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"aim\",\n", " epsilon=1.0,\n", " delta=0.0001,\n", " select_cols = [\"species\", \"island\"],\n", " dummy=True,\n", " return_model = True,\n", " synth_params = {\"max_model_size\": 50}\n", ")\n", "res_dummy.result.model" ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>island</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Gentoo</td><td>Biscoe</td></tr>\n",
" <tr><th>1</th><td>Gentoo</td><td>Dream</td></tr>\n",
" <tr><th>2</th><td>Chinstrap</td><td>Biscoe</td></tr>\n",
" <tr><th>3</th><td>Chinstrap</td><td>Torgersen</td></tr>\n",
" <tr><th>4</th><td>Adelie</td><td>Dream</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " species island\n", "0 Gentoo Biscoe\n", "1 Gentoo Dream\n", "2 Chinstrap Biscoe\n", "3 Chinstrap Torgersen\n", "4 Adelie Dream" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = res_dummy.result.model\n", "synth.sample(5)" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "Now that the workflow is understood for AIM, she wants to experiment with various synthesizers on the dummy dataset." ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "### MWEM: Multiplicative Weights Exponential Mechanism " ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "She tries MWEM on all columns with all default parameters. As `return_model` is not specified, she will directly receive a synthetic dataframe back. " ] }, { "cell_type": "code", "execution_count": null, "id": "33", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Gentoo</td><td>Dream</td><td>49.25</td><td>18.5</td><td>185.0</td><td>2750.0</td><td>FEMALE</td></tr>\n",
" <tr><th>1</th><td>Gentoo</td><td>Dream</td><td>45.75</td><td>18.5</td><td>185.0</td><td>2750.0</td><td>FEMALE</td></tr>\n",
" <tr><th>2</th><td>Adelie</td><td>Biscoe</td><td>49.25</td><td>13.5</td><td>155.0</td><td>2250.0</td><td>MALE</td></tr>\n",
" <tr><th>3</th><td>Gentoo</td><td>Dream</td><td>49.25</td><td>18.5</td><td>185.0</td><td>2750.0</td><td>FEMALE</td></tr>\n",
" <tr><th>4</th><td>Adelie</td><td>Biscoe</td><td>49.25</td><td>13.5</td><td>155.0</td><td>2250.0</td><td>MALE</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Dream 49.25 18.5 185.0 \n", "1 Gentoo Dream 45.75 18.5 185.0 \n", "2 Adelie Biscoe 49.25 13.5 155.0 \n", "3 Gentoo Dream 49.25 18.5 185.0 \n", "4 Adelie Biscoe 49.25 13.5 155.0 \n", "\n", " body_mass_g sex \n", "0 2750.0 FEMALE \n", "1 2750.0 FEMALE \n", "2 2250.0 MALE \n", "3 2750.0 FEMALE \n", "4 2250.0 MALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "She now specifies 3 columns and some parameters explained [here](https://docs.smartnoise.org/synth/synthesizers/mwem.html#snsynth.mwem.MWEMSynthesizer)." ] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>island</th><th>sex</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Gentoo</td><td>Biscoe</td><td>MALE</td></tr>\n",
" <tr><th>1</th><td>Chinstrap</td><td>Dream</td><td>MALE</td></tr>\n",
" <tr><th>2</th><td>Chinstrap</td><td>Dream</td><td>MALE</td></tr>\n",
" <tr><th>3</th><td>Chinstrap</td><td>Dream</td><td>MALE</td></tr>\n",
" <tr><th>4</th><td>Gentoo</td><td>Torgersen</td><td>FEMALE</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " species island sex\n", "0 Gentoo Biscoe MALE\n", "1 Chinstrap Dream MALE\n", "2 Chinstrap Dream MALE\n", "3 Chinstrap Dream MALE\n", "4 Gentoo Torgersen FEMALE" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"island\", \"sex\"],\n", " synth_params = {\"measure_only\": False, \"max_retries_exp_mechanism\": 5},\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "Finally, for MWEM, she wants to go more in depth and create her own data preparation pipeline. To do so, she can use the Smartnoise-Synth \"Data Transformers\" explained [here](https://docs.smartnoise.org/synth/transforms/index.html) and send her own constraints dictionary for specific steps. This is intended for advanced users.\n", "\n", "By default, if no constraints are specified, the server automatically creates a data transformer based on the selected columns, the synthesizer and the metadata.\n", "\n", "Here she wants to add a clamping transformation on the continuous columns before training the synthesizer. She derives the bounds from the metadata." 
] }, { "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0})" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bl_bounds = penguin_metadata[\"columns\"][\"bill_length_mm\"]\n", "bd_bounds = penguin_metadata[\"columns\"][\"bill_depth_mm\"]\n", "bl_bounds, bd_bounds" ] }, { "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "from snsynth.transform import BinTransformer, ClampTransformer, ChainTransformer, LabelTransformer\n", "\n", "my_own_constraints = {\n", " \"bill_length_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " BinTransformer(bins = 20, lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " ]\n", " ),\n", " \"bill_depth_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bd_bounds[\"lower\"] + 2, upper = bd_bounds[\"upper\"] - 2),\n", " BinTransformer(bins=20, lower = bd_bounds[\"lower\"] + 2, upper = bd_bounds[\"upper\"] - 2),\n", " ]\n", " ),\n", " \"species\": LabelTransformer(nullable=True)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "39", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>species</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>46.375</td><td>17.55</td><td>Gentoo</td></tr>\n",
" <tr><th>1</th><td>50.875</td><td>15.75</td><td>Chinstrap</td></tr>\n",
" <tr><th>2</th><td>41.875</td><td>15.15</td><td>Chinstrap</td></tr>\n",
" <tr><th>3</th><td>41.875</td><td>15.15</td><td>Adelie</td></tr>\n",
" <tr><th>4</th><td>46.375</td><td>17.55</td><td>Gentoo</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " bill_length_mm bill_depth_mm species\n", "0 46.375 17.55 Gentoo\n", "1 50.875 15.75 Chinstrap\n", "2 41.875 15.15 Chinstrap\n", "3 41.875 15.15 Adelie\n", "4 46.375 17.55 Gentoo" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"bill_length_mm\", \"bill_depth_mm\", \"species\"],\n", " constraints = my_own_constraints,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "Constraints can also be specified for only a subset of the columns, and the server will automatically generate the constraints for the remaining columns." ] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "my_own_constraints = {\n", " \"bill_length_mm\": ChainTransformer(\n", " [\n", " ClampTransformer(lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " BinTransformer(bins = 20, lower = bl_bounds[\"lower\"] + 10, upper = bl_bounds[\"upper\"] - 10),\n", " ]\n", " )\n", "}" ] }, { "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "In this case, only the `bill_length_mm` column will be clamped." ] }, { "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>species</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>49.375</td><td>15.5</td><td>Gentoo</td></tr>\n",
" <tr><th>1</th><td>45.625</td><td>22.5</td><td>Adelie</td></tr>\n",
" <tr><th>2</th><td>46.375</td><td>20.5</td><td>Adelie</td></tr>\n",
" <tr><th>3</th><td>53.125</td><td>17.5</td><td>Chinstrap</td></tr>\n",
" <tr><th>4</th><td>54.625</td><td>14.5</td><td>Adelie</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " bill_length_mm bill_depth_mm species\n", "0 49.375 15.5 Gentoo\n", "1 45.625 22.5 Adelie\n", "2 46.375 20.5 Adelie\n", "3 53.125 17.5 Chinstrap\n", "4 54.625 14.5 Adelie" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mwem\",\n", " epsilon=1.0,\n", " select_cols = [\"bill_length_mm\", \"bill_depth_mm\", \"species\"],\n", " constraints = my_own_constraints,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "44", "metadata": {}, "source": [ "### MST: Maximum Spanning Tree" ] }, { "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "She now experiments with MST. As this synthesizer is computationally expensive, she selects a subset of columns for it. See the MST documentation [here](https://docs.smartnoise.org/synth/synthesizers/mst.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "46", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>sex</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Gentoo</td><td>FEMALE</td></tr>\n",
" <tr><th>1</th><td></td><td>MALE</td></tr>\n",
" <tr><th>2</th><td>Chinstrap</td><td>MALE</td></tr>\n",
" <tr><th>3</th><td>Adelie</td><td>FEMALE</td></tr>\n",
" <tr><th>4</th><td></td><td></td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " species sex\n", "0 Gentoo FEMALE\n", "1 MALE\n", "2 Chinstrap MALE\n", "3 Adelie FEMALE\n", "4 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"sex\"],\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "47", "metadata": {}, "source": [ "She can also specify a specific number of samples to get (if return_model is not True):" ] }, { "cell_type": "code", "execution_count": null, "id": "48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>species</th><th>sex</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Chinstrap</td><td></td></tr>\n",
" <tr><th>1</th><td>Gentoo</td><td></td></tr>\n",
" <tr><th>2</th><td></td><td>MALE</td></tr>\n",
" <tr><th>3</th><td></td><td>FEMALE</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " species sex\n", "0 Chinstrap \n", "1 Gentoo \n", "2 MALE\n", "3 FEMALE" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"species\", \"sex\"],\n", " nb_samples = 4,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "49", "metadata": {}, "source": [ "And a condition on these samples. For instance, here, she only wants female samples." ] }, { "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>sex</th><th>species</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td></td><td>Gentoo</td></tr>\n",
" <tr><th>1</th><td></td><td>Chinstrap</td></tr>\n",
" <tr><th>2</th><td></td><td>Gentoo</td></tr>\n",
" <tr><th>3</th><td></td><td>Gentoo</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " sex species\n", "0 Gentoo\n", "1 Chinstrap\n", "2 Gentoo\n", "3 Gentoo" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"mst\",\n", " epsilon=1.0,\n", " select_cols = [\"sex\", \"species\"],\n", " nb_samples = 4,\n", " condition = \"sex = FEMALE\",\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples" ] }, { "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "### DPCTGAN: Differentially Private Conditional Tabular GAN" ] }, { "cell_type": "markdown", "id": "52", "metadata": {}, "source": [ "She now tries DPCTGAN. A first warning lets her know that the random noise generation for this model is not cryptographically secure; if this is not acceptable to her, she can decide to stop using this synthesizer. Then, instead of a response, she gets a 422 error with an explanation." ] }, { "cell_type": "code", "execution_count": null, "id": "53", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:44: UserWarning: Warning:dpctgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "ename": "ExternalLibraryException", "evalue": "(, 'Error fitting model: sample_rate=5.0 is not a valid value. Please provide a float between 0 and 1. 
Try decreasing batch_size in synth_params (default batch_size=500).')", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mExternalLibraryException\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[25]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m res_dummy = \u001b[43mclient\u001b[49m\u001b[43m.\u001b[49m\u001b[43msmartnoise_synth\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43msynth_name\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mdpctgan\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mepsilon\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 6\u001b[39m res_dummy\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/smartnoise_synth.py:195\u001b[39m, in \u001b[36mSmartnoiseSynthClient.query\u001b[39m\u001b[34m(self, synth_name, epsilon, delta, select_cols, synth_params, nullable, constraints, dummy, return_model, condition, nb_samples, nb_rows, seed)\u001b[39m\n\u001b[32m 192\u001b[39m body = request_model.model_validate(body_dict)\n\u001b[32m 193\u001b[39m res = \u001b[38;5;28mself\u001b[39m.http_client.post(endpoint, body, SMARTNOISE_SYNTH_READ_TIMEOUT)\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mvalidate_model_response\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mhttp_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[43mres\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mQueryResponse\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/utils.py:93\u001b[39m, in \u001b[36mvalidate_model_response\u001b[39m\u001b[34m(client, response, response_model)\u001b[39m\n\u001b[32m 91\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m job.status == \u001b[33m\"\u001b[39m\u001b[33mfailed\u001b[39m\u001b[33m\"\u001b[39m:\n\u001b[32m 92\u001b[39m \u001b[38;5;28;01massert\u001b[39;00m job.error \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m, \u001b[33m\"\u001b[39m\u001b[33mjob \u001b[39m\u001b[38;5;132;01m{job_uid}\u001b[39;00m\u001b[33m failed without error !\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m---> \u001b[39m\u001b[32m93\u001b[39m \u001b[43mraise_error_from_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mjob\u001b[49m\u001b[43m.\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 95\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m response_model.model_validate(job.result)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/core/lomas_core/error_handler.py:150\u001b[39m, in \u001b[36mraise_error_from_model\u001b[39m\u001b[34m(error_model)\u001b[39m\n\u001b[32m 148\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_model.message)\n\u001b[32m 149\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m ExternalLibraryExceptionModel():\n\u001b[32m--> \u001b[39m\u001b[32m150\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(error_model.library, error_model.message)\n\u001b[32m 151\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m UnauthorizedAccessExceptionModel():\n\u001b[32m 152\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_model.message)\n", "\u001b[31mExternalLibraryException\u001b[39m: (, 'Error fitting model: sample_rate=5.0 is not a valid value. 
Please provide a float between 0 and 1. Try decreasing batch_size in synth_params (default batch_size=500).')" ] } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", "    synth_name=\"dpctgan\",\n", "    epsilon=1.0,\n", "    dummy=True,\n", ")\n", "res_dummy" ] }, { "cell_type": "markdown", "id": "54", "metadata": {}, "source": [ "The default parameters of DPCTGAN do not work for the PENGUIN dataset: the default `batch_size` of 500 leads to an invalid `sample_rate`. As advised in the error message, she decreases `batch_size` (she also checks the documentation [here](https://docs.smartnoise.org/synth/synthesizers/dpctgan.html))." ] }, { "cell_type": "code", "execution_count": null, "id": "55", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>45.833347</td><td>16.692103</td><td>194.082665</td><td>3149.535030</td><td>FEMALE</td></tr>\n",
 "    <tr><th>1</th><td>Chinstrap</td><td>Biscoe</td><td>53.732724</td><td>18.273553</td><td>177.004233</td><td>5117.040396</td><td>FEMALE</td></tr>\n",
 "    <tr><th>2</th><td>Adelie</td><td>Torgersen</td><td>49.115819</td><td>16.810560</td><td>219.699721</td><td>5106.081523</td><td>FEMALE</td></tr>\n",
 "    <tr><th>3</th><td>Adelie</td><td>Biscoe</td><td>42.522341</td><td>16.397532</td><td>201.215174</td><td>5495.932743</td><td>MALE</td></tr>\n",
 "    <tr><th>4</th><td>Adelie</td><td>Torgersen</td><td>39.654274</td><td>16.744885</td><td>228.313026</td><td>4522.405903</td><td>FEMALE</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 45.833347 16.692103 194.082665 \n", "1 Chinstrap Biscoe 53.732724 18.273553 177.004233 \n", "2 Adelie Torgersen 49.115819 16.810560 219.699721 \n", "3 Adelie Biscoe 42.522341 16.397532 201.215174 \n", "4 Adelie Torgersen 39.654274 16.744885 228.313026 \n", "\n", " body_mass_g sex \n", "0 3149.535030 FEMALE \n", "1 5117.040396 FEMALE \n", "2 5106.081523 FEMALE \n", "3 5495.932743 MALE \n", "4 4522.405903 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", "    synth_name=\"dpctgan\",\n", "    epsilon=1.0,\n", "    synth_params={\"batch_size\": 50},\n", "    dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "56", "metadata": {}, "source": [ "## PATEGAN: Private Aggregation of Teacher Ensembles" ] }, { "cell_type": "markdown", "id": "57", "metadata": {}, "source": [ "Unfortunately, she is not able to train the PATEGAN synthesizer on the PENGUIN dataset, so she must try another one." 
] }, { "cell_type": "code", "execution_count": null, "id": "58", "metadata": { "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "ExternalLibraryException", "evalue": "(, 'pategan not reliable with this dataset.')", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mExternalLibraryException\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[27]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m res_dummy = \u001b[43mclient\u001b[49m\u001b[43m.\u001b[49m\u001b[43msmartnoise_synth\u001b[49m\u001b[43m.\u001b[49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2\u001b[39m \u001b[43m \u001b[49m\u001b[43msynth_name\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mpategan\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 3\u001b[39m \u001b[43m \u001b[49m\u001b[43mepsilon\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 5\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 6\u001b[39m res_dummy\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/libraries/smartnoise_synth.py:195\u001b[39m, in \u001b[36mSmartnoiseSynthClient.query\u001b[39m\u001b[34m(self, synth_name, epsilon, delta, select_cols, synth_params, nullable, constraints, dummy, return_model, condition, nb_samples, nb_rows, seed)\u001b[39m\n\u001b[32m 192\u001b[39m body = request_model.model_validate(body_dict)\n\u001b[32m 193\u001b[39m res = \u001b[38;5;28mself\u001b[39m.http_client.post(endpoint, body, SMARTNOISE_SYNTH_READ_TIMEOUT)\n\u001b[32m--> \u001b[39m\u001b[32m195\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mvalidate_model_response\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mhttp_client\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mres\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mQueryResponse\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/client/lomas_client/utils.py:93\u001b[39m, in \u001b[36mvalidate_model_response\u001b[39m\u001b[34m(client, response, response_model)\u001b[39m\n\u001b[32m 91\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m job.status == \u001b[33m\"\u001b[39m\u001b[33mfailed\u001b[39m\u001b[33m\"\u001b[39m:\n\u001b[32m 92\u001b[39m \u001b[38;5;28;01massert\u001b[39;00m job.error \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m, \u001b[33m\"\u001b[39m\u001b[33mjob \u001b[39m\u001b[38;5;132;01m{job_uid}\u001b[39;00m\u001b[33m failed without error !\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m---> \u001b[39m\u001b[32m93\u001b[39m \u001b[43mraise_error_from_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mjob\u001b[49m\u001b[43m.\u001b[49m\u001b[43merror\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 95\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m response_model.model_validate(job.result)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/work/sdd-poc-server/core/lomas_core/error_handler.py:150\u001b[39m, in \u001b[36mraise_error_from_model\u001b[39m\u001b[34m(error_model)\u001b[39m\n\u001b[32m 148\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_model.message)\n\u001b[32m 149\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m ExternalLibraryExceptionModel():\n\u001b[32m--> \u001b[39m\u001b[32m150\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(error_model.library, error_model.message)\n\u001b[32m 151\u001b[39m \u001b[38;5;28;01mcase\u001b[39;00m UnauthorizedAccessExceptionModel():\n\u001b[32m 152\u001b[39m 
\u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_model.message)\n", "\u001b[31mExternalLibraryException\u001b[39m: (, 'pategan not reliable with this dataset.')" ] } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"pategan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy" ] }, { "cell_type": "markdown", "id": "59", "metadata": {}, "source": [ "## PATECTGAN: Conditional tabular GAN using Private Aggregation of Teacher Ensembles" ] }, { "cell_type": "code", "execution_count": null, "id": "60", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Adelie</td><td>Torgersen</td><td>40.007473</td><td>14.863616</td><td>177.771713</td><td>4367.781503</td><td>MALE</td></tr>\n",
 "    <tr><th>1</th><td>Chinstrap</td><td>Biscoe</td><td>47.799655</td><td>18.101346</td><td>182.233909</td><td>4781.415079</td><td>MALE</td></tr>\n",
 "    <tr><th>2</th><td>Chinstrap</td><td>Biscoe</td><td>41.795687</td><td>16.121351</td><td>193.219110</td><td>3124.987453</td><td>MALE</td></tr>\n",
 "    <tr><th>3</th><td>Gentoo</td><td>Dream</td><td>41.408596</td><td>21.911954</td><td>180.690348</td><td>4655.957984</td><td>FEMALE</td></tr>\n",
 "    <tr><th>4</th><td>Gentoo</td><td>Biscoe</td><td>41.825240</td><td>17.597221</td><td>190.128309</td><td>2562.520325</td><td>FEMALE</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Torgersen 40.007473 14.863616 177.771713 \n", "1 Chinstrap Biscoe 47.799655 18.101346 182.233909 \n", "2 Chinstrap Biscoe 41.795687 16.121351 193.219110 \n", "3 Gentoo Dream 41.408596 21.911954 180.690348 \n", "4 Gentoo Biscoe 41.825240 17.597221 190.128309 \n", "\n", " body_mass_g sex \n", "0 4367.781503 MALE \n", "1 4781.415079 MALE \n", "2 3124.987453 MALE \n", "3 4655.957984 FEMALE \n", "4 2562.520325 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"patectgan\",\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "61", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>island</th><th>bill_length_mm</th><th>body_mass_g</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Dream</td><td>62.184163</td><td>3563.350335</td></tr>\n",
 "    <tr><th>1</th><td>Biscoe</td><td>58.693441</td><td>2519.153178</td></tr>\n",
 "    <tr><th>2</th><td>Biscoe</td><td>45.244734</td><td>5277.579844</td></tr>\n",
 "    <tr><th>3</th><td>Torgersen</td><td>53.086722</td><td>2477.480292</td></tr>\n",
 "    <tr><th>4</th><td>Dream</td><td>39.586384</td><td>4253.510337</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " island bill_length_mm body_mass_g\n", "0 Dream 62.184163 3563.350335\n", "1 Biscoe 58.693441 2519.153178\n", "2 Biscoe 45.244734 5277.579844\n", "3 Torgersen 53.086722 2477.480292\n", "4 Dream 39.586384 4253.510337" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", "    synth_name=\"patectgan\",\n", "    epsilon=1.0,\n", "    select_cols=[\"island\", \"bill_length_mm\", \"body_mass_g\"],\n", "    synth_params={\n", "        \"embedding_dim\": 256,\n", "        \"generator_dim\": (128, 128),\n", "        \"discriminator_dim\": (256, 256),\n", "        \"generator_lr\": 0.0003,\n", "        \"generator_decay\": 1e-05,\n", "        \"discriminator_lr\": 0.0003,\n", "        \"discriminator_decay\": 1e-05,\n", "        \"batch_size\": 500\n", "    },\n", "    nb_samples=100,\n", "    dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "62", "metadata": {}, "source": [ "## DPGAN: Differentially Private GAN" ] }, { "cell_type": "markdown", "id": "63", "metadata": {}, "source": [ "As with DPCTGAN, DPGAN raises a warning that its random generator for noise and shuffling is not cryptographically secure." ] }, { "cell_type": "code", "execution_count": null, "id": "64", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:44: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Gentoo</td><td>Dream</td><td>50.926630</td><td>23.000000</td><td>196.682433</td><td>4906.792127</td><td>FEMALE</td></tr>\n",
 "    <tr><th>1</th><td>Chinstrap</td><td>Dream</td><td>43.686233</td><td>22.855870</td><td>186.157387</td><td>4108.924724</td><td>FEMALE</td></tr>\n",
 "    <tr><th>2</th><td>Adelie</td><td>Biscoe</td><td>43.874988</td><td>22.465074</td><td>250.000000</td><td>4141.524814</td><td>FEMALE</td></tr>\n",
 "    <tr><th>3</th><td>Gentoo</td><td>Dream</td><td>49.637254</td><td>19.829533</td><td>190.552057</td><td>3293.796897</td><td>FEMALE</td></tr>\n",
 "    <tr><th>4</th><td>Gentoo</td><td>Biscoe</td><td>65.000000</td><td>23.000000</td><td>185.239148</td><td>4287.198659</td><td>FEMALE</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Dream 50.926630 23.000000 196.682433 \n", "1 Chinstrap Dream 43.686233 22.855870 186.157387 \n", "2 Adelie Biscoe 43.874988 22.465074 250.000000 \n", "3 Gentoo Dream 49.637254 19.829533 190.552057 \n", "4 Gentoo Biscoe 65.000000 23.000000 185.239148 \n", "\n", " body_mass_g sex \n", "0 4906.792127 FEMALE \n", "1 4108.924724 FEMALE \n", "2 4141.524814 FEMALE \n", "3 3293.796897 FEMALE \n", "4 4287.198659 FEMALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", "    synth_name=\"dpgan\",\n", "    epsilon=1.0,\n", "    dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "65", "metadata": {}, "source": [ "Finally, she samples only rows that satisfy a condition:" ] }, { "cell_type": "code", "execution_count": null, "id": "66", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:44: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Adelie</td><td>Biscoe</td><td>65.000000</td><td>17.500878</td><td>194.463225</td><td>5220.095709</td><td>FEMALE</td></tr>\n",
 "    <tr><th>1</th><td>Gentoo</td><td>Torgersen</td><td>65.000000</td><td>17.846123</td><td>236.159381</td><td>7000.000000</td><td>FEMALE</td></tr>\n",
 "    <tr><th>2</th><td>Adelie</td><td>Biscoe</td><td>62.844849</td><td>17.839089</td><td>195.672168</td><td>7000.000000</td><td>FEMALE</td></tr>\n",
 "    <tr><th>3</th><td>Adelie</td><td>Dream</td><td>62.495059</td><td>23.000000</td><td>213.272040</td><td>7000.000000</td><td>MALE</td></tr>\n",
 "    <tr><th>4</th><td>Chinstrap</td><td>Biscoe</td><td>65.000000</td><td>16.639676</td><td>228.477314</td><td>7000.000000</td><td>MALE</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Adelie Biscoe 65.000000 17.500878 194.463225 \n", "1 Gentoo Torgersen 65.000000 17.846123 236.159381 \n", "2 Adelie Biscoe 62.844849 17.839089 195.672168 \n", "3 Adelie Dream 62.495059 23.000000 213.272040 \n", "4 Chinstrap Biscoe 65.000000 16.639676 228.477314 \n", "\n", " body_mass_g sex \n", "0 5220.095709 FEMALE \n", "1 7000.000000 FEMALE \n", "2 7000.000000 FEMALE \n", "3 7000.000000 MALE \n", "4 7000.000000 MALE " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_dummy = client.smartnoise_synth.query(\n", "    synth_name=\"dpgan\",\n", "    epsilon=1.0,\n", "    condition=\"body_mass_g > 5000\",\n", "    dummy=True,\n", ")\n", "res_dummy.result.df_samples.head()" ] }, { "cell_type": "markdown", "id": "67", "metadata": {}, "source": [ "She now runs the same query on the real dataset by setting `dummy=False`:" ] }, { "cell_type": "code", "execution_count": null, "id": "68", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/azureuser/work/sdd-poc-server/client/lomas_client/utils.py:44: UserWarning: Warning:dpgan synthesizer random generator for noise and shuffling is not cryptographically secure. (pseudo-rng in vanilla PyTorch).\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>species</th><th>island</th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>sex</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Gentoo</td><td>Torgersen</td><td>43.037990</td><td>17.304015</td><td>242.335990</td><td>5220.064640</td><td>FEMALE</td></tr>\n",
 "    <tr><th>1</th><td>Gentoo</td><td>Torgersen</td><td>54.582117</td><td>18.683831</td><td>196.131904</td><td>5238.780871</td><td>FEMALE</td></tr>\n",
 "    <tr><th>2</th><td>Gentoo</td><td>Torgersen</td><td>47.120516</td><td>22.075984</td><td>181.180426</td><td>7000.000000</td><td></td></tr>\n",
 "    <tr><th>3</th><td>Chinstrap</td><td>Torgersen</td><td>65.000000</td><td>17.883028</td><td>197.028062</td><td>5078.756407</td><td>MALE</td></tr>\n",
 "    <tr><th>4</th><td>Adelie</td><td>Dream</td><td>65.000000</td><td>16.657900</td><td>231.202775</td><td>7000.000000</td><td>MALE</td></tr>\n",
 "    <tr><th>5</th><td>Adelie</td><td>Dream</td><td>65.000000</td><td>17.988916</td><td>185.333617</td><td>7000.000000</td><td>MALE</td></tr>\n",
 "    <tr><th>6</th><td>Adelie</td><td>Torgersen</td><td>65.000000</td><td>18.632700</td><td>250.000000</td><td>5856.899589</td><td>MALE</td></tr>\n",
 "    <tr><th>7</th><td>Adelie</td><td>Biscoe</td><td>44.833169</td><td>15.574961</td><td>191.562866</td><td>6073.620439</td><td>FEMALE</td></tr>\n",
 "    <tr><th>8</th><td>Adelie</td><td>Dream</td><td>65.000000</td><td>17.337221</td><td>199.819478</td><td>7000.000000</td><td></td></tr>\n",
 "    <tr><th>9</th><td>Adelie</td><td>Torgersen</td><td>43.839323</td><td>21.674445</td><td>212.702855</td><td>7000.000000</td><td></td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Torgersen 43.037990 17.304015 242.335990 \n", "1 Gentoo Torgersen 54.582117 18.683831 196.131904 \n", "2 Gentoo Torgersen 47.120516 22.075984 181.180426 \n", "3 Chinstrap Torgersen 65.000000 17.883028 197.028062 \n", "4 Adelie Dream 65.000000 16.657900 231.202775 \n", "5 Adelie Dream 65.000000 17.988916 185.333617 \n", "6 Adelie Torgersen 65.000000 18.632700 250.000000 \n", "7 Adelie Biscoe 44.833169 15.574961 191.562866 \n", "8 Adelie Dream 65.000000 17.337221 199.819478 \n", "9 Adelie Torgersen 43.839323 21.674445 212.702855 \n", "\n", " body_mass_g sex \n", "0 5220.064640 FEMALE \n", "1 5238.780871 FEMALE \n", "2 7000.000000 \n", "3 5078.756407 MALE \n", "4 7000.000000 MALE \n", "5 7000.000000 MALE \n", "6 5856.899589 MALE \n", "7 6073.620439 FEMALE \n", "8 7000.000000 \n", "9 7000.000000 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_real = client.smartnoise_synth.query(\n", "    synth_name=\"dpgan\",\n", "    epsilon=1.0,\n", "    condition=\"body_mass_g > 5000\",\n", "    nb_samples=10,\n", "    dummy=False,\n", ")\n", "res_real.result.df_samples" ] }, { "cell_type": "markdown", "id": "69", "metadata": {}, "source": [ "## Step 6: See archives of queries" ] }, { "cell_type": "markdown", "id": "70", "metadata": {}, "source": [ "She now wants to review all the queries that she ran on the real data. This is possible because an archive of all queries is kept in a secure database. With a single function call, she can see her queries, the budget they spent and the associated responses." 
] }, { "cell_type": "code", "execution_count": null, "id": "71", "metadata": {}, "outputs": [], "source": [ "previous_queries = client.get_previous_queries()" ] }, { "cell_type": "markdown", "id": "72", "metadata": {}, "source": [ "Let's check the last query" ] }, { "cell_type": "code", "execution_count": null, "id": "73", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr.Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_library': 'smartnoise_synth',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'synth_name': 'dpgan',\n", " 'epsilon': 1.0,\n", " 'delta': None,\n", " 'select_cols': [],\n", " 'synth_params': {},\n", " 'nullable': True,\n", " 'constraints': '',\n", " 'return_model': False,\n", " 'condition': 'body_mass_g > 5000',\n", " 'nb_samples': 10},\n", " 'response': {'epsilon': 1.0,\n", " 'delta': 0.00015673368198174188,\n", " 'requested_by': 'Dr.Antartica',\n", " 'result': res_type \\\n", " index sn_synth_samples \n", " columns sn_synth_samples \n", " data sn_synth_samples \n", " index_names sn_synth_samples \n", " column_names sn_synth_samples \n", " \n", " df_samples \n", " index [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] \n", " columns [species, island, bill_length_mm, bill_depth_m... \n", " data [[Gentoo, Torgersen, 43.03798981010914, 17.304... \n", " index_names [None] \n", " column_names [None] },\n", " 'timestamp': 1746086943.2004273}" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "last_query = previous_queries[-1]\n", "last_query" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }