{ "cells": [ { "cell_type": "markdown", "id": "3f18d338", "metadata": {}, "source": [ "# Lomas: Client demo" ] }, { "cell_type": "markdown", "id": "1582a2ae", "metadata": {}, "source": [ "This notebook showcases how a researcher could use the Lomas platform. It explains the different functionalities provided by the `lomas-client` library to interact with the secure server.\n", "\n", "The secure data is never visible to researchers. They can only access differentially private responses via queries to the server.\n", "\n", "Each user has access to one or more projects and, for each dataset, has a limited budget expressed as $\\epsilon$ and $\\delta$ values." ] }, { "cell_type": "code", "execution_count": 1, "id": "23bb4f13-7800-41b2-b429-68c2d02243d0", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 1, "metadata": { "image/png": { "width": 800 } }, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(filename=\"images/image_demo_client.png\", width=800)" ] }, { "cell_type": "markdown", "id": "5b73135c", "metadata": {}, "source": [ "🐧🐧🐧\n", "In this notebook the researcher is a penguin researcher named Dr. Antartica. She aims to do groundbreaking research on various penguin dimensions.\n", "\n", "Therefore, the powerful queen Icebergina 👑 had the data collected. But in order to get the penguins to agree to participate, she promised them that no one would be able to look at the data and that no one would be able to guess the bill width of any specific penguin (which is very sensitive information) from the data. Nobody! Not even the researchers. The queen hence stored the data on the Secure Data Disclosure Server and only gave a small budget to Dr. Antartica.\n", "\n", "This is not a problem for Dr. Antartica, as she does not need to see the data to compute statistics thanks to the Secure Data Disclosure Client library `lomas-client`. 
\n", "🐧🐧🐧" ] }, { "cell_type": "markdown", "id": "01ae30d2", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the library `lomas-client` in her local development environment. \n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": 31, "id": "28fbdd79-8c15-49a9-bcf9-fcdeac09d2b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: lomas-client in /usr/local/lib/python3.12/site-packages (0.3.3)\n", "Requirement already satisfied: diffprivlib>=0.6.4 in /usr/local/lib/python3.12/site-packages (from lomas-client) (0.6.4)\n", "Requirement already satisfied: diffprivlib-logger>=0.0.3 in /usr/local/lib/python3.12/site-packages (from lomas-client) (0.0.3)\n", "Requirement already satisfied: numpy>=1.26.2 in /usr/local/lib/python3.12/site-packages (from lomas-client) (1.26.2)\n", "Requirement already satisfied: opendp==0.10.0 in /usr/local/lib/python3.12/site-packages (from lomas-client) (0.10.0)\n", "Requirement already satisfied: opendp-logger==0.3.0 in /usr/local/lib/python3.12/site-packages (from lomas-client) (0.3.0)\n", "Requirement already satisfied: pandas>=2.2.2 in /usr/local/lib/python3.12/site-packages (from lomas-client) (2.2.2)\n", "Requirement already satisfied: requests>=2.32.0 in /usr/local/lib/python3.12/site-packages (from lomas-client) (2.32.0)\n", "Requirement already satisfied: scikit-learn==1.4.0 in /usr/local/lib/python3.12/site-packages (from lomas-client) (1.4.0)\n", "Requirement already satisfied: smartnoise-synth==1.0.4 in /usr/local/lib/python3.12/site-packages (from lomas-client) (1.0.4)\n", "Requirement already satisfied: smartnoise-synth-logger==0.0.3 in /usr/local/lib/python3.12/site-packages (from lomas-client) (0.0.3)\n", "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.12/site-packages (from 
scikit-learn==1.4.0->lomas-client) (1.14.1)\n", "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/site-packages (from scikit-learn==1.4.0->lomas-client) (1.4.2)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.12/site-packages (from scikit-learn==1.4.0->lomas-client) (3.5.0)\n", "Requirement already satisfied: Faker>=17.0.0 in /usr/local/lib/python3.12/site-packages (from smartnoise-synth==1.0.4->lomas-client) (30.6.0)\n", "Requirement already satisfied: opacus<0.15.0,>=0.14.0 in /usr/local/lib/python3.12/site-packages (from smartnoise-synth==1.0.4->lomas-client) (0.14.0)\n", "Requirement already satisfied: pac-synth<0.0.9,>=0.0.8 in /usr/local/lib/python3.12/site-packages (from smartnoise-synth==1.0.4->lomas-client) (0.0.8)\n", "Requirement already satisfied: smartnoise-sql<2.0.0,>=1.0.4 in /usr/local/lib/python3.12/site-packages (from smartnoise-synth==1.0.4->lomas-client) (1.0.4)\n", "Requirement already satisfied: torch>=2.2.0 in /usr/local/lib/python3.12/site-packages (from smartnoise-synth==1.0.4->lomas-client) (2.4.1)\n", "Requirement already satisfied: setuptools>=49.0.0 in /usr/local/lib/python3.12/site-packages (from diffprivlib>=0.6.4->lomas-client) (75.2.0)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/site-packages (from pandas>=2.2.2->lomas-client) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/site-packages (from pandas>=2.2.2->lomas-client) (2024.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/site-packages (from pandas>=2.2.2->lomas-client) (2024.2)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.12/site-packages (from requests>=2.32.0->lomas-client) (3.4.0)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/site-packages (from requests>=2.32.0->lomas-client) (3.10)\n", "Requirement already satisfied: 
urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/site-packages (from requests>=2.32.0->lomas-client) (2.2.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/site-packages (from requests>=2.32.0->lomas-client) (2024.8.30)\n", "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.12/site-packages (from Faker>=17.0.0->smartnoise-synth==1.0.4->lomas-client) (4.12.2)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas>=2.2.2->lomas-client) (1.16.0)\n", "Requirement already satisfied: PyYAML<7.0.0,>=6.0.1 in /usr/local/lib/python3.12/site-packages (from smartnoise-sql<2.0.0,>=1.0.4->smartnoise-synth==1.0.4->lomas-client) (6.0.2)\n", "Requirement already satisfied: antlr4-python3-runtime==4.9.3 in /usr/local/lib/python3.12/site-packages (from smartnoise-sql<2.0.0,>=1.0.4->smartnoise-synth==1.0.4->lomas-client) (4.9.3)\n", "Requirement already satisfied: graphviz<0.18,>=0.17 in /usr/local/lib/python3.12/site-packages (from smartnoise-sql<2.0.0,>=1.0.4->smartnoise-synth==1.0.4->lomas-client) (0.17)\n", "Requirement already satisfied: sqlalchemy<3.0.0,>=2.0.0 in /usr/local/lib/python3.12/site-packages (from smartnoise-sql<2.0.0,>=1.0.4->smartnoise-synth==1.0.4->lomas-client) (2.0.36)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (3.16.1)\n", "Requirement already satisfied: sympy in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (1.13.3)\n", "Requirement already satisfied: networkx in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (3.4.1)\n", "Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (3.1.4)\n", "Requirement already satisfied: fsspec in 
/usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (2024.9.0)\n", "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.105)\n", "Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (9.1.0.70)\n", "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.3.1)\n", "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (11.0.2.54)\n", "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (10.3.2.106)\n", "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (11.4.5.107)\n", "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.0.106)\n", "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (2.20.5)\n", "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in 
/usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.1.105)\n", "Requirement already satisfied: triton==3.0.0 in /usr/local/lib/python3.12/site-packages (from torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (3.0.0)\n", "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.12/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (12.6.77)\n", "Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.12/site-packages (from sqlalchemy<3.0.0,>=2.0.0->smartnoise-sql<2.0.0,>=1.0.4->smartnoise-synth==1.0.4->lomas-client) (3.1.1)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/site-packages (from jinja2->torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (3.0.1)\n", "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/site-packages (from sympy->torch>=2.2.0->smartnoise-synth==1.0.4->lomas-client) (1.3.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip install lomas-client" ] }, { "cell_type": "code", "execution_count": 2, "id": "6fb569fc", "metadata": {}, "outputs": [], "source": [ "from lomas_client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "9c63718b", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment. It enables seamless interaction with the server. 
\n", "\n", "To create the client, Dr. Antartica needs to give it a few parameters:\n", "- url: the root application endpoint of the remote secure server.\n", "- user_name: her name as registered in the database (Dr. Alice Antartica)\n", "- dataset_name: the name of the dataset that she wants to query (PENGUIN)\n", "\n", "She will only be able to query the real dataset if Queen Icebergina has previously created an account for her in the database, given her access to the PENGUIN dataset and granted her some epsilon and delta credit (as is done in the Admin Notebook for Users and Datasets management)." ] }, { "cell_type": "code", "execution_count": 3, "id": "941991f7", "metadata": {}, "outputs": [], "source": [ "APP_URL = \"http://lomas_server\"\n", "USER_NAME = \"Dr. Antartica\"\n", "DATASET_NAME = \"PENGUIN\"\n", "client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)" ] }, { "cell_type": "markdown", "id": "0ec400c8", "metadata": {}, "source": [ "And that's it for the preparation. She is now ready to use the various functionalities offered by `lomas_client`." ] }, { "cell_type": "markdown", "id": "9b9a5f13", "metadata": {}, "source": [ "## Step 3: Understand the functionalities of the library" ] }, { "cell_type": "markdown", "id": "c7cb5531", "metadata": {}, "source": [ "### a. Getting dataset metadata\n", "\n", "Dr. Antartica has never seen the data and, as a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she just needs to call the `get_dataset_metadata()` function of the client. As this is public information, this does not cost any budget.\n", "\n", "This function returns metadata in a format based on the [SmartnoiseSQL dictionary format](https://docs.smartnoise.org/sql/metadata.html#dictionary-format), which includes, among other things, information about all the available columns, their types and their bound values (see the Smartnoise page for more details). 
Any metadata that is required for Smartnoise-SQL is also required here, and additional information, such as the different categories in a string-type column, can be added." ] }, { "cell_type": "code", "execution_count": 4, "id": "0fdebac9-57fc-4410-878b-5a77425af634", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 344,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 
'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin_metadata = client.get_dataset_metadata()\n", "penguin_metadata" ] }, { "cell_type": "markdown", "id": "d338ed96", "metadata": {}, "source": [ "Based on this, Dr. Antartica knows that there are 7 columns: 3 of string type (species, island, sex) with their associated categories (e.g. the species column has 3 possibilities: 'Adelie', 'Chinstrap', 'Gentoo') and 4 of float type (bill length, bill depth, flipper length and body mass) with their associated bounds (e.g. the body mass of a penguin ranges from 2000 to 7000 grams). She also knows, based on the field `max_ids: 1`, that each penguin appears at most once in the dataset and, based on the field `row_privacy: True`, that each row represents a single penguin. Finally, she learns that there are 344 rows in the dataset and hence 344 penguins." ] }, { "cell_type": "code", "execution_count": 5, "id": "8719c070-16a3-4228-a09f-944178aa1ba7", "metadata": {}, "outputs": [], "source": [ "NB_PENGUINS = penguin_metadata[\"rows\"]" ] }, { "cell_type": "markdown", "id": "5a3c899d", "metadata": {}, "source": [ "### b. Get a dummy dataset\n", "\n", "Now that she has seen and understood the metadata, she wants to get an even better understanding of the dataset (but is still not able to see it). One way to get an idea of what the dataset looks like is to create a dummy dataset. \n", "\n", "Based on the public metadata of the dataset, a random dataframe can be created. 
By default, there will be 100 rows and the seed is set to 42 to ensure reproducibility, but these two parameters can be changed to obtain different dummy datasets.\n", "Getting a dummy dataset does not affect the budget as there is no differential privacy here. It is not a synthetic dataset and everything that could be learned here is already present in the public metadata (it is created randomly on the fly based on the metadata).\n", "\n", "Dr. Antartica first creates a dummy dataset with 100 rows and chooses a seed of 0." ] }, { "cell_type": "code", "execution_count": 6, "id": "01f4365a", "metadata": {}, "outputs": [], "source": [ "NB_ROWS = 100\n", "SEED = 0" ] }, { "cell_type": "code", "execution_count": 7, "id": "3f553b29", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(100, 7)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", "0 Gentoo Biscoe 46.799577 16.196816 239.680123 \n", "1 Chinstrap Dream 38.133052 14.875077 208.332005 \n", "2 Chinstrap Torgersen 58.065820 19.725266 154.021822 \n", "3 Adelie Torgersen 62.323556 14.951074 221.148682 \n", "4 Adelie Dream 39.314560 18.776879 206.902585 \n", "\n", " body_mass_g sex \n", "0 3010.840470 FEMALE \n", "1 6689.525543 MALE \n", "2 2473.883392 MALE \n", "3 2024.497075 FEMALE \n", "4 3614.604018 MALE " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dummy = client.get_dummy_dataset(\n", " nb_rows = NB_ROWS, \n", " seed = SEED\n", ")\n", "\n", "print(df_dummy.shape)\n", "df_dummy.head()" ] }, { "cell_type": "markdown", "id": "bb03bf8f-62b7-4929-8757-e662eee7de41", "metadata": {}, "source": [ "### c. Check privacy loss budget ε, δ (initial, current, remaining)" ] }, { "cell_type": "markdown", "id": "d4d5b481-0cf9-4de9-86b9-158ac413af6b", "metadata": {}, "source": [ "It is the first time that Dr. Antartica connects to the server and she wants to know how much budget has been assigned to her.\n", "Therefore, she calls the function `get_initial_budget`." ] }, { "cell_type": "code", "execution_count": 8, "id": "9bd99db9-9de9-4b25-8718-989fea27b15a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "InitialBudgetResponse(initial_epsilon=10.0, initial_delta=0.005)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_initial_budget()" ] }, { "cell_type": "markdown", "id": "b48a8f93-8be9-4b9d-a6e8-8811d749650d", "metadata": {}, "source": [ "She sees that she has 10.0 epsilon and 0.005 delta at her disposal.\n", "\n", "Then she checks her total spent budget with `get_total_spent_budget`. As she has only queried the metadata and dummy dataframes, this should still be 0." 
] }, { "cell_type": "code", "execution_count": 9, "id": "99a4dd26-53af-412e-bcd1-f06fff57e6a4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpentBudgetResponse(total_spent_epsilon=1.0, total_spent_delta=4.999999999999449e-05)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_total_spent_budget()" ] }, { "cell_type": "markdown", "id": "c4c7d708-90f9-4bdf-a93e-ea3007609b62", "metadata": {}, "source": [ "It will also be useful to know what the remaining budget is. Therefore, she calls the function `get_remaining_budget`. It simply subtracts the total spent budget from the initial budget." ] }, { "cell_type": "code", "execution_count": 10, "id": "f67e0596-5f96-4c8b-a843-3fbaef02bab1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=9.0, remaining_delta=0.004950000000000006)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "markdown", "id": "1ac73b0b-f9db-4513-af13-0bb2f75f8977", "metadata": {}, "source": [ "As expected, for now the remaining budget is equal to the initial budget." 
] }, { "cell_type": "markdown", "id": "a05e25b0-4ece-45b7-af37-cb3d5c987bac", "metadata": {}, "source": [ "## Step 4: Use DP libraries to analyse the dataset\n", "Available DP libraries are:\n", "- Smartnoise-SQL for SQL-like queries\n", "- Smartnoise-Synth for generating synthetic datasets\n", "- OpenDP for summary statistics\n", "- DiffPrivLib for training Machine Learning models" ] }, { "cell_type": "markdown", "id": "7533c5e9-b937-41c7-8271-2a9d46e8d228", "metadata": {}, "source": [ "For each library, there are three possibilities: \n", "- estimate the cost of a query (will NOT spend privacy loss budget)\n", "- query on a 'dummy' dataset (explained below) (will NOT spend privacy loss budget)\n", "- query on the private dataset (WILL SPEND PRIVACY LOSS BUDGET)" ] }, { "cell_type": "markdown", "id": "dd5d21f0-73de-426d-b25b-7e991787b7af", "metadata": {}, "source": [ "### a. Compute average bill length with Smartnoise-SQL" ] }, { "cell_type": "markdown", "id": "42df0cae-1a9e-4c4b-ba0f-beac838d8826", "metadata": {}, "source": [ "Dr. Antartica wants to know the average bill length of penguins. Therefore, she will use the `smartnoise-sql` library and write the associated SQL command." ] }, { "cell_type": "code", "execution_count": 11, "id": "a0de0cfa-af54-46f7-9144-8778fb1a66c5", "metadata": {}, "outputs": [], "source": [ "# Average bill length in mm\n", "QUERY = \"SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df\"" ] }, { "cell_type": "markdown", "id": "211510aa-b4ce-4cfa-9ffb-4a540b5e0c49", "metadata": {}, "source": [ "#### Estimate cost of a query with smartnoise-sql\n", "She will then estimate the cost of this query. In the various DP libraries, the budget actually used by a query on the server might differ slightly from what the user requests as input. 
The `cost` function of each library (for example `client.smartnoise_sql.cost`) returns the cost that will effectively be spent and deducted if the query is applied to the sensitive dataset.\n", "\n", "The user can then decide to use this budget or modify it. Again, estimating the cost does not impact the user's budget.\n", "\n", "Dr. Antartica checks how much budget computing the average bill length will really cost her if she submits the query with a given `epsilon` and `delta`." ] }, { "cell_type": "code", "execution_count": 12, "id": "47663a1f-2b91-4f8a-8565-b3d7c9667e76", "metadata": {}, "outputs": [], "source": [ "EPSILON = 0.5\n", "DELTA = 1e-4" ] }, { "cell_type": "code", "execution_count": 13, "id": "e1a2b948-cf11-4325-a05e-147a0b4aaa30", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.0, delta=4.999999999999449e-05)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost = client.smartnoise_sql.cost(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA\n", ")\n", "cost" ] }, { "cell_type": "code", "execution_count": 14, "id": "4547b70f-0623-4ae6-93f1-9eaca724e514", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This query would actually cost her 1.0 epsilon and 4.999999999999449e-05 delta.\n" ] } ], "source": [ "print(f\"This query would actually cost her {cost.epsilon} epsilon and {cost.delta} delta.\")" ] }, { "cell_type": "markdown", "id": "a93f7c2b-e30a-4be5-bf56-ad61f2834673", "metadata": {}, "source": [ "She decides that it is good enough." ] }, { "cell_type": "markdown", "id": "98e6fda2-dde7-4f8b-a787-c9a1e3571ebe", "metadata": {}, "source": [ "#### Query average bill length on dummy dataset with smartnoise-sql\n", "She now wants to start querying the real dataset for her research. \n", "\n", "However, her budget is limited and it would be a waste to spend it by mistake on a coding error. 
Therefore, the client/server pipeline offers functional testing capabilities to users. It is possible to test a query on a `dummy` dataset to ensure that everything is working properly. Dr. Antartica will not be able to use the results of a dummy query for her analysis (as the data is random), but if the query works on the dummy dataset, she can be confident that it will also work on the real dataset.\n", "This functional testing on the dummy does not have any impact on the budget as it is on random data only.\n", "\n", "To test on the dummy data instead of the real data, the function call is exactly the same, with the only exception of the flag `dummy=True`. In the following cell, she will test with `smartnoise_sql.query`, but the flag is the same for `opendp.query`. She can optionally give two additional parameters to set the seed and the number of rows of the dummy dataset.\n", "\n", "Another, more advanced possibility for functional tests with the dummy is to compare the results of queries on a local dummy and on the remote dummy with a very high budget: \n", "- create a local dummy in the notebook with a specific seed and number of rows\n", "- compute the desired query locally on this local dummy with Python libraries such as numpy\n", "- query the server on the same remote dummy (with `dummy=True`, the same seed and the same number of rows) and a very large budget to limit noise as much as possible (don't worry, this won't cost any real budget)\n", "- compare and verify that the local and remote dummy give similar results." ] }, { "cell_type": "markdown", "id": "d1f8ea18-ccab-4f75-9490-b4d1144b39db", "metadata": {}, "source": [ "Dr. Antartica will follow the best practice and now try the query to get the average bill length (in mm) on the dummy dataset. 
She does not forget to:\n", "- set the `dummy` flag to True\n", "- set very high budget values so that she can compare the results with a similar local dummy (with the same seed and number of rows) if she wants to verify that the function does what is expected. Here she will just check that the average bill length is within reasonable bounds." ] }, { "cell_type": "code", "execution_count": 15, "id": "90cf2a6d", "metadata": {}, "outputs": [], "source": [ "# On the remote server dummy dataframe\n", "dummy_res = client.smartnoise_sql.query(\n", " query = QUERY,\n", " epsilon = 100.0,\n", " delta = 0.99,\n", " dummy = True, \n", " nb_rows = NB_ROWS,\n", " seed = SEED\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "id": "a30f277e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average bill length in remote dummy: 48.58mm.\n" ] } ], "source": [ "print(f\"Average bill length in remote dummy: {np.round(dummy_res.result.df['avg_bill_length_mm'][0], 2)}mm.\")" ] }, { "cell_type": "markdown", "id": "167e8c6d-6c93-4ab4-9ba7-bf7e783a6bc2", "metadata": {}, "source": [ "No functional errors occurred and the average bill length is within reasonable bounds. She is now even more confident in using her query on the server." ] }, { "cell_type": "markdown", "id": "e5379edf", "metadata": {}, "source": [ "#### Query average bill length on private dataset with smartnoise-sql\n", "Now that all the safeguard functions have been tested, Dr. Antartica is ready to query the real dataset and get a differentially private estimate of the average bill length. By default, the flag `dummy` is False, so setting it is optional. She uses the values of `epsilon` and `delta` that she selected just before.\n", "\n", "Careful: This command DOES spend the budget of the user and the remaining budget is updated for every query." 
] }, { "cell_type": "code", "execution_count": 17, "id": "19e60263", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=9.0, remaining_delta=0.004950000000000006)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "code", "execution_count": 18, "id": "69767fac", "metadata": {}, "outputs": [], "source": [ "response = client.smartnoise_sql.query(\n", " query = QUERY, \n", " epsilon = EPSILON, \n", " delta = DELTA,\n", " dummy = False # APPLIED ON SENSITIVE DATA, WILL SPEND BUDGET\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "id": "6dbbdf93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average bill length of penguins in real data: 43.52mm.\n" ] } ], "source": [ "avg_bill_length = np.round(response.result.df['avg_bill_length_mm'].iloc[0], 2)\n", "print(f\"Average bill length of penguins in real data: {avg_bill_length}mm.\")" ] }, { "cell_type": "markdown", "id": "b2767e65", "metadata": {}, "source": [ "After each query on the real dataset, the budget information is also returned to the researcher. It is possible to check the remaining budget again afterwards:" ] }, { "cell_type": "code", "execution_count": 20, "id": "39701fe5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RemainingBudgetResponse(remaining_epsilon=8.0, remaining_delta=0.004900000000000011)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_remaining_budget()" ] }, { "cell_type": "markdown", "id": "e37c587f", "metadata": {}, "source": [ "As can be seen with `get_total_spent_budget()`, it is the budget estimated with `client.smartnoise_sql.cost()` that was spent." 
] }, { "cell_type": "code", "execution_count": 21, "id": "487f835f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SpentBudgetResponse(total_spent_epsilon=2.0, total_spent_delta=9.999999999998899e-05)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client.get_total_spent_budget()" ] }, { "cell_type": "markdown", "id": "eef4afcd", "metadata": {}, "source": [ "Dr. Antartica now has a differentially private estimate of the average bill length of the penguins in the dataset and is confident in using the library for the rest of her analyses." ] }, { "cell_type": "markdown", "id": "04929993", "metadata": {}, "source": [ "### b. Compute confidence interval with opendp" ] }, { "cell_type": "code", "execution_count": 22, "id": "b9685226", "metadata": {}, "outputs": [], "source": [ "import opendp as dp\n", "import opendp.transformations as trans\n", "import opendp.measurements as meas" ] }, { "cell_type": "markdown", "id": "9d41bd58", "metadata": {}, "source": [ "She now wants the confidence interval of the bill length in mm. She already has the number of penguins and the average, from the metadata and the previous smartnoise-sql query respectively. She now needs the variance.\n", "\n", "#### Prepare opendp pipeline and verify on dummy\n", "She checks the metadata of the columns again to use the relevant values in the pipeline." 
] }, { "cell_type": "code", "execution_count": 23, "id": "4331d86f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': 
None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin_metadata[\"columns\"]" ] }, { "cell_type": "markdown", "id": "f90e0425", "metadata": {}, "source": [ "She can define the column names and the bounds of the relevant column." ] }, { "cell_type": "code", "execution_count": 24, "id": "ff8cb7b6", "metadata": {}, "outputs": [], "source": [ "columns = list(penguin_metadata[\"columns\"].keys())" ] }, { "cell_type": "code", "execution_count": 25, "id": "70b2bdb1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30.0, 65.0)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bill_length_min = penguin_metadata['columns']['bill_length_mm']['lower']\n", "bill_length_max = penguin_metadata['columns']['bill_length_mm']['upper']\n", "bill_length_min, bill_length_max" ] }, { "cell_type": "markdown", "id": "e93ae087", "metadata": {}, "source": [ "She can now define the transformation pipeline that computes the variance she wants from the data:" ] }, { "cell_type": "code", "execution_count": 26, "id": "75e4933b", "metadata": {}, "outputs": [], "source": [ "bill_length_transformation_pipeline = (\n", " trans.make_split_dataframe(separator=\",\", col_names=columns) >>\n", " trans.make_select_column(key=\"bill_length_mm\", TOA=str) >>\n", " trans.then_cast_default(TOA=float) >>\n", " trans.then_clamp(bounds=(bill_length_min, bill_length_max)) >>\n", " trans.then_resize(size=NB_PENGUINS, constant=avg_bill_length) >>\n", " trans.then_variance()\n", ")" ] }, { "cell_type": "markdown", "id": "411d464c", "metadata": {}, "source": [ "However, when she tries to execute it on the server, she gets an error (see below). 
" ] }, { "cell_type": "code", "execution_count": 27, "id": "8041a647", "metadata": {}, "outputs": [ { "ename": "ValidationError", "evalue": "1 validation error for tagged-union[InvalidQueryExceptionModel,ExternalLibraryExceptionModel,UnauthorizedAccessExceptionModel,InternalServerExceptionModel]\n JSON input should be string, bytes or bytearray [type=json_type, input_value={'type': 'InvalidQueryExc...cessed in this server.'}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.9/v/json_type", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValidationError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[27], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# No instruction for noise addition mechanism: Expect to fail !!!\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[43mclient\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mopendp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mopendp_pipeline\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mbill_length_transformation_pipeline\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 5\u001b[0m \u001b[43m)\u001b[49m\n", "File \u001b[0;32m/code/lomas_client/libraries/opendp.py:105\u001b[0m, in \u001b[0;36mOpenDPClient.query\u001b[0;34m(self, opendp_pipeline, fixed_delta, dummy, nb_rows, seed)\u001b[0m\n\u001b[1;32m 102\u001b[0m body \u001b[38;5;241m=\u001b[39m request_model\u001b[38;5;241m.\u001b[39mmodel_validate(body_dict)\n\u001b[1;32m 103\u001b[0m res \u001b[38;5;241m=\u001b[39m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhttp_client\u001b[38;5;241m.\u001b[39mpost(endpoint, body)\n\u001b[0;32m--> 105\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mvalidate_model_response\u001b[49m\u001b[43m(\u001b[49m\u001b[43mres\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mQueryResponse\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/code/lomas_client/utils.py:90\u001b[0m, in \u001b[0;36mvalidate_model_response\u001b[0;34m(response, response_model)\u001b[0m\n\u001b[1;32m 87\u001b[0m r_model \u001b[38;5;241m=\u001b[39m response_model\u001b[38;5;241m.\u001b[39mmodel_validate_json(data)\n\u001b[1;32m 88\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m r_model\n\u001b[0;32m---> 90\u001b[0m \u001b[43mraise_error\u001b[49m\u001b[43m(\u001b[49m\u001b[43mresponse\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 91\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n", "File \u001b[0;32m/code/lomas_client/utils.py:31\u001b[0m, in \u001b[0;36mraise_error\u001b[0;34m(response)\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mraise_error\u001b[39m(response: requests\u001b[38;5;241m.\u001b[39mResponse) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mstr\u001b[39m:\n\u001b[1;32m 23\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Raise error message based on the HTTP response.\u001b[39;00m\n\u001b[1;32m 24\u001b[0m \n\u001b[1;32m 25\u001b[0m \u001b[38;5;124;03m Args:\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[38;5;124;03m Server Error\u001b[39;00m\n\u001b[1;32m 30\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m---> 31\u001b[0m error_model \u001b[38;5;241m=\u001b[39m 
\u001b[43mLomasServerExceptionTypeAdapter\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_json\u001b[49m\u001b[43m(\u001b[49m\u001b[43mresponse\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mjson\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 32\u001b[0m \u001b[38;5;28;01mmatch\u001b[39;00m error_model:\n\u001b[1;32m 33\u001b[0m \u001b[38;5;28;01mcase\u001b[39;00m InvalidQueryExceptionModel():\n", "File \u001b[0;32m/usr/local/lib/python3.12/site-packages/pydantic/type_adapter.py:135\u001b[0m, in \u001b[0;36m_frame_depth..wrapper..wrapped\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 132\u001b[0m \u001b[38;5;129m@wraps\u001b[39m(func)\n\u001b[1;32m 133\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mwrapped\u001b[39m(\u001b[38;5;28mself\u001b[39m: TypeAdapterT, \u001b[38;5;241m*\u001b[39margs: P\u001b[38;5;241m.\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs: P\u001b[38;5;241m.\u001b[39mkwargs) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m R:\n\u001b[1;32m 134\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_with_frame_depth(depth \u001b[38;5;241m+\u001b[39m \u001b[38;5;241m1\u001b[39m): \u001b[38;5;66;03m# depth + 1 for the wrapper function\u001b[39;00m\n\u001b[0;32m--> 135\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/usr/local/lib/python3.12/site-packages/pydantic/type_adapter.py:384\u001b[0m, in \u001b[0;36mTypeAdapter.validate_json\u001b[0;34m(self, data, strict, context)\u001b[0m\n\u001b[1;32m 368\u001b[0m 
\u001b[38;5;129m@_frame_depth\u001b[39m(\u001b[38;5;241m1\u001b[39m)\n\u001b[1;32m 369\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mvalidate_json\u001b[39m(\n\u001b[1;32m 370\u001b[0m \u001b[38;5;28mself\u001b[39m, data: \u001b[38;5;28mstr\u001b[39m \u001b[38;5;241m|\u001b[39m \u001b[38;5;28mbytes\u001b[39m, \u001b[38;5;241m/\u001b[39m, \u001b[38;5;241m*\u001b[39m, strict: \u001b[38;5;28mbool\u001b[39m \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m, context: \u001b[38;5;28mdict\u001b[39m[\u001b[38;5;28mstr\u001b[39m, Any] \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 371\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m T:\n\u001b[1;32m 372\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Usage docs: https://docs.pydantic.dev/2.9/concepts/json/#json-parsing\u001b[39;00m\n\u001b[1;32m 373\u001b[0m \n\u001b[1;32m 374\u001b[0m \u001b[38;5;124;03m Validate a JSON string or bytes against the model.\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 382\u001b[0m \u001b[38;5;124;03m The validated object.\u001b[39;00m\n\u001b[1;32m 383\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 384\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidator\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalidate_json\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstrict\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mstrict\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcontext\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcontext\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[0;31mValidationError\u001b[0m: 1 validation error for 
tagged-union[InvalidQueryExceptionModel,ExternalLibraryExceptionModel,UnauthorizedAccessExceptionModel,InternalServerExceptionModel]\n JSON input should be string, bytes or bytearray [type=json_type, input_value={'type': 'InvalidQueryExc...cessed in this server.'}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.9/v/json_type" ] } ], "source": [ "# No instruction for noise addition mechanism: Expect to fail !!!\n", "client.opendp.query(\n", " opendp_pipeline = bill_length_transformation_pipeline,\n", " dummy=True\n", ")" ] }, { "cell_type": "markdown", "id": "d06c59dc", "metadata": {}, "source": [ "This is because the server only accepts measurement pipelines, that is, pipelines whose results are differentially private. She adds Laplace noise to the pipeline, which should allow the server to instantiate it." ] }, { "cell_type": "code", "execution_count": 28, "id": "b8162859", "metadata": {}, "outputs": [], "source": [ "var_bill_length_measurement_pipeline = (\n", " bill_length_transformation_pipeline >>\n", " meas.then_laplace(scale=5.0) # Noise addition mechanism instructions\n", ")" ] }, { "cell_type": "markdown", "id": "fc7e0ecd", "metadata": {}, "source": [ "Now that the pipeline is a measurement, she is able to apply it to the dummy dataset of the server."
] }, { "cell_type": "code", "execution_count": 29, "id": "df61bce0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy result for variance: 42.6\n" ] } ], "source": [ "dummy_var_res = client.opendp.query(\n", " opendp_pipeline = var_bill_length_measurement_pipeline, \n", " dummy=True\n", ")\n", "print(f\"Dummy result for variance: {np.round(dummy_var_res.result.value, 2)}\")" ] }, { "cell_type": "markdown", "id": "d52fd242-d45c-4a83-9fd0-01896db3e3eb", "metadata": {}, "source": [ "#### Estimate cost with opendp" ] }, { "cell_type": "markdown", "id": "ded11ac4", "metadata": {}, "source": [ "With opendp, the function `opendp.cost` is particularly useful for estimating the `epsilon` and `delta` that a pipeline will spend for a given `scale` value." ] }, { "cell_type": "code", "execution_count": 30, "id": "7ae7f735", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=0.7122093023265228, delta=0.0)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost_res = client.opendp.cost(\n", " opendp_pipeline = var_bill_length_measurement_pipeline\n", ")\n", "cost_res" ] }, { "cell_type": "markdown", "id": "5e85bff9-1b40-4ef0-8fa4-3e504ebb916d", "metadata": {}, "source": [ "#### Execute pipeline on real dataset with opendp" ] }, { "cell_type": "markdown", "id": "1c791d36", "metadata": {}, "source": [ "She can now execute the query on the real dataset."
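, "\n", "\n", "Before doing so, the estimated cost above can be sanity-checked by hand. For a Laplace mechanism with scale $b$ applied to a statistic of sensitivity $\\Delta$, the privacy cost is $\\epsilon = \\Delta / b$. As a back-of-the-envelope sketch (not a description of OpenDP's internal accounting), assume that changing one of $n$ clamped records moves the variance by at most $(U - L)^2 / n$, with $L = 30$ and $U = 65$ the bounds from the metadata, and assume $n = 344$ for `NB_PENGUINS` (its value is defined earlier in the notebook and not repeated here):\n", "\n", "$$\\epsilon \\approx \\frac{(U - L)^2 / n}{b} = \\frac{(65 - 30)^2 / 344}{5.0} = \\frac{1225}{1720} \\approx 0.7122$$\n", "\n", "Under these assumptions the result agrees with the `epsilon` returned by the cost estimate, and `delta` is $0$ because the Laplace mechanism gives pure $\\epsilon$-differential privacy."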
] }, { "cell_type": "code", "execution_count": 31, "id": "085555a5", "metadata": {}, "outputs": [], "source": [ "var_res = client.opendp.query(\n", " opendp_pipeline = var_bill_length_measurement_pipeline, \n", ")" ] }, { "cell_type": "code", "execution_count": 32, "id": "674332e7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variance of bill length: 35.06 (from opendp query).\n" ] } ], "source": [ "var_bill_length = np.round(var_res.result.value, 2)\n", "print(f\"Variance of bill length: {var_bill_length} (from opendp query).\")" ] }, { "cell_type": "markdown", "id": "ddb058f9-42fa-4896-a38f-3bf67fc0b2fb", "metadata": {}, "source": [ "#### Postprocessing: no additional privacy risk with DP" ] }, { "cell_type": "markdown", "id": "367081be-1159-45d8-9129-88fba20fb697", "metadata": {}, "source": [ "She can now do all the postprocessing that she wants with the returned data without adding any privacy risk. " ] }, { "cell_type": "code", "execution_count": 33, "id": "f72b19d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standard error of bill length: 0.32.\n" ] } ], "source": [ "# Get standard error\n", "standard_error = np.sqrt(var_bill_length / NB_PENGUINS)\n", "print(f\"Standard error of bill length: {np.round(standard_error, 2)}.\")" ] }, { "cell_type": "code", "execution_count": 34, "id": "62630a03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The 95% confidence interval of the bill length of all penguins is [42.89, 44.15].\n" ] } ], "source": [ " # Compute the 95% confidence interval\n", "ZSCORE = 1.96\n", "lower_bound = np.round(avg_bill_length - ZSCORE * standard_error, 2)\n", "upper_bound = np.round(avg_bill_length + ZSCORE * standard_error, 2)\n", "print(f\"The 95% confidence interval of the bill length of all penguins is [{lower_bound}, {upper_bound}].\")" ] }, { "cell_type": "markdown", "id": "668a9790-4d73-45d0-a3d1-55f979f97cb0", 
"metadata": {}, "source": [ "### c. Train a DP Machine Learning model with DiffPrivLib" ] }, { "cell_type": "code", "execution_count": 35, "id": "9cbcd0cf-4211-4a55-aa71-0679e7b2fa63", "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from diffprivlib import models\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "780e56d9-7395-4396-bd2c-634e9bb82c63", "metadata": {}, "source": [ "She now wants a model to predict the species of a penguin based on its bill length, bill depth, flipper length and body mass. To do so, she uses a Random Forest classifier from the DiffPrivLib library." ] }, { "cell_type": "markdown", "id": "cf20b748-8d4b-4f78-8d11-7f47ada6dd5f", "metadata": {}, "source": [ "#### Prepare Random Forest Classifier pipeline on dummy with DiffPrivLib" ] }, { "cell_type": "code", "execution_count": 36, "id": "b91b694d-2256-4c43-ac4f-091c6afb290a", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']\n", "target_columns = ['species']" ] }, { "cell_type": "code", "execution_count": 37, "id": "0bf7ea1f-873c-4068-ae8b-edee16316a08", "metadata": {}, "outputs": [], "source": [ "def get_bounds(cols_metadata, columns):\n", " lower = [cols_metadata[col][\"lower\"] for col in columns]\n", " upper = [cols_metadata[col][\"upper\"] for col in columns]\n", " return (lower, upper)" ] }, { "cell_type": "code", "execution_count": 38, "id": "869d409c-1ee9-4eca-8189-976f844de284", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)\n", "bounds" ] }, { "cell_type": "code", "execution_count": 39, "id": "6114c5f4-f8b1-4a8c-9770-e2a0ed7f180d", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('rf', 
models.RandomForestClassifier(\n", "
n_estimators=10, \n", " epsilon = 2.0, \n", " bounds=bounds, \n", " classes=['Adelie', 'Chinstrap', 'Gentoo'])\n", " ),\n", "])" ] }, { "cell_type": "code", "execution_count": 40, "id": "389f3ca5-66c0-41d2-86bd-e7131bbe9184", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('rf',\n",
       "                 RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n",
       "                                        bounds=(array([  30.,   13.,  150., 2000.]),\n",
       "                                                array([  65.,   23.,  250., 7000.])),\n",
       "                                        classes=['Adelie', 'Chinstrap',\n",
       "                                                 'Gentoo'],\n",
       "                                        epsilon=2.0))])
" ], "text/plain": [ "Pipeline(steps=[('rf',\n", " RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " classes=['Adelie', 'Chinstrap',\n", " 'Gentoo'],\n", " epsilon=2.0))])" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.2,\n", " test_train_split_seed = 1,\n", " dummy = True\n", ")\n", "model = dummy_response.result.model\n", "model" ] }, { "cell_type": "markdown", "id": "1875feea-17d0-4015-bfd0-de85200ee62c", "metadata": {}, "source": [ "#### Estimate cost of the Random Forest classifier with DiffPrivLib" ] }, { "cell_type": "code", "execution_count": 41, "id": "e9f29610-52fc-4f0e-84fd-8d85cf52eea4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=2.0, delta=0.0)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost_res = client.diffprivlib.cost(\n", " dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.2,\n", " test_train_split_seed = 1\n", ")\n", "cost_res" ] }, { "cell_type": "markdown", "id": "3012d12c-c304-4bb2-bac1-18b55094ec07", "metadata": {}, "source": [ "#### Train Random Forest classifier on sensitive data with DiffPrivLib" ] }, { "cell_type": "code", "execution_count": 42, "id": "62538ac0-c6aa-4950-82e8-f53510f17d77", "metadata": {}, "outputs": [], "source": [ "response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.1,\n", " test_train_split_seed = 1,\n", " dummy = False\n", ")" ] }, { "cell_type": "code", "execution_count": 43, "id": "7727b13a-cd2f-4f97-b550-88a4360fd601", 
"metadata": {}, "outputs": [], "source": [ "# Return the mean accuracy. \n", "model_score = response.result.score" ] }, { "cell_type": "code", "execution_count": 44, "id": "600d6c6e-7567-4564-92ad-d1538ac10af5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The model has a mean accuracy of 0.32. It is a harsh metric because we are in a multi-label classification case.'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"The model has a mean accuracy of {np.round(model_score, 2)}. It is a harsh metric because we are in a multi-label classification case.\"" ] }, { "cell_type": "code", "execution_count": 45, "id": "1824eea1-6ad8-4d2f-86d5-456a89318fef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('rf',\n",
       "                 RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n",
       "                                        bounds=(array([  30.,   13.,  150., 2000.]),\n",
       "                                                array([  65.,   23.,  250., 7000.])),\n",
       "                                        classes=['Adelie', 'Chinstrap',\n",
       "                                                 'Gentoo'],\n",
       "                                        epsilon=2.0))])
" ], "text/plain": [ "Pipeline(steps=[('rf',\n", " RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " classes=['Adelie', 'Chinstrap',\n", " 'Gentoo'],\n", " epsilon=2.0))])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = response.result.model\n", "model" ] }, { "cell_type": "code", "execution_count": 46, "id": "1e4a56bd-95bf-4355-81a6-eaf7d16f69c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'For these feature values, the predicted species is Adelie.'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [30.0], 'bill_depth_mm': [20.0], 'flipper_length_mm': [170.0], 'body_mass_g': [5000.0]\n", "})\n", "predictions = model.predict(x_to_predict)[0]\n", "f\"For these feature values, the predicted species is {predictions}.\"" ] }, { "cell_type": "markdown", "id": "09cb0eff-9864-42a3-a87a-cef2f7eca216", "metadata": {}, "source": [ "### d. Get a Synthetic Dataset with Smartnoise-Synth" ] }, { "cell_type": "markdown", "id": "790fe272-ae04-4daf-933e-6a9d86ac691f", "metadata": {}, "source": [ "Finally, she gets a synthetic dataset to do the rest of her analysis. She chooses to train on a subset of only 3 columns: \"island\", \"bill_length_mm\" and \"bill_depth_mm\", but she could also train on the whole dataset if she wanted.\n", "She also decides to use the `patectgan` synthesizer and keeps all other parameters at their default values."
] }, { "cell_type": "markdown", "id": "ded384a2-dcf6-43af-98b0-ae058b2a03ba", "metadata": {}, "source": [ "#### Train patectgan synthesizer on dummy data with Smartnoise-Synth" ] }, { "cell_type": "code", "execution_count": 47, "id": "ed292cec-8497-4e5b-b3bf-cca9227abf7d", "metadata": {}, "outputs": [], "source": [ "res_dummy = client.smartnoise_synth.query(\n", " synth_name=\"patectgan\",\n", " select_cols = [\"island\", \"bill_length_mm\", \"bill_depth_mm\"],\n", " epsilon=1.0,\n", " dummy=True,\n", ")\n", "dummy_synth_df = res_dummy.result.df_samples" ] }, { "cell_type": "code", "execution_count": 48, "id": "6fab49d9-f5c5-437e-b70f-234b84c2f7e5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandbill_length_mmbill_depth_mm
0Dream59.18928318.603612
1Dream45.89374617.438823
2Dream48.17508816.088652
3Dream57.95833817.529872
4Dream51.72294319.493872
\n", "
" ], "text/plain": [ " island bill_length_mm bill_depth_mm\n", "0 Dream 59.189283 18.603612\n", "1 Dream 45.893746 17.438823\n", "2 Dream 48.175088 16.088652\n", "3 Dream 57.958338 17.529872\n", "4 Dream 51.722943 19.493872" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_synth_df.head()" ] }, { "cell_type": "markdown", "id": "f75ef712-3d58-4548-904e-6f1da72493f7", "metadata": {}, "source": [ "#### Estimate cost of training patectgan synthesizer with Smartnoise-Synth" ] }, { "cell_type": "code", "execution_count": 49, "id": "82ab4aee-b4af-4fbd-93cc-b8171f7f9a52", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.0, delta=0.00015673368198174188)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res_cost = client.smartnoise_synth.cost(\n", " synth_name=\"patectgan\",\n", " epsilon=1.0,\n", " select_cols = [\"island\", \"bill_length_mm\", \"bill_depth_mm\"],\n", ")\n", "res_cost" ] }, { "cell_type": "markdown", "id": "804e2119-1513-45b2-a989-f5566e6c7bf4", "metadata": {}, "source": [ "#### Train patectgan synthesizer on private data with Smartnoise-Synth" ] }, { "cell_type": "code", "execution_count": 50, "id": "fbc8b354-e4db-4472-957b-468e768eddc4", "metadata": {}, "outputs": [], "source": [ "res = client.smartnoise_synth.query(\n", " synth_name=\"patectgan\",\n", " select_cols = [\"island\", \"bill_length_mm\", \"bill_depth_mm\"],\n", " epsilon=1.0,\n", " dummy=False,\n", ")\n", "synth_df = res.result.df_samples" ] }, { "cell_type": "code", "execution_count": 51, "id": "550fa89d-9537-4daf-9f96-42fa71f242b9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
islandbill_length_mmbill_depth_mm
0Dream52.05349315.699640
1Dream47.70572817.678879
2Torgersen56.87572716.196799
3Biscoe38.80793719.253387
4Dream46.33247715.361980
\n", "
" ], "text/plain": [ " island bill_length_mm bill_depth_mm\n", "0 Dream 52.053493 15.699640\n", "1 Dream 47.705728 17.678879\n", "2 Torgersen 56.875727 16.196799\n", "3 Biscoe 38.807937 19.253387\n", "4 Dream 46.332477 15.361980" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth_df.head()" ] }, { "cell_type": "markdown", "id": "b3ae2259-3d83-46c6-a59c-9df178d98dc5", "metadata": {}, "source": [ "Out of curiosity, she checks the average bill length and variance of bill length on this dataset." ] }, { "cell_type": "code", "execution_count": 52, "id": "7d5a336e-80f0-48fa-84a3-33d0e51a2d3b", "metadata": {}, "outputs": [], "source": [ "synth_mean = np.round(synth_df[\"bill_length_mm\"].mean(), 2)\n", "synth_variance = np.round(synth_df[\"bill_length_mm\"].var(), 2)" ] }, { "cell_type": "code", "execution_count": 53, "id": "e890a8d9-0c7b-4805-be8a-81e66d5fa7ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The average with Smartnoise-SQL on private data was 43.52.\n", "The average with Smartnoise-Synth on synthetic data is 44.53.\n" ] } ], "source": [ "print(\n", " f\"The average with Smartnoise-SQL on private data was {avg_bill_length}.\\n\"\n", " + f\"The average with Smartnoise-Synth on synthetic data is {synth_mean}.\"\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "id": "11a14d9f-0fe3-4a5e-b425-d2192acd1e84", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The variance with opendp on private data was 35.06.\n", "The variance with Smartnoise-Synth on synthetic data is 36.25.\n" ] } ], "source": [ "print(\n", " f\"The variance with opendp on private data was {var_bill_length}.\\n\"\n", " + f\"The variance with Smartnoise-Synth on synthetic data is {synth_variance}.\"\n", ")" ] }, { "cell_type": "markdown", "id": "94eaf59b-c108-424c-8978-b1c86e141ccb", "metadata": {}, "source": [ "## Step 4: See archives of queries" ] }, { 
"cell_type": "markdown", "id": "64003c53-de56-4bdc-a3c2-0c3e40031919", "metadata": {}, "source": [ "She now wants to verify all the queries that she did on the real data. It is possible because an archive of all queries is kept in a secure database. With a function call she can see her queries, budget and associated responses." ] }, { "cell_type": "code", "execution_count": 55, "id": "008fd230-cdfd-4e03-91ce-5a60b06c106d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "previous_queries = client.get_previous_queries()\n", "len(previous_queries)" ] }, { "cell_type": "code", "execution_count": 56, "id": "b712b269-64f2-4c7e-b8bf-d1a608933eff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'smartnoise_sql',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'query_str': 'SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df',\n", " 'epsilon': 0.5,\n", " 'delta': 0.0001,\n", " 'mechanisms': {},\n", " 'postprocess': True},\n", " 'response': {'epsilon': 1.0,\n", " 'delta': 4.999999999999449e-05,\n", " 'requested_by': 'Dr. Antartica',\n", " 'result': {'res_type': 'smartnoise_sql',\n", " 'df': {'index': [0],\n", " 'columns': ['avg_bill_length_mm'],\n", " 'data': [[43.75587056284081]],\n", " 'index_names': [None],\n", " 'column_names': [None]}}},\n", " 'timestamp': 1732024136.80644}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Smartnoise-SQL\n", "avg_bill_length_query = previous_queries[0]\n", "avg_bill_length_query" ] }, { "cell_type": "code", "execution_count": 57, "id": "8dfaf2b6-2b6c-480b-bcd7-250b0b2806a6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. 
Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'smartnoise_sql',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'query_str': 'SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM df',\n", " 'epsilon': 0.5,\n", " 'delta': 0.0001,\n", " 'mechanisms': {},\n", " 'postprocess': True},\n", " 'response': {'epsilon': 1.0,\n", " 'delta': 4.999999999999449e-05,\n", " 'requested_by': 'Dr. Antartica',\n", " 'result': {'res_type': 'smartnoise_sql',\n", " 'df': {'index': [0],\n", " 'columns': ['avg_bill_length_mm'],\n", " 'data': [[43.52246114623609]],\n", " 'index_names': [None],\n", " 'column_names': [None]}}},\n", " 'timestamp': 1732024185.9737833}" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# OpenDP\n", "var_bill_length_query = previous_queries[1]\n", "var_bill_length_query" ] }, { "cell_type": "code", "execution_count": 58, "id": "376315ec-6f38-4919-959e-d6bf244a4952", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'opendp',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'opendp_json': Measurement(\n", " input_domain = AtomDomain(T=String),\n", " input_metric = SymmetricDistance(),\n", " output_measure = MaxDivergence(f64)),\n", " 'fixed_delta': None},\n", " 'response': {'epsilon': 0.7122093023265228,\n", " 'delta': 0.0,\n", " 'requested_by': 'Dr. Antartica',\n", " 'result': {'res_type': 'opendp', 'value': 35.063144457712596}},\n", " 'timestamp': 1732024233.7957816}" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DiffPrivLib\n", "reg_bill_length_query = previous_queries[2]\n", "reg_bill_length_query" ] }, { "cell_type": "code", "execution_count": 59, "id": "638817e0-d88f-407a-8136-210309651cf2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. 
Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'diffprivlib',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'diffprivlib_json': '{\"module\": \"diffprivlib\", \"version\": \"0.6.5\", \"pipeline\": [{\"type\": \"_dpl_type:RandomForestClassifier\", \"name\": \"rf\", \"params\": {\"n_estimators\": 10, \"n_jobs\": 1, \"random_state\": null, \"verbose\": 0, \"warm_start\": false, \"max_depth\": 5, \"epsilon\": 2.0, \"bounds\": {\"_tuple\": true, \"_items\": [[30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0]]}, \"classes\": [\"Adelie\", \"Chinstrap\", \"Gentoo\"], \"shuffle\": false, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}]}',\n", " 'feature_columns': ['bill_length_mm',\n", " 'bill_depth_mm',\n", " 'flipper_length_mm',\n", " 'body_mass_g'],\n", " 'target_columns': ['species'],\n", " 'test_size': 0.1,\n", " 'test_train_split_seed': 1,\n", " 'imputer_strategy': 'drop'},\n", " 'response': {'epsilon': 2.0,\n", " 'delta': 0.0,\n", " 'requested_by': 'Dr. 
Antartica',\n", " 'result': {'res_type': 'diffprivlib',\n", " 'score': 0.3235294117647059,\n", " 'model': Pipeline(steps=[('rf',\n", " RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " classes=['Adelie', 'Chinstrap',\n", " 'Gentoo'],\n", " epsilon=2.0))])}},\n", " 'timestamp': 1732024235.0577705}" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DiffPrivLib\n", "diffprivlib_query = previous_queries[3]\n", "diffprivlib_query" ] }, { "cell_type": "code", "execution_count": null, "id": "416d3afc-6455-4252-b883-2b6984233513", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "70cafc60-9ca5-46ca-83d9-2077a22a53dd", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }