{ "cells": [ { "cell_type": "markdown", "id": "3f18d338", "metadata": {}, "source": [ "# Lomas Client Side: Using DiffPrivLib" ] }, { "cell_type": "markdown", "id": "1582a2ae", "metadata": {}, "source": [ "This notebook shows how a researcher can use the lomas platform with DiffPrivLib. It walks through the functionalities that the `lomas-client` library provides to interact with the lomas server.\n", "\n", "Researchers never see the secure data. They can only obtain differentially private responses by sending queries to the server.\n", "\n", "Each user has access to one or more projects and, for each dataset, has a limited privacy budget expressed as $\\epsilon$ and $\\delta$ values." ] }, { "cell_type": "markdown", "id": "5b73135c", "metadata": {}, "source": [ "In this notebook, the researcher is a penguin researcher named Dr. Antartica. She aims to do groundbreaking research on various penguin data." ] }, { "cell_type": "markdown", "id": "01ae30d2", "metadata": {}, "source": [ "## Step 1: Install the library\n", "To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the `lomas-client` library in her local development environment. 
\n", "\n", "It can be installed via the pip command:" ] }, { "cell_type": "code", "execution_count": 1, "id": "8b340f40-32c9-487b-bc0c-a76593d43980", "metadata": {}, "outputs": [], "source": [ "# !pip install lomas_client" ] }, { "cell_type": "markdown", "id": "46c4f70b-1491-4162-930c-e0a86406ba69", "metadata": {}, "source": [ "Or use a local version of the client:" ] }, { "cell_type": "code", "execution_count": 2, "id": "36d508bf-6cc3-4034-8e11-fffe858552f9", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "sys.path.append(os.path.abspath(os.path.join('..')))" ] }, { "cell_type": "code", "execution_count": 3, "id": "9535e92e-620e-4df4-92dd-4ea2c653e4ab", "metadata": {}, "outputs": [], "source": [ "from lomas_client import Client\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "9c63718b", "metadata": {}, "source": [ "## Step 2: Initialise the client\n", "\n", "Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment. It enables a seamless interaction with the server.\n", "\n", "To create the client, Dr. Antartica needs to give it a few parameters:\n", "- url: the root application endpoint of the remote secure server.\n", "- user_name: her name as registered in the database (Dr. Antartica)\n", "- dataset_name: the name of the dataset that she wants to query (PENGUIN)\n", "\n", "She will only be able to query the real dataset if Queen Icergina has previously created an account for her in the database, given her access to the PENGUIN dataset and granted her some epsilon and delta budget (as is done in the Admin Notebook for Users and Datasets management)." ] }, { "cell_type": "code", "execution_count": 4, "id": "941991f7", "metadata": {}, "outputs": [], "source": [ "APP_URL = \"http://lomas_server\"\n", "USER_NAME = \"Dr. 
Antartica\"\n", "DATASET_NAME = \"PENGUIN\"\n", "client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)" ] }, { "cell_type": "markdown", "id": "0ec400c8", "metadata": {}, "source": [ "And that's it for the preparation. She is now ready to use the various functionalities offered by `lomas-client`." ] }, { "cell_type": "markdown", "id": "9b9a5f13", "metadata": {}, "source": [ "## Step 3: Metadata and dummy dataset" ] }, { "cell_type": "markdown", "id": "c7cb5531", "metadata": {}, "source": [ "### Getting dataset metadata\n", "\n", "Dr. Antartica has never seen the data. As a first step to understand what is available to her, she would like to check the metadata of the dataset. To do so, she calls the `get_dataset_metadata()` function of the client. As this is public information, it does not cost any budget." ] }, { "cell_type": "code", "execution_count": 5, "id": "0fdebac9-57fc-4410-878b-5a77425af634", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'max_ids': 1,\n", " 'rows': 344,\n", " 'row_privacy': True,\n", " 'censor_dims': False,\n", " 'columns': {'species': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Adelie', 'Chinstrap', 'Gentoo']},\n", " 'island': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 3,\n", " 'categories': ['Torgersen', 'Biscoe', 'Dream']},\n", " 'bill_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 30.0,\n", " 'upper': 65.0},\n", " 'bill_depth_mm': 
{'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 13.0,\n", " 'upper': 23.0},\n", " 'flipper_length_mm': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 150.0,\n", " 'upper': 250.0},\n", " 'body_mass_g': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'float',\n", " 'precision': 64,\n", " 'lower': 2000.0,\n", " 'upper': 7000.0},\n", " 'sex': {'private_id': False,\n", " 'nullable': False,\n", " 'max_partition_length': None,\n", " 'max_influenced_partitions': None,\n", " 'max_partition_contributions': None,\n", " 'type': 'string',\n", " 'cardinality': 2,\n", " 'categories': ['MALE', 'FEMALE']}}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "penguin_metadata = client.get_dataset_metadata()\n", "penguin_metadata" ] }, { "cell_type": "markdown", "id": "9e7ca7ae-bf17-40c8-aa75-2d72fcdd3088", "metadata": {}, "source": [ "## Step 4: Train Logistic Regression model with DiffPrivLib" ] }, { "cell_type": "markdown", "id": "2de1389c-53a7-4098-bc3c-397c12a4b869", "metadata": {}, "source": [ "We want to train an ML model to guess the species of penguins based on their bill length and depth, flipper length and body mass.\n", "\n", "Therefore, we use a DiffPrivLib pipeline which:\n", "- standard scales the dimensions between the metadata bounds\n", "- and then performs a logistic regression\n", "to predict the species of penguins." 
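, "\n", "\n", "As a sketch of the first step, in standard notation rather than anything taken from the lomas documentation, standard scaling maps each feature value $x_j$ to\n", "\n", "$$z_j = \\frac{x_j - \\hat{\\mu}_j}{\\hat{\\sigma}_j},$$\n", "\n", "where $\\hat{\\mu}_j$ and $\\hat{\\sigma}_j$ are differentially private estimates of the column mean and standard deviation. The metadata bounds are assumed here to serve only to clip the values used for these estimates, so that their sensitivity is known without looking at the data; the scaled values themselves are not forced into the bounds."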
] }, { "cell_type": "code", "execution_count": 6, "id": "2864729f-2ce4-4d81-a446-8e3f2c1493b3", "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from diffprivlib import models\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "a06365e9-4076-4592-871a-31af91d6a05d", "metadata": {}, "source": [ "### Classification: Logistic Regression" ] }, { "cell_type": "markdown", "id": "ea567662-6518-4c10-bd87-0fb6028db263", "metadata": {}, "source": [ "Dr. Antartica wants to do a logistic regression on the feature columns 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm' and 'body_mass_g' to predict penguin species." ] }, { "cell_type": "code", "execution_count": 7, "id": "bda4884f-bce2-43b3-875e-dbb135492e79", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']\n", "target_columns = ['species']" ] }, { "cell_type": "markdown", "id": "2fe9db5e-9c57-41f3-a444-c9f67100ba81", "metadata": {}, "source": [ "#### She starts to write the associated DiffPrivLib pipeline and tries it on the dummy dataset." ] }, { "cell_type": "markdown", "id": "eead3541-66e5-4c0f-aa5b-7b97821afe39", "metadata": {}, "source": [ "If a pipeline triggers a DiffprivlibCompatibilityWarning in the DiffPrivLib library, the warning is raised the first time (as in DiffPrivLib) and the 'wrong' parameters are then ignored by the server." ] }, { "cell_type": "code", "execution_count": 8, "id": "804f31cd-f277-47d4-9648-a51872eccf29", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.12/site-packages/diffprivlib/utils.py:71: DiffprivlibCompatibilityWarning: Parameter 'svd_solver' is not functional in diffprivlib. Remove this parameter to suppress this warning.\n", " warnings.warn(f\"Parameter '{arg}' is not functional in diffprivlib. 
Remove this parameter to suppress this \"\n" ] } ], "source": [ "# A DiffprivlibCompatibilityWarning is expected here\n", "dpl_pipeline = Pipeline([\n", " ('scaler', models.StandardScaler(epsilon = 0.5)),\n", " ('classifier', models.LogisticRegression(epsilon = 1.0, svd_solver='full'))\n", "])" ] }, { "cell_type": "markdown", "id": "15dd0c42-3e20-4b27-95b3-9b55622b4bfd", "metadata": {}, "source": [ "To resolve the DiffprivlibCompatibilityWarning, the `svd_solver` parameter should not be set, as it is not functional in DiffPrivLib. If the user ignores these warnings, the default behaviour of DiffPrivLib is applied." ] }, { "cell_type": "markdown", "id": "9c5d7fa4-cfe6-4a4f-88ff-8e88ca31dfed", "metadata": {}, "source": [ "If a PrivacyLeakWarning is encountered, the server does not process the query and returns an error." ] }, { "cell_type": "code", "execution_count": 9, "id": "f1c9cffb-8327-400d-ab9d-35c5450fd4d6", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('scaler', models.StandardScaler(epsilon = 0.5)),\n", " ('classifier', models.LogisticRegression(epsilon = 1.0))\n", "])" ] }, { "cell_type": "code", "execution_count": 10, "id": "db1ddfbc-de3e-43fe-9958-49f9e6dad89f", "metadata": {}, "outputs": [ { "ename": "ExternalLibraryException", "evalue": "('diffprivlib', \"PrivacyLeakWarning: Bounds parameter hasn't been specified, so falling back to determining bounds from the data.\\n This will result in additional privacy leakage. To ensure differential privacy with no additional privacy loss, specify `bounds` for each valued returned by np.mean().. 
Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.\")", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mExternalLibraryException\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[10], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Expect PrivacyLeakWarning Error\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m dummy_response \u001b[38;5;241m=\u001b[39m \u001b[43mclient\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdiffprivlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mpipeline\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mdpl_pipeline\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43mfeature_columns\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mfeature_columns\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget_columns\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mtarget_columns\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 7\u001b[0m \u001b[43m)\u001b[49m\n", "File \u001b[0;32m/code/lomas_client/libraries/diffprivlib.py:153\u001b[0m, in \u001b[0;36mDiffPrivLibClient.query\u001b[0;34m(self, pipeline, feature_columns, target_columns, test_size, test_train_split_seed, imputer_strategy, dummy, nb_rows, seed)\u001b[0m\n\u001b[1;32m 150\u001b[0m r_model \u001b[38;5;241m=\u001b[39m QueryResponse\u001b[38;5;241m.\u001b[39mmodel_validate_json(data)\n\u001b[1;32m 151\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
r_model\n\u001b[0;32m--> 153\u001b[0m \u001b[43mraise_error\u001b[49m\u001b[43m(\u001b[49m\u001b[43mres\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 154\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n", "File \u001b[0;32m/code/lomas_client/utils.py:38\u001b[0m, in \u001b[0;36mraise_error\u001b[0;34m(response)\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mInvalidQueryException\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[1;32m 37\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;241m==\u001b[39m status\u001b[38;5;241m.\u001b[39mHTTP_422_UNPROCESSABLE_ENTITY:\n\u001b[0;32m---> 38\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(\n\u001b[1;32m 39\u001b[0m error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mlibrary\u001b[39m\u001b[38;5;124m\"\u001b[39m], error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExternalLibraryException\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 40\u001b[0m )\n\u001b[1;32m 41\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;241m==\u001b[39m status\u001b[38;5;241m.\u001b[39mHTTP_403_FORBIDDEN:\n\u001b[1;32m 42\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mUnauthorizedAccessException\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n", "\u001b[0;31mExternalLibraryException\u001b[0m: ('diffprivlib', \"PrivacyLeakWarning: Bounds parameter hasn't been specified, so falling back to determining bounds from the data.\\n This will result in additional privacy leakage. To ensure differential privacy with no additional privacy loss, specify `bounds` for each valued returned by np.mean().. 
Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.\")" ] } ], "source": [ "# Expect PrivacyLeakWarning Error\n", "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " dummy = True\n", ")" ] }, { "cell_type": "markdown", "id": "c9d58006-89d5-4110-9657-641256bafaf9", "metadata": {}, "source": [ "DiffPrivLib requests that raise a **PrivacyLeakWarning** are not processed by the server. \n", "In lomas, the bounds must always be specified. For most models, it is best to **use the standard scaler as a first step** and set its bounds from the metadata values." ] }, { "cell_type": "code", "execution_count": 11, "id": "87384eb3-6b6c-44f8-b653-af471b234a2d", "metadata": {}, "outputs": [], "source": [ "def get_bounds(cols_metadata, columns):\n", " lower = [cols_metadata[col][\"lower\"] for col in columns]\n", " upper = [cols_metadata[col][\"upper\"] for col in columns]\n", " return (lower, upper)" ] }, { "cell_type": "code", "execution_count": 12, "id": "12d1faa1-f88a-49bb-911a-83f879ca10b6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)\n", "bounds" ] }, { "cell_type": "code", "execution_count": 13, "id": "1ca11e06-0b1e-4238-a5c1-1e52a6569431", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),\n", " ('classifier', models.LogisticRegression(epsilon = 1.0))\n", "])" ] }, { "cell_type": "code", "execution_count": 14, "id": "235d7da6-f6bd-4e71-9b07-84cbe283b74e", "metadata": {}, "outputs": [ { "ename": "ExternalLibraryException", "evalue": "('diffprivlib', 'PrivacyLeakWarning: Data norm has not been specified 
and will be calculated on the data provided. This will result in additional privacy leakage. To ensure differential privacy and no additional privacy leakage, specify `data_norm` at initialisation.. Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.')", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mExternalLibraryException\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[14], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Expect PrivacyLeakWarning Error\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m dummy_response \u001b[38;5;241m=\u001b[39m \u001b[43mclient\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mdiffprivlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mquery\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3\u001b[0m \u001b[43m \u001b[49m\u001b[43mpipeline\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mdpl_pipeline\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 4\u001b[0m \u001b[43m \u001b[49m\u001b[43mfeature_columns\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mfeature_columns\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 5\u001b[0m \u001b[43m \u001b[49m\u001b[43mtarget_columns\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mtarget_columns\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 6\u001b[0m \u001b[43m \u001b[49m\u001b[43mdummy\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\n\u001b[1;32m 7\u001b[0m \u001b[43m)\u001b[49m\n", "File \u001b[0;32m/code/lomas_client/libraries/diffprivlib.py:153\u001b[0m, in \u001b[0;36mDiffPrivLibClient.query\u001b[0;34m(self, pipeline, feature_columns, target_columns, test_size, test_train_split_seed, imputer_strategy, dummy, nb_rows, 
seed)\u001b[0m\n\u001b[1;32m 150\u001b[0m r_model \u001b[38;5;241m=\u001b[39m QueryResponse\u001b[38;5;241m.\u001b[39mmodel_validate_json(data)\n\u001b[1;32m 151\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m r_model\n\u001b[0;32m--> 153\u001b[0m \u001b[43mraise_error\u001b[49m\u001b[43m(\u001b[49m\u001b[43mres\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 154\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n", "File \u001b[0;32m/code/lomas_client/utils.py:38\u001b[0m, in \u001b[0;36mraise_error\u001b[0;34m(response)\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidQueryException(error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mInvalidQueryException\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[1;32m 37\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;241m==\u001b[39m status\u001b[38;5;241m.\u001b[39mHTTP_422_UNPROCESSABLE_ENTITY:\n\u001b[0;32m---> 38\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m ExternalLibraryException(\n\u001b[1;32m 39\u001b[0m error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mlibrary\u001b[39m\u001b[38;5;124m\"\u001b[39m], error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mExternalLibraryException\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 40\u001b[0m )\n\u001b[1;32m 41\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m response\u001b[38;5;241m.\u001b[39mstatus_code \u001b[38;5;241m==\u001b[39m status\u001b[38;5;241m.\u001b[39mHTTP_403_FORBIDDEN:\n\u001b[1;32m 42\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m UnauthorizedAccessException(error_message[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mUnauthorizedAccessException\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n", "\u001b[0;31mExternalLibraryException\u001b[0m: ('diffprivlib', 'PrivacyLeakWarning: Data norm has not been specified and will be calculated on the data provided. This will result in additional privacy leakage. 
To ensure differential privacy and no additional privacy leakage, specify `data_norm` at initialisation.. Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.')" ] } ], "source": [ "# Expect PrivacyLeakWarning Error\n", "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " dummy = True\n", ")" ] }, { "cell_type": "markdown", "id": "c47aec3e-15ea-476d-9dcb-5b2f4888e355", "metadata": {}, "source": [ "Again, a PrivacyLeakWarning blocks the query. For the same reason as before, `data_norm` should be computed from the metadata and passed as an argument, as explained in the error message." ] }, { "cell_type": "code", "execution_count": 15, "id": "25a72ad9-d3f3-478a-8d33-c07bedbf4f66", "metadata": {}, "outputs": [], "source": [ "# The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.\n", "data_norm = np.sqrt(np.linalg.norm(bounds[1]))" ] }, { "cell_type": "code", "execution_count": 16, "id": "828a743e-960f-4713-a6ed-a0bd243a13e3", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),\n", " ('classifier', models.LogisticRegression(epsilon = 1.0, data_norm = data_norm))\n", "])" ] }, { "cell_type": "code", "execution_count": 17, "id": "c9c13dfe-0266-4cd8-b126-2accee6c1136", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " dummy = True\n", ")" ] }, { "cell_type": "markdown", "id": "2ccdb7f5-5968-47e7-99d4-16067b8645a2", "metadata": {}, "source": [ "The pipeline worked. She can check that the response contains a dummy model and an associated dummy score. 
In the case of a logistic regression, the score is the mean accuracy, as specified [here](https://diffprivlib.readthedocs.io/en/latest/modules/models.html#diffprivlib.models.LogisticRegression.score).\n", "Each model returns an associated score, which is described in the DiffPrivLib documentation under the `score` method of that model." ] }, { "cell_type": "code", "execution_count": 20, "id": "3c38f919-7ca1-455c-9b0f-bc2b56f60c0f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('scaler',\n",
       "                 StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),\n",
       "                                bounds=(array([  30.,   13.,  150., 2000.]),\n",
       "                                        array([  65.,   23.,  250., 7000.])),\n",
       "                                epsilon=0.5)),\n",
       "                ('classifier',\n",
       "                 LogisticRegression(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n",
       "                                    data_norm=83.69469642643347))])
" ], "text/plain": [ "Pipeline(steps=[('scaler',\n", " StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " epsilon=0.5)),\n", " ('classifier',\n", " LogisticRegression(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n", " data_norm=83.69469642643347))])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_response.result.model" ] }, { "cell_type": "markdown", "id": "6a61629d-b96f-4114-8bc2-eccf34d69d2b", "metadata": {}, "source": [ "Now that the pipeline seems to work, she also wants to choose another data imputation method: by default, rows with missing data are dropped, but she wants to replace the missing values with the mean. To do so, she uses the `imputer_strategy` argument." ] }, { "cell_type": "code", "execution_count": 21, "id": "547935ca-0932-4476-8ace-3644c7e0a08d", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " imputer_strategy = \"mean\",\n", " dummy = True\n", ")" ] }, { "cell_type": "markdown", "id": "4587385a-b76e-45f4-a1d9-af0a88a1181f", "metadata": {}, "source": [ "It also works. If she wanted, she could instead replace missing values with the median using `imputer_strategy = \"median\"` or with the most frequent value using `imputer_strategy = \"most_frequent\"` (most_frequent makes more sense in the case of categorical columns). " ] }, { "cell_type": "markdown", "id": "381cd096-bde3-474b-b24d-7fcb4c6ae49f", "metadata": {}, "source": [ "Finally, she wants to use as much data as possible to train the model, so she decides to reduce `test_size` to 0.1 (meaning that 10% of the data will be used as the test set and 90% as the training set). She also changes `test_train_split_seed`, the seed for the random split between training and testing data, because why not. 
By default `test_size = 0.2` and `test_train_split_seed = 1`." ] }, { "cell_type": "code", "execution_count": 22, "id": "73e8b9ab-2d93-4333-ace3-c94c714410dc", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.1,\n", " test_train_split_seed = 4,\n", " imputer_strategy = \"mean\",\n", " dummy = True\n", ")" ] }, { "cell_type": "markdown", "id": "4f9c2330-15f3-46f9-94c1-d62ea350e2e8", "metadata": {}, "source": [ "#### She can now estimate the cost of this pipeline" ] }, { "cell_type": "code", "execution_count": 23, "id": "e036b55f-6c3d-4212-8d91-9ece8223cf69", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.5, delta=0.0)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = client.diffprivlib.cost(\n", " dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.1,\n", " test_train_split_seed = 4,\n", " imputer_strategy = \"mean\",\n", ")\n", "res" ] }, { "cell_type": "code", "execution_count": 25, "id": "4398755f-348f-47e2-9329-80c34897e16b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The cost will be 1.5 epsilon and 0.0 delta.'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"The cost will be {res.epsilon} epsilon and {res.delta} delta.\"" ] }, { "cell_type": "markdown", "id": "8aad5f86-20f7-42fe-8efd-56c96f6beb41", "metadata": {}, "source": [ "Now we train the same pipeline on the real dataset."
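, "\n", "\n", "The estimate above is consistent with basic sequential composition, assuming the server simply adds up the budgets declared in the pipeline steps (an assumption made here for illustration, not something stated in the lomas documentation):\n", "\n", "$$\\epsilon_{\\text{total}} = \\epsilon_{\\text{scaler}} + \\epsilon_{\\text{classifier}} = 0.5 + 1.0 = 1.5, \\qquad \\delta_{\\text{total}} = 0.$$\n", "\n", "Running the query without `dummy = True` is expected to deduct this amount from her budget on the PENGUIN dataset."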
] }, { "cell_type": "code", "execution_count": 26, "id": "b823db85-d993-4a36-9dea-bf5989c543f8", "metadata": {}, "outputs": [], "source": [ "res = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.1,\n", " test_train_split_seed = 4,\n", " imputer_strategy = \"mean\",\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "id": "6c071425-5364-4d79-8ccf-4f46018ac849", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'The accuracy score of the model trained on real data is 0.6.'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f\"The accuracy score of the model trained on real data is {res.result.score}.\"" ] }, { "cell_type": "markdown", "id": "9cea3fc7-4331-4387-a19f-6afcaa21b6bc", "metadata": {}, "source": [ "The trained model, with its fitted parameters, is also available:" ] }, { "cell_type": "code", "execution_count": 29, "id": "b646cd08-504b-4846-88ba-a3557ee4b2fd", "metadata": {}, "outputs": [], "source": [ "model = res.result.model" ] }, { "cell_type": "markdown", "id": "a581f47d-f1c7-4121-9f8f-e496e9023d58", "metadata": {}, "source": [ "We predict the species of the smallest possible penguin in all dimensions versus the biggest possible penguin in all dimensions." ] }, { "cell_type": "code", "execution_count": 30, "id": "65413ef0-317b-431d-9ed8-6ffabe85c1b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>predictions</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>30.0</td><td>13.0</td><td>150.0</td><td>2000.0</td><td>Adelie</td></tr>\n", "<tr><th>1</th><td>65.0</td><td>23.0</td><td>250.0</td><td>7000.0</td><td>Gentoo</td></tr>\n", "</tbody>\n", "</table>" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g predictions\n", "0 30.0 13.0 150.0 2000.0 Adelie\n", "1 65.0 23.0 250.0 7000.0 Gentoo" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bounds[0][0], bounds[1][0]], \n", " 'bill_depth_mm': [bounds[0][1], bounds[1][1]] , \n", " 'flipper_length_mm': [bounds[0][2], bounds[1][2]], \n", " 'body_mass_g': [bounds[0][3], bounds[1][3]]\n", "})\n", "\n", "predictions = model.predict(x_to_predict)\n", "x_to_predict[\"predictions\"] = predictions\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "839ba199-3014-4e39-8f0a-aecf15241b11", "metadata": {}, "source": [ "## Step 5: Train other models with DiffPrivLib" ] }, { "cell_type": "markdown", "id": "4f161f96-e240-45b2-bb60-002870196158", "metadata": {}, "source": [ "The logic is the same for all models. The `pipeline` and `feature_columns` arguments must always be specified. The `target_columns` argument must be specified except for clustering (K-Means) and dimensionality reduction (PCA).\n", "\n", "Here are examples of each on dummy dataframes." 
] }, { "cell_type": "markdown", "id": "fef35343-cdf6-4e86-a54a-4bcf6568f2ac", "metadata": {}, "source": [ "### Classification: Gaussian Naive Bayes" ] }, { "cell_type": "code", "execution_count": 31, "id": "f8a5ddc0-61bb-44c0-a14b-5ac3278ff539", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']\n", "target_columns = ['species']" ] }, { "cell_type": "code", "execution_count": 32, "id": "4f4b9789-7da1-4abe-9dd6-dd1472a4e217", "metadata": {}, "outputs": [], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)" ] }, { "cell_type": "code", "execution_count": 33, "id": "9dd37857-0ef2-44d5-b206-092262c9cea1", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),\n", " ('gaussian', models.GaussianNB(epsilon = 1.0, bounds=bounds, priors = (0.3, 0.3, 0.4))),\n", "])" ] }, { "cell_type": "code", "execution_count": 34, "id": "8e2c07c6-20f4-473c-80f6-b4b558191530", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.15,\n", " imputer_strategy = \"median\",\n", " dummy = True\n", ")" ] }, { "cell_type": "code", "execution_count": 35, "id": "550868df-5476-496e-849d-8787da9eb27e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=1.5, delta=0.0)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cost_res = client.diffprivlib.cost(\n", " dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.15,\n", " imputer_strategy = \"median\",\n", ")\n", "cost_res" ] }, { "cell_type": "code", "execution_count": 36, "id": "934a1237-68d1-4796-838e-e8c370068f5d", "metadata": {}, "outputs": [], "source": [ "response = 
client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " imputer_strategy = \"median\",\n", " test_size = 0.15,\n", ")" ] }, { "cell_type": "code", "execution_count": 37, "id": "81708812-897a-4357-af06-139220538831", "metadata": {}, "outputs": [], "source": [ "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bounds[0][0], bounds[1][0]], \n", " 'bill_depth_mm': [bounds[0][1], bounds[1][1]] , \n", " 'flipper_length_mm': [bounds[0][2], bounds[1][2]], \n", "})" ] }, { "cell_type": "code", "execution_count": 39, "id": "ec3c98d3-cb5d-4a96-9ada-ea9904326aec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", " <thead>\n", " <tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>predictions</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>13.0</td><td>150.0</td><td>Chinstrap</td></tr>\n", " <tr><th>1</th><td>65.0</td><td>23.0</td><td>250.0</td><td>Chinstrap</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm predictions\n", "0 30.0 13.0 150.0 Chinstrap\n", "1 65.0 23.0 250.0 Chinstrap" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions = response.result.model.predict(x_to_predict)\n", "x_to_predict[\"predictions\"] = predictions\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "c1224c94-d995-40f1-967d-b7afb21bce48", "metadata": {}, "source": [ "### Random Forest" ] }, { "cell_type": "code", "execution_count": 40, "id": "980f4494-4b1f-4cb0-b288-ddc672231b1f", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'body_mass_g']\n", "target_columns = ['island']" ] }, { "cell_type": "code", "execution_count": 41, "id": "e91f9a5c-14ea-4e48-8e42-fa50152767b7", "metadata": {}, "outputs": [], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)" ] }, { "cell_type": "code", "execution_count": 42, "id": "d4f036a7-0405-4f4b-a4a0-3fe1b0b8d4d0", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " (\n", " 'rf', \n", " models.RandomForestClassifier(\n", " n_estimators=10, \n", " epsilon = 2.0, \n", " bounds=bounds, \n", " classes=penguin_metadata['columns']['island']['categories']\n", " )\n", " ),\n", "])" ] }, { "cell_type": "code", "execution_count": 43, "id": "c969d488-e1a2-445a-a269-681d96343b9f", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " imputer_strategy = \"drop\", #default\n", " dummy = True\n", ")" ] }, { "cell_type": "code", "execution_count": 44, "id": "de94f0a5-6e9d-4eac-aa77-b98892ed41fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CostResponse(epsilon=2.0, delta=0.0)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"cost_res = client.diffprivlib.cost(\n", " dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " imputer_strategy = \"drop\", #default\n", ")\n", "cost_res" ] }, { "cell_type": "code", "execution_count": 45, "id": "40f25e33-a644-4938-8adb-a7bbe5306ee2", "metadata": {}, "outputs": [], "source": [ "response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " imputer_strategy = \"drop\", #default\n", ")" ] }, { "cell_type": "code", "execution_count": 46, "id": "4f5fad30-991e-45f3-a7e3-1ae04bedd8a1", "metadata": {}, "outputs": [], "source": [ "model = response.result.model" ] }, { "cell_type": "code", "execution_count": 47, "id": "96562928-73e8-48b1-b8a5-97245c7da8d2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", " <thead>\n", " <tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>body_mass_g</th><th>predictions</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>13.0</td><td>2000.0</td><td>Biscoe</td></tr>\n", " <tr><th>1</th><td>65.0</td><td>23.0</td><td>7000.0</td><td>Torgersen</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " bill_length_mm bill_depth_mm body_mass_g predictions\n", "0 30.0 13.0 2000.0 Biscoe\n", "1 65.0 23.0 7000.0 Torgersen" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bounds[0][0], bounds[1][0]], \n", " 'bill_depth_mm': [bounds[0][1], bounds[1][1]] , \n", " 'body_mass_g': [bounds[0][2], bounds[1][2]]\n", "})\n", "predictions = model.predict(x_to_predict)\n", "x_to_predict[\"predictions\"] = predictions\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "020ceeb1-7a78-4036-8ece-0afa88e49342", "metadata": {}, "source": [ "### Decision Tree Classifier" ] }, { "cell_type": "code", "execution_count": 48, "id": "c5400d29-8303-4714-8617-8a2b26b0aa2e", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'body_mass_g']\n", "target_columns = ['species']" ] }, { "cell_type": "code", "execution_count": 49, "id": "1760f475-6326-41e5-945a-66882183bc00", "metadata": {}, "outputs": [], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)" ] }, { "cell_type": "code", "execution_count": 50, "id": "22b5c5ae-b008-4980-8767-47e7a1ad3301", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " (\n", " 'dtc', \n", " models.DecisionTreeClassifier(\n", " epsilon = 2.0, \n", " bounds=bounds, \n", " classes=penguin_metadata['columns']['species']['categories']\n", " )\n", " ),\n", "])" ] }, { "cell_type": "code", "execution_count": 51, "id": "70b73ba0-67dc-4594-820c-0128916fafc8", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.2,\n", " test_train_split_seed = 1,\n", " dummy = True,\n", " nb_rows = 100,\n", " seed = 42\n", ")" ] }, { "cell_type": "code", "execution_count": 52, "id": 
"736e8283-604a-4e45-9e98-7da5cc03e1b0", "metadata": {}, "outputs": [], "source": [ "response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " test_size = 0.2,\n", ")" ] }, { "cell_type": "code", "execution_count": 53, "id": "66bfe1a9-7982-4d6b-ac0b-32a4412f29bf", "metadata": {}, "outputs": [], "source": [ "model = response.result.model" ] }, { "cell_type": "code", "execution_count": 54, "id": "a3719359-58a6-4271-9a34-4b49e0d7188e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", " <thead>\n", " <tr><th></th><th>bill_length_mm</th><th>body_mass_g</th><th>predictions</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>2000.0</td><td>Gentoo</td></tr>\n", " <tr><th>1</th><td>65.0</td><td>7000.0</td><td>Chinstrap</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " bill_length_mm body_mass_g predictions\n", "0 30.0 2000.0 Gentoo\n", "1 65.0 7000.0 Chinstrap" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bounds[0][0], bounds[1][0]], \n", " 'body_mass_g': [bounds[0][1], bounds[1][1]] , \n", "})\n", "x_to_predict[\"predictions\"] = model.predict(x_to_predict)\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "4e72e9fc-bb5b-49d9-b4e6-543bdaf93f69", "metadata": {}, "source": [ "### Regression: Linear Regression" ] }, { "cell_type": "code", "execution_count": 55, "id": "c893cbb0-26b7-4367-ab67-1ccc88f951ed", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm']\n", "target_columns = ['bill_depth_mm']" ] }, { "cell_type": "code", "execution_count": 56, "id": "c7c66418-74b3-4887-bb99-3d8b6ea7af3d", "metadata": {}, "outputs": [], "source": [ "bill_length_meta = penguin_metadata['columns']['bill_length_mm']\n", "bill_depth_meta = penguin_metadata['columns']['bill_depth_mm']" ] }, { "cell_type": "code", "execution_count": 57, "id": "907e6f6c-a3da-4f7c-8d85-b0a7b8c018d1", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " (\n", " 'lr', \n", " models.LinearRegression(\n", " epsilon = 2.0, \n", " bounds_X=(bill_length_meta['lower'], bill_length_meta['upper']), \n", " bounds_y=(bill_depth_meta['lower'], bill_depth_meta['upper'])\n", " )\n", " ),\n", "])" ] }, { "cell_type": "code", "execution_count": 58, "id": "58011a7d-7c03-4872-a032-5d93c55c1d5f", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " target_columns = target_columns,\n", " dummy = True\n", ")\n", "model = dummy_response.result.model" ] }, { "cell_type": "code", "execution_count": 59, "id": "e28b1bd0-0e73-49db-a2d3-e7f06dc39d49", "metadata": {}, "outputs": [ { "data": { "text/html": 
[ "
<div>\n", "<table>\n", " <thead>\n", " <tr><th></th><th>bill_length_mm</th><th>predictions</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>17.985419</td></tr>\n", " <tr><th>1</th><td>65.0</td><td>17.489243</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " bill_length_mm predictions\n", "0 30.0 17.985419\n", "1 65.0 17.489243" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dummy model predictions\n", "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bill_length_meta['lower'], bill_length_meta['upper']], \n", "})\n", "x_to_predict[\"predictions\"] = model.predict(x_to_predict)\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "ec9e921d-1640-4379-84e5-65b93a1ff203", "metadata": {}, "source": [ "### Clustering: K-Means" ] }, { "cell_type": "code", "execution_count": 60, "id": "2491d06f-3c40-4649-b419-ccd9b39f0764", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']" ] }, { "cell_type": "code", "execution_count": 61, "id": "10acc43d-ff2e-4d7a-9c52-361ffb80e0da", "metadata": {}, "outputs": [], "source": [ "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)" ] }, { "cell_type": "code", "execution_count": 62, "id": "1e830a01-7800-4701-aef2-11dbfc517f61", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " ('kmeans', models.KMeans(n_clusters = 8, epsilon = 2.0, bounds=bounds)),\n", "])" ] }, { "cell_type": "code", "execution_count": 63, "id": "cb5e3e1f-2d48-4d5e-992a-c0e0cb6add3b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('kmeans',\n",
       "                 KMeans(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n",
       "                        bounds=(array([  30.,   13.,  150., 2000.]),\n",
       "                                array([  65.,   23.,  250., 7000.])),\n",
       "                        epsilon=2.0))])
" ], "text/plain": [ "Pipeline(steps=[('kmeans',\n", " KMeans(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " epsilon=2.0))])" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " dummy = True\n", ")\n", "model = dummy_response.result.model\n", "model" ] }, { "cell_type": "code", "execution_count": 64, "id": "a0612b10-de7a-4355-8268-09a46c30056a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", " <thead>\n", " <tr><th></th><th>bill_length_mm</th><th>bill_depth_mm</th><th>flipper_length_mm</th><th>body_mass_g</th><th>predictions</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>13.0</td><td>150.0</td><td>2000.0</td><td>3</td></tr>\n", " <tr><th>1</th><td>65.0</td><td>23.0</td><td>250.0</td><td>7000.0</td><td>2</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " bill_length_mm bill_depth_mm flipper_length_mm body_mass_g predictions\n", "0 30.0 13.0 150.0 2000.0 3\n", "1 65.0 23.0 250.0 7000.0 2" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Dummy model predictions\n", "x_to_predict = pd.DataFrame({\n", " 'bill_length_mm': [bounds[0][0], bounds[1][0]], \n", " 'bill_depth_mm': [bounds[0][1], bounds[1][1]] , \n", " 'flipper_length_mm': [bounds[0][2], bounds[1][2]], \n", " 'body_mass_g': [bounds[0][3], bounds[1][3]]\n", "})\n", "x_to_predict[\"predictions\"] = model.predict(x_to_predict)\n", "x_to_predict" ] }, { "cell_type": "markdown", "id": "6021b034-b15d-4826-8de3-80b99afc838d", "metadata": {}, "source": [ "### Dimensionality Reduction: PCA" ] }, { "cell_type": "code", "execution_count": 65, "id": "fd8347be-5951-419d-820c-26655db5ea8c", "metadata": {}, "outputs": [], "source": [ "feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']\n", "bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)" ] }, { "cell_type": "code", "execution_count": 66, "id": "f9a4cbaf-b994-41c0-a0e9-81674c6657fb", "metadata": {}, "outputs": [], "source": [ "dpl_pipeline = Pipeline([\n", " (\n", " 'pca', \n", " models.PCA(\n", " n_components=None, \n", " epsilon = 1.0, \n", " bounds=bounds, \n", " data_norm=100, \n", " centered=False\n", " )\n", " ),\n", "])" ] }, { "cell_type": "code", "execution_count": 67, "id": "4681cc17-6285-4b9b-a26d-837d6a79e1c3", "metadata": {}, "outputs": [], "source": [ "dummy_response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = feature_columns,\n", " dummy = True\n", ")\n", "model = dummy_response.result.model" ] }, { "cell_type": "code", "execution_count": 69, "id": "e2d6912e-f97c-4ac6-9085-c7379136628c", "metadata": {}, "outputs": [], "source": [ "response = client.diffprivlib.query(\n", " pipeline = dpl_pipeline,\n", " feature_columns = 
feature_columns,\n", ")\n", "model = response.result.model" ] }, { "cell_type": "code", "execution_count": 70, "id": "e2c0cafa-49f6-48de-aa1a-4e84bbbf97cf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
PCA(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n",
       "    bounds=(array([  30.,   13.,  150., 2000.]),\n",
       "            array([  65.,   23.,  250., 7000.])),\n",
       "    data_norm=100)
" ], "text/plain": [ "PCA(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " data_norm=100)" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model = model.steps[0][1]\n", "pca_model" ] }, { "cell_type": "code", "execution_count": 71, "id": "b1ac9784-bd0e-41c6-bb54-8201d9ab13ad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.08124548, -0.11603131, -0.06907006, 0.98750455],\n", " [-0.37988112, 0.74377432, 0.54189149, 0.09404104],\n", " [-0.11526053, -0.62023805, 0.77538936, -0.0281267 ],\n", " [ 0.91422345, 0.22054765, 0.31678744, 0.12328802]])" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.components_" ] }, { "cell_type": "code", "execution_count": 72, "id": "d7cf2790-1938-48b1-83b3-3b17b06953fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([8914.48046055, 1029.95283494, 241.10575291, 94.79455338])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.explained_variance_" ] }, { "cell_type": "code", "execution_count": 73, "id": "91992588-2eb5-4b6b-9f30-ccaa272c593a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.86713922, 0.10018671, 0.02345311, 0.00922096])" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.explained_variance_ratio_" ] }, { "cell_type": "code", "execution_count": 74, "id": "9e76a632-114b-428f-ad78-d83fdcd00d56", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1536.98969484, 252.77069553, 158.4946581 , 522.43420759])" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.singular_values_" ] }, { "cell_type": "code", "execution_count": 75, "id": "fc13910b-b892-4a46-bff9-c3df4cc6ed72", "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "array([ 43.71211178, 16.5604082 , 189.09819947, 4237.9468197 ])" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.mean_" ] }, { "cell_type": "code", "execution_count": 76, "id": "bd11a051-1452-4b59-8c18-622737986125", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.n_components_" ] }, { "cell_type": "code", "execution_count": 77, "id": "282d55e9-0968-4528-b2d4-4d87d64d4c1a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pca_model.noise_variance_" ] }, { "cell_type": "markdown", "id": "94eaf59b-c108-424c-8978-b1c86e141ccb", "metadata": {}, "source": [ "## Step 6: See archives of queries" ] }, { "cell_type": "markdown", "id": "64003c53-de56-4bdc-a3c2-0c3e40031919", "metadata": {}, "source": [ "She now wants to verify all the queries that she did on the real data. It is possible because an archive of all queries is kept in a secure database. With a function call she can see her queries, budget and associated responses." ] }, { "cell_type": "code", "execution_count": 78, "id": "008fd230-cdfd-4e03-91ce-5a60b06c106d", "metadata": {}, "outputs": [], "source": [ "previous_queries = client.get_previous_queries()" ] }, { "cell_type": "code", "execution_count": 79, "id": "1795a54b-d04e-4687-8649-93982c84ad30", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. 
Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'diffprivlib',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'diffprivlib_json': '{\"module\": \"diffprivlib\", \"version\": \"0.6.4\", \"pipeline\": [{\"type\": \"_dpl_type:StandardScaler\", \"name\": \"scaler\", \"params\": {\"with_mean\": true, \"with_std\": true, \"copy\": true, \"epsilon\": 0.5, \"bounds\": {\"_tuple\": true, \"_items\": [[30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0]]}, \"random_state\": null, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}, {\"type\": \"_dpl_type:LogisticRegression\", \"name\": \"classifier\", \"params\": {\"tol\": 0.0001, \"C\": 1.0, \"fit_intercept\": true, \"random_state\": null, \"max_iter\": 100, \"verbose\": 0, \"warm_start\": false, \"n_jobs\": null, \"epsilon\": 1.0, \"data_norm\": 83.69469642643347, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}]}',\n", " 'feature_columns': ['bill_length_mm',\n", " 'bill_depth_mm',\n", " 'flipper_length_mm',\n", " 'body_mass_g'],\n", " 'target_columns': ['species'],\n", " 'test_size': 0.1,\n", " 'test_train_split_seed': 4,\n", " 'imputer_strategy': 'mean'},\n", " 'response': {'epsilon': 1.5,\n", " 'delta': 0.0,\n", " 'requested_by': 'Dr. 
Antartica',\n", " 'result': {'res_type': 'diffprivlib',\n", " 'score': 0.6,\n", " 'model': Pipeline(steps=[('scaler',\n", " StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),\n", " bounds=(array([ 30., 13., 150., 2000.]),\n", " array([ 65., 23., 250., 7000.])),\n", " epsilon=0.5)),\n", " ('classifier',\n", " LogisticRegression(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n", " data_norm=83.69469642643347))])}},\n", " 'timestamp': 1728464751.4143507}" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_1 = previous_queries[0]\n", "query_1" ] }, { "cell_type": "code", "execution_count": 80, "id": "ef251e47-67d8-426b-9655-c16d32778579", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'diffprivlib',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'diffprivlib_json': '{\"module\": \"diffprivlib\", \"version\": \"0.6.4\", \"pipeline\": [{\"type\": \"_dpl_type:StandardScaler\", \"name\": \"scaler\", \"params\": {\"with_mean\": true, \"with_std\": true, \"copy\": true, \"epsilon\": 0.5, \"bounds\": {\"_tuple\": true, \"_items\": [[30.0, 13.0, 150.0], [65.0, 23.0, 250.0]]}, \"random_state\": null, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}, {\"type\": \"_dpl_type:GaussianNB\", \"name\": \"gaussian\", \"params\": {\"priors\": {\"_tuple\": true, \"_items\": [0.3, 0.3, 0.4]}, \"var_smoothing\": 1e-09, \"epsilon\": 1.0, \"bounds\": {\"_tuple\": true, \"_items\": [[30.0, 13.0, 150.0], [65.0, 23.0, 250.0]]}, \"random_state\": null, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}]}',\n", " 'feature_columns': ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'],\n", " 'target_columns': ['species'],\n", " 'test_size': 0.15,\n", " 'test_train_split_seed': 1,\n", " 'imputer_strategy': 'median'},\n", " 'response': {'epsilon': 1.5,\n", " 'delta': 0.0,\n", " 'requested_by': 'Dr. 
Antartica',\n", " 'result': {'res_type': 'diffprivlib',\n", " 'score': 0.17307692307692307,\n", " 'model': Pipeline(steps=[('scaler',\n", " StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),\n", " bounds=(array([ 30., 13., 150.]),\n", " array([ 65., 23., 250.])),\n", " epsilon=0.5)),\n", " ('gaussian',\n", " GaussianNB(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),\n", " bounds=(array([ 30., 13., 150.]),\n", " array([ 65., 23., 250.])),\n", " priors=(0.3, 0.3, 0.4)))])}},\n", " 'timestamp': 1728464776.4250495}" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_2 = previous_queries[1]\n", "query_2" ] }, { "cell_type": "code", "execution_count": 81, "id": "b2fa8943-f0e0-4902-9001-bd10fb22a653", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'user_name': 'Dr. Antartica',\n", " 'dataset_name': 'PENGUIN',\n", " 'dp_librairy': 'diffprivlib',\n", " 'client_input': {'dataset_name': 'PENGUIN',\n", " 'diffprivlib_json': '{\"module\": \"diffprivlib\", \"version\": \"0.6.4\", \"pipeline\": [{\"type\": \"_dpl_type:RandomForestClassifier\", \"name\": \"rf\", \"params\": {\"n_estimators\": 10, \"n_jobs\": 1, \"random_state\": null, \"verbose\": 0, \"warm_start\": false, \"max_depth\": 5, \"epsilon\": 2.0, \"bounds\": {\"_tuple\": true, \"_items\": [[30.0, 13.0, 2000.0], [65.0, 23.0, 7000.0]]}, \"classes\": [\"Torgersen\", \"Biscoe\", \"Dream\"], \"shuffle\": false, \"accountant\": \"_dpl_instance:BudgetAccountant\"}}]}',\n", " 'feature_columns': ['bill_length_mm', 'bill_depth_mm', 'body_mass_g'],\n", " 'target_columns': ['island'],\n", " 'test_size': 0.2,\n", " 'test_train_split_seed': 1,\n", " 'imputer_strategy': 'drop'},\n", " 'response': {'epsilon': 2.0,\n", " 'delta': 0.0,\n", " 'requested_by': 'Dr. 
Antartica',\n", " 'result': {'res_type': 'diffprivlib',\n", " 'score': 0.417910447761194,\n", " 'model': Pipeline(steps=[('rf',\n", " RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),\n", " bounds=(array([ 30., 13., 2000.]),\n", " array([ 65., 23., 7000.])),\n", " classes=['Torgersen', 'Biscoe',\n", " 'Dream'],\n", " epsilon=2.0))])}},\n", " 'timestamp': 1728464804.2231035}" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query_3 = previous_queries[2]\n", "query_3" ] }, { "cell_type": "code", "execution_count": null, "id": "e2c8d40d-94b3-4d69-af99-0ec2936f233e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }