Lomas Client Side: Using DiffPrivlib

This notebook showcases how a researcher can use the lomas platform with DiffPrivLib. It explains the different functionalities provided by the lomas-client library to interact with the lomas server.

The secure data is never visible to researchers. They can only access differentially private responses via queries to the server.

Each user has access to one or multiple projects and, for each dataset, a limited budget defined by \(\epsilon\) and \(\delta\) values.

In this notebook, the researcher is a penguin researcher named Dr. Antartica. She aims to do groundbreaking research on various penguin data.

Step 1: Install the library

To interact with the secure server on which the data is stored, Dr. Antartica first needs to install the lomas-client library in her local development environment.

It can be installed via the pip command:

[1]:
# !pip install lomas_client

Or she can use a local version of the client:

[2]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))
[3]:
from lomas_client import Client
import numpy as np

Step 2: Initialise the client

Once the library is installed, a Client object must be created. It is responsible for sending requests to the server and processing responses in the local environment, enabling seamless interaction with the server.

To create the client, Dr. Antartica needs to give it a few parameters:

- url: the root application endpoint of the remote secure server.
- user_name: her name as registered in the database (Dr. Antartica).
- dataset_name: the name of the dataset that she wants to query (PENGUIN).

She will only be able to query the real dataset if the queen Icergina has previously created an account for her in the database, given her access to the PENGUIN dataset, and granted her some epsilon and delta credit (as is done in the Admin Notebook for Users and Datasets management).

[4]:
APP_URL = "http://lomas_server"
USER_NAME = "Dr. Antartica"
DATASET_NAME = "PENGUIN"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)

And that’s it for the preparation. She is now ready to use the various functionalities offered by lomas-client.

Step 3: Metadata and dummy dataset

Getting dataset metadata

Dr. Antartica has never seen the data. As a first step to understand what is available to her, she would like to check the metadata of the dataset. She just needs to call the get_dataset_metadata() function of the client. As this is public information, it does not cost any budget.

[5]:
penguin_metadata = client.get_dataset_metadata()
penguin_metadata
[5]:
{'max_ids': 1,
 'rows': 344,
 'row_privacy': True,
 'censor_dims': False,
 'columns': {'species': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Adelie', 'Chinstrap', 'Gentoo']},
  'island': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 3,
   'categories': ['Torgersen', 'Biscoe', 'Dream']},
  'bill_length_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 30.0,
   'upper': 65.0},
  'bill_depth_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 13.0,
   'upper': 23.0},
  'flipper_length_mm': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 150.0,
   'upper': 250.0},
  'body_mass_g': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'float',
   'precision': 64,
   'lower': 2000.0,
   'upper': 7000.0},
  'sex': {'private_id': False,
   'nullable': False,
   'max_partition_length': None,
   'max_influenced_partitions': None,
   'max_partition_contributions': None,
   'type': 'string',
   'cardinality': 2,
   'categories': ['MALE', 'FEMALE']}}}
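
For a quick overview, the metadata can also be summarised programmatically, for example by listing each column with its type and bounds or categories (an illustrative sketch that only reads the dictionary shown above):

for col, meta in penguin_metadata["columns"].items():
    # Float columns carry lower/upper bounds, string columns carry categories
    if meta["type"] == "float":
        print(f"{col}: float in [{meta['lower']}, {meta['upper']}]")
    else:
        print(f"{col}: {meta['type']} with categories {meta['categories']}")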

Step 4: Train Logistic Regression model with DiffPrivLib

We want to train an ML model to guess the species of penguins based on their bill length and depth, flipper length and body mass.

Therefore, we use a DiffPrivLib pipeline which:

- standard-scales the dimensions between the metadata bounds,
- then performs a logistic regression to predict the species of penguins.

[6]:
from sklearn.pipeline import Pipeline
from diffprivlib import models
import pandas as pd

Classification: Logistic Regression

Dr. Antartica wants to do a logistic regression on the feature columns ‘bill_length_mm’, ‘bill_depth_mm’, ‘flipper_length_mm’ and ‘body_mass_g’ to predict penguin species.

[7]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
target_columns = ['species']

She starts by writing the associated DiffPrivLib pipeline and tries it on the dummy dataset.

If a DiffprivlibCompatibilityWarning is raised by the DiffPrivLib library, a warning is shown the first time (as in DiffPrivLib itself) and the ‘wrong’ parameters are then ignored by the server.

[8]:
# DiffprivlibCompatibilityWarning Error expected
dpl_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon = 0.5)),
    ('classifier', models.LogisticRegression(epsilon = 1.0, svd_solver='full'))
])
/usr/local/lib/python3.12/site-packages/diffprivlib/utils.py:71: DiffprivlibCompatibilityWarning: Parameter 'svd_solver' is not functional in diffprivlib.  Remove this parameter to suppress this warning.
  warnings.warn(f"Parameter '{arg}' is not functional in diffprivlib.  Remove this parameter to suppress this "

To resolve the DiffprivlibCompatibilityWarning, the svd_solver parameter should not be set, as it is not functional in DiffPrivLib. If the user ignores such warnings, the default behaviour of DiffPrivLib is applied.

If a PrivacyLeakWarning is encountered, the query will not be processed by the server and an error will be returned.

[9]:
dpl_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon = 0.5)),
    ('classifier', models.LogisticRegression(epsilon = 1.0))
])
[10]:
# Expect PrivacyLeakWarning Error
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns =  feature_columns,
    target_columns = target_columns,
    dummy = True
)
---------------------------------------------------------------------------
ExternalLibraryException                  Traceback (most recent call last)
Cell In[10], line 2
      1 # Expect PrivacyLeakWarning Error
----> 2 dummy_response = client.diffprivlib.query(
      3     pipeline = dpl_pipeline,
      4     feature_columns =  feature_columns,
      5     target_columns = target_columns,
      6     dummy = True
      7 )

File /code/lomas_client/libraries/diffprivlib.py:153, in DiffPrivLibClient.query(self, pipeline, feature_columns, target_columns, test_size, test_train_split_seed, imputer_strategy, dummy, nb_rows, seed)
    150     r_model = QueryResponse.model_validate_json(data)
    151     return r_model
--> 153 raise_error(res)
    154 return None

File /code/lomas_client/utils.py:38, in raise_error(response)
     36     raise InvalidQueryException(error_message["InvalidQueryException"])
     37 if response.status_code == status.HTTP_422_UNPROCESSABLE_ENTITY:
---> 38     raise ExternalLibraryException(
     39         error_message["library"], error_message["ExternalLibraryException"]
     40     )
     41 if response.status_code == status.HTTP_403_FORBIDDEN:
     42     raise UnauthorizedAccessException(error_message["UnauthorizedAccessException"])

ExternalLibraryException: ('diffprivlib', "PrivacyLeakWarning: Bounds parameter hasn't been specified, so falling back to determining bounds from the data.\n This will result in additional privacy leakage.  To ensure differential privacy with no additional privacy loss, specify `bounds` for each valued returned by np.mean().. Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.")

DiffPrivLib requests that raise a PrivacyLeakWarning will not be processed by the server. In lomas, the bounds must always be specified. For most models, it is best to use a standard scaler as the first step of the pipeline and to fill its bounds based on the metadata values.

[11]:
def get_bounds(cols_metadata, columns):
    # Extract the (lower, upper) bounds of each column from the dataset metadata
    lower = [cols_metadata[col]["lower"] for col in columns]
    upper = [cols_metadata[col]["upper"] for col in columns]
    return (lower, upper)
[12]:
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
bounds
[12]:
([30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0])
[13]:
dpl_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),
    ('classifier', models.LogisticRegression(epsilon = 1.0))
])
[14]:
# Expect PrivacyLeakWarning Error
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    dummy = True
)
---------------------------------------------------------------------------
ExternalLibraryException                  Traceback (most recent call last)
Cell In[14], line 2
      1 # Expect PrivacyLeakWarning Error
----> 2 dummy_response = client.diffprivlib.query(
      3     pipeline = dpl_pipeline,
      4     feature_columns = feature_columns,
      5     target_columns = target_columns,
      6     dummy = True
      7 )

File /code/lomas_client/libraries/diffprivlib.py:153, in DiffPrivLibClient.query(self, pipeline, feature_columns, target_columns, test_size, test_train_split_seed, imputer_strategy, dummy, nb_rows, seed)
    150     r_model = QueryResponse.model_validate_json(data)
    151     return r_model
--> 153 raise_error(res)
    154 return None

File /code/lomas_client/utils.py:38, in raise_error(response)
     36     raise InvalidQueryException(error_message["InvalidQueryException"])
     37 if response.status_code == status.HTTP_422_UNPROCESSABLE_ENTITY:
---> 38     raise ExternalLibraryException(
     39         error_message["library"], error_message["ExternalLibraryException"]
     40     )
     41 if response.status_code == status.HTTP_403_FORBIDDEN:
     42     raise UnauthorizedAccessException(error_message["UnauthorizedAccessException"])

ExternalLibraryException: ('diffprivlib', 'PrivacyLeakWarning: Data norm has not been specified and will be calculated on the data provided.  This will result in additional privacy leakage. To ensure differential privacy and no additional privacy leakage, specify `data_norm` at initialisation.. Lomas server cannot fit pipeline on data, PrivacyLeakWarning is a blocker.')

Again, we have a privacy leak. For the same reason, the data_norm should be computed from the metadata and passed as an argument, as explained in the error message.

[15]:
# The max l2 norm of any row of the data. This defines the spread of data that will be protected by differential privacy.
data_norm = np.sqrt(np.linalg.norm(bounds[1]))
[16]:
dpl_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),
    ('classifier', models.LogisticRegression(epsilon = 1.0, data_norm = data_norm))
])
[17]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    dummy = True
)

The pipeline worked. She can check that she has a dummy model and an associated dummy score. In the case of a logistic regression, the score is the mean accuracy. Each model returns an associated score; its exact definition is given in the DiffPrivLib documentation, in the score method of each model.
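
For instance, the dummy score can be read directly from the response (a minimal sketch; result.score is the same field used on the real query later in this notebook):

# Mean accuracy of the model fitted on dummy data (the value itself is meaningless)
dummy_response.result.score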

[20]:
dummy_response.result.model
[20]:
Pipeline(steps=[('scaler',
                 StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),
                                bounds=(array([  30.,   13.,  150., 2000.]),
                                        array([  65.,   23.,  250., 7000.])),
                                epsilon=0.5)),
                ('classifier',
                 LogisticRegression(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),
                                    data_norm=83.69469642643347))])

Now that the pipeline seems to work, she also wants to choose another data imputation method: by default, missing data are dropped, but she wants to replace them with the mean. Therefore, she uses the imputer_strategy argument.

[21]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "mean",
    dummy = True
)

It also works. If she wanted, she could instead replace missing values with the median with imputer_strategy = "median" or with the most frequent value with imputer_strategy = "most_frequent" (most_frequent makes more sense for categorical columns), as sketched below.
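
For example, the same dummy query with most frequent value imputation only requires changing imputer_strategy (a sketch; everything else is unchanged):

dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "most_frequent",  # replace missing values with the most frequent one
    dummy = True
)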

Finally, she wants to use as much data as possible to train the model, so she decides to reduce test_size to 0.1 (meaning that 10% of the data will be used as the test set and 90% as the training set). She also modifies the seed for the random split between training and testing data, test_train_split_seed, because why not. By default, test_size = 0.2 and test_train_split_seed = 1.

[22]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.1,
    test_train_split_seed = 4,
    imputer_strategy = "mean",
    dummy = True
)

She can now estimate the cost of this pipeline:

[23]:
res = client.diffprivlib.cost(
    dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.1,
    test_train_split_seed = 4,
    imputer_strategy = "mean",
)
res
[23]:
CostResponse(epsilon=1.5, delta=0.0)
[25]:
f"The cost will be {res.epsilon} epsilon and {res.delta} delta."
[25]:
'The cost will be 1.5 epsilon and 0.0 delta.'
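
Before spending budget on the real data, she could also compare this cost with what she has left. The following is a hedged sketch: it assumes the lomas-client Client exposes a get_remaining_budget() method, which is not demonstrated in this notebook.

# Hypothetical call, assuming the client provides get_remaining_budget()
remaining_budget = client.get_remaining_budget()
remaining_budget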

Now we train the same pipeline on the real dataset.

[26]:
res = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.1,
    test_train_split_seed = 4,
    imputer_strategy = "mean",
)
[28]:
f"The accuracy score of the model trained on real data is {res.result.score}."
[28]:
'The accuracy score of the model trained on real data is 0.6.'

The trained model, with its fitted parameters, is also available:

[29]:
model = res.result.model

We predict what the species would be for the smallest possible penguin in all dimensions versus the biggest possible penguin in all dimensions.

[30]:
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bounds[0][0], bounds[1][0]],
    'bill_depth_mm': [bounds[0][1], bounds[1][1]] ,
    'flipper_length_mm': [bounds[0][2], bounds[1][2]],
    'body_mass_g': [bounds[0][3], bounds[1][3]]
})

predictions = model.predict(x_to_predict)
x_to_predict["predictions"] = predictions
x_to_predict
[30]:
   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g predictions
0            30.0           13.0              150.0       2000.0      Adelie
1            65.0           23.0              250.0       7000.0      Gentoo

Step 5: Train other models with DiffPrivLib

The logic is always the same for all models: the pipeline and feature_columns arguments must always be specified. The target_columns argument must also be specified, except for clustering (K-Means) and dimensionality reduction (PCA).

Here are examples of each on dummy dataframes.

Classification: Gaussian Naive Bayes

[31]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
target_columns = ['species']
[32]:
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
[33]:
dpl_pipeline = Pipeline([
    ('scaler', models.StandardScaler(epsilon = 0.5, bounds=bounds)),
    ('gaussian', models.GaussianNB(epsilon = 1.0, bounds=bounds, priors = (0.3, 0.3, 0.4))),
])
[34]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.15,
    imputer_strategy = "median",
    dummy = True
)
[35]:
cost_res = client.diffprivlib.cost(
    dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.15,
    imputer_strategy = "median",
)
cost_res
[35]:
CostResponse(epsilon=1.5, delta=0.0)
[36]:
response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "median",
    test_size = 0.15,
)
[37]:
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bounds[0][0], bounds[1][0]],
    'bill_depth_mm': [bounds[0][1], bounds[1][1]] ,
    'flipper_length_mm': [bounds[0][2], bounds[1][2]],
})
[39]:
predictions = response.result.model.predict(x_to_predict)
x_to_predict["predictions"] = predictions
x_to_predict
[39]:
   bill_length_mm  bill_depth_mm  flipper_length_mm predictions
0            30.0           13.0              150.0   Chinstrap
1            65.0           23.0              250.0   Chinstrap

Random Forest

[40]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'body_mass_g']
target_columns = ['island']
[41]:
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
[42]:
dpl_pipeline = Pipeline([
    (
        'rf',
        models.RandomForestClassifier(
            n_estimators=10,
            epsilon = 2.0,
            bounds=bounds,
            classes=penguin_metadata['columns']['island']['categories']
        )
    ),
])
[43]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "drop", #default
    dummy = True
)
[44]:
cost_res = client.diffprivlib.cost(
    dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "drop", #default
)
cost_res
[44]:
CostResponse(epsilon=2.0, delta=0.0)
[45]:
response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    imputer_strategy = "drop", #default
)
[46]:
model = response.result.model
[47]:
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bounds[0][0], bounds[1][0]],
    'bill_depth_mm': [bounds[0][1], bounds[1][1]] ,
    'body_mass_g': [bounds[0][2], bounds[1][2]]
})
predictions = model.predict(x_to_predict)
x_to_predict["predictions"] = predictions
x_to_predict
[47]:
   bill_length_mm  bill_depth_mm  body_mass_g predictions
0            30.0           13.0       2000.0      Biscoe
1            65.0           23.0       7000.0   Torgersen

Decision Tree Classifier

[48]:
feature_columns = ['bill_length_mm', 'body_mass_g']
target_columns = ['species']
[49]:
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
[50]:
dpl_pipeline = Pipeline([
    (
        'dtc',
        models.DecisionTreeClassifier(
            epsilon = 2.0,
            bounds=bounds,
            classes=penguin_metadata['columns']['species']['categories']
        )
    ),
])
[51]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.2,
    test_train_split_seed = 1,
    dummy = True,
    nb_rows = 100,
    seed = 42
)
[52]:
response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    test_size = 0.2,
)
[53]:
model = response.result.model
[54]:
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bounds[0][0], bounds[1][0]],
    'body_mass_g': [bounds[0][1], bounds[1][1]] ,
})
x_to_predict["predictions"] = model.predict(x_to_predict)
x_to_predict
[54]:
   bill_length_mm  body_mass_g predictions
0            30.0       2000.0      Gentoo
1            65.0       7000.0   Chinstrap

Regression: Linear Regression

[55]:
feature_columns = ['bill_length_mm']
target_columns = ['bill_depth_mm']
[56]:
bill_length_meta = penguin_metadata['columns']['bill_length_mm']
bill_depth_meta = penguin_metadata['columns']['bill_depth_mm']
[57]:
dpl_pipeline = Pipeline([
    (
        'lr',
        models.LinearRegression(
            epsilon = 2.0,
            bounds_X=(bill_length_meta['lower'], bill_length_meta['upper']),
            bounds_y=(bill_depth_meta['lower'], bill_depth_meta['upper'])
        )
    ),
])
[58]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    target_columns = target_columns,
    dummy = True
)
model = dummy_response.result.model
[59]:
# Dummy model predictions
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bill_length_meta['lower'], bill_length_meta['upper']],
})
x_to_predict["predictions"] = model.predict(x_to_predict)
x_to_predict
[59]:
   bill_length_mm  predictions
0            30.0    17.985419
1            65.0    17.489243

Clustering: K-Means

[60]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
[61]:
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
[62]:
dpl_pipeline = Pipeline([
    ('kmeans', models.KMeans(n_clusters = 8, epsilon = 2.0, bounds=bounds)),
])
[63]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    dummy = True
)
model = dummy_response.result.model
model
[63]:
Pipeline(steps=[('kmeans',
                 KMeans(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),
                        bounds=(array([  30.,   13.,  150., 2000.]),
                                array([  65.,   23.,  250., 7000.])),
                        epsilon=2.0))])
[64]:
# Dummy model predictions
x_to_predict = pd.DataFrame({
    'bill_length_mm': [bounds[0][0], bounds[1][0]],
    'bill_depth_mm': [bounds[0][1], bounds[1][1]] ,
    'flipper_length_mm': [bounds[0][2], bounds[1][2]],
    'body_mass_g': [bounds[0][3], bounds[1][3]]
})
x_to_predict["predictions"] = model.predict(x_to_predict)
x_to_predict
[64]:
   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  predictions
0            30.0           13.0              150.0       2000.0            3
1            65.0           23.0              250.0       7000.0            2

Dimensionality Reduction: PCA

[65]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
bounds = get_bounds(penguin_metadata['columns'], columns=feature_columns)
[66]:
dpl_pipeline = Pipeline([
    (
        'pca',
        models.PCA(
            n_components=None,
            epsilon = 1.0,
            bounds=bounds,
            data_norm=100,
            centered=False
        )
    ),
])
[67]:
dummy_response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
    dummy = True
)
model = dummy_response.result.model
[69]:
response = client.diffprivlib.query(
    pipeline = dpl_pipeline,
    feature_columns = feature_columns,
)
model = response.result.model
[70]:
pca_model = model.steps[0][1]
pca_model
[70]:
PCA(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),
    bounds=(array([  30.,   13.,  150., 2000.]),
            array([  65.,   23.,  250., 7000.])),
    data_norm=100)
[71]:
pca_model.components_
[71]:
array([[-0.08124548, -0.11603131, -0.06907006,  0.98750455],
       [-0.37988112,  0.74377432,  0.54189149,  0.09404104],
       [-0.11526053, -0.62023805,  0.77538936, -0.0281267 ],
       [ 0.91422345,  0.22054765,  0.31678744,  0.12328802]])
[72]:
pca_model.explained_variance_
[72]:
array([8914.48046055, 1029.95283494,  241.10575291,   94.79455338])
[73]:
pca_model.explained_variance_ratio_
[73]:
array([0.86713922, 0.10018671, 0.02345311, 0.00922096])
[74]:
pca_model.singular_values_
[74]:
array([1536.98969484,  252.77069553,  158.4946581 ,  522.43420759])
[75]:
pca_model.mean_
[75]:
array([  43.71211178,   16.5604082 ,  189.09819947, 4237.9468197 ])
[76]:
pca_model.n_components_
[76]:
4
[77]:
pca_model.noise_variance_
[77]:
0.0

Step 6: See archives of queries

She now wants to review all the queries that she made on the real data. This is possible because an archive of all queries is kept in a secure database. With a single function call, she can see her queries, the budget they spent and the associated responses.

[78]:
previous_queries = client.get_previous_queries()
[79]:
query_1 = previous_queries[0]
query_1
[79]:
{'user_name': 'Dr. Antartica',
 'dataset_name': 'PENGUIN',
 'dp_librairy': 'diffprivlib',
 'client_input': {'dataset_name': 'PENGUIN',
  'diffprivlib_json': '{"module": "diffprivlib", "version": "0.6.4", "pipeline": [{"type": "_dpl_type:StandardScaler", "name": "scaler", "params": {"with_mean": true, "with_std": true, "copy": true, "epsilon": 0.5, "bounds": {"_tuple": true, "_items": [[30.0, 13.0, 150.0, 2000.0], [65.0, 23.0, 250.0, 7000.0]]}, "random_state": null, "accountant": "_dpl_instance:BudgetAccountant"}}, {"type": "_dpl_type:LogisticRegression", "name": "classifier", "params": {"tol": 0.0001, "C": 1.0, "fit_intercept": true, "random_state": null, "max_iter": 100, "verbose": 0, "warm_start": false, "n_jobs": null, "epsilon": 1.0, "data_norm": 83.69469642643347, "accountant": "_dpl_instance:BudgetAccountant"}}]}',
  'feature_columns': ['bill_length_mm',
   'bill_depth_mm',
   'flipper_length_mm',
   'body_mass_g'],
  'target_columns': ['species'],
  'test_size': 0.1,
  'test_train_split_seed': 4,
  'imputer_strategy': 'mean'},
 'response': {'epsilon': 1.5,
  'delta': 0.0,
  'requested_by': 'Dr. Antartica',
  'result': {'res_type': 'diffprivlib',
   'score': 0.6,
   'model': Pipeline(steps=[('scaler',
                    StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),
                                   bounds=(array([  30.,   13.,  150., 2000.]),
                                           array([  65.,   23.,  250., 7000.])),
                                   epsilon=0.5)),
                   ('classifier',
                    LogisticRegression(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),
                                       data_norm=83.69469642643347))])}},
 'timestamp': 1728464751.4143507}
[80]:
query_2 = previous_queries[1]
query_2
[80]:
{'user_name': 'Dr. Antartica',
 'dataset_name': 'PENGUIN',
 'dp_librairy': 'diffprivlib',
 'client_input': {'dataset_name': 'PENGUIN',
  'diffprivlib_json': '{"module": "diffprivlib", "version": "0.6.4", "pipeline": [{"type": "_dpl_type:StandardScaler", "name": "scaler", "params": {"with_mean": true, "with_std": true, "copy": true, "epsilon": 0.5, "bounds": {"_tuple": true, "_items": [[30.0, 13.0, 150.0], [65.0, 23.0, 250.0]]}, "random_state": null, "accountant": "_dpl_instance:BudgetAccountant"}}, {"type": "_dpl_type:GaussianNB", "name": "gaussian", "params": {"priors": {"_tuple": true, "_items": [0.3, 0.3, 0.4]}, "var_smoothing": 1e-09, "epsilon": 1.0, "bounds": {"_tuple": true, "_items": [[30.0, 13.0, 150.0], [65.0, 23.0, 250.0]]}, "random_state": null, "accountant": "_dpl_instance:BudgetAccountant"}}]}',
  'feature_columns': ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'],
  'target_columns': ['species'],
  'test_size': 0.15,
  'test_train_split_seed': 1,
  'imputer_strategy': 'median'},
 'response': {'epsilon': 1.5,
  'delta': 0.0,
  'requested_by': 'Dr. Antartica',
  'result': {'res_type': 'diffprivlib',
   'score': 0.17307692307692307,
   'model': Pipeline(steps=[('scaler',
                    StandardScaler(accountant=BudgetAccountant(spent_budget=[(0.5, 0)]),
                                   bounds=(array([ 30.,  13., 150.]),
                                           array([ 65.,  23., 250.])),
                                   epsilon=0.5)),
                   ('gaussian',
                    GaussianNB(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),
                               bounds=(array([ 30.,  13., 150.]),
                                       array([ 65.,  23., 250.])),
                               priors=(0.3, 0.3, 0.4)))])}},
 'timestamp': 1728464776.4250495}
[81]:
query_3 = previous_queries[2]
query_3
[81]:
{'user_name': 'Dr. Antartica',
 'dataset_name': 'PENGUIN',
 'dp_librairy': 'diffprivlib',
 'client_input': {'dataset_name': 'PENGUIN',
  'diffprivlib_json': '{"module": "diffprivlib", "version": "0.6.4", "pipeline": [{"type": "_dpl_type:RandomForestClassifier", "name": "rf", "params": {"n_estimators": 10, "n_jobs": 1, "random_state": null, "verbose": 0, "warm_start": false, "max_depth": 5, "epsilon": 2.0, "bounds": {"_tuple": true, "_items": [[30.0, 13.0, 2000.0], [65.0, 23.0, 7000.0]]}, "classes": ["Torgersen", "Biscoe", "Dream"], "shuffle": false, "accountant": "_dpl_instance:BudgetAccountant"}}]}',
  'feature_columns': ['bill_length_mm', 'bill_depth_mm', 'body_mass_g'],
  'target_columns': ['island'],
  'test_size': 0.2,
  'test_train_split_seed': 1,
  'imputer_strategy': 'drop'},
 'response': {'epsilon': 2.0,
  'delta': 0.0,
  'requested_by': 'Dr. Antartica',
  'result': {'res_type': 'diffprivlib',
   'score': 0.417910447761194,
   'model': Pipeline(steps=[('rf',
                    RandomForestClassifier(accountant=BudgetAccountant(spent_budget=[(2.0, 0)]),
                                           bounds=(array([  30.,   13., 2000.]),
                                                   array([  65.,   23., 7000.])),
                                           classes=['Torgersen', 'Biscoe',
                                                    'Dream'],
                                           epsilon=2.0))])}},
 'timestamp': 1728464804.2231035}
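
Since each archive entry records the budget it spent, she can also tally her total expenditure across queries (a minimal sketch over the list returned by get_previous_queries()):

# Sum the budget spent by all archived queries on this dataset
total_epsilon = sum(query["response"]["epsilon"] for query in previous_queries)
total_delta = sum(query["response"]["delta"] for query in previous_queries)
f"Total spent on PENGUIN: {total_epsilon} epsilon and {total_delta} delta."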
[ ]: