Advanced Usage¶
In addition to these tutorials, you can find more basic examples in User Guide - Basic Usage and use-cases in User Guide - Use-Cases.
Import CSV to Dataverse¶
This tutorial will show you how to mass-import metadata from pyDataverse’s own CSV format (see CSV templates), create pyDataverse objects from it (Datasets and Datafiles) and upload the data and metadata through the API.
The CSV format in this case can work as an exchange format, or a kind of bridge, between all kinds of data formats and programming languages. Note that it can be filled in directly by humans who collect data manually (such as in digitization projects) as well as through more common automation workflows.
Prepare¶
Requirements
- pyDataverse installed (see Installation)
Information
- Follow the order of code execution
- Dataverse Docker 4.18.1 used
- pyDataverse 0.3.0 used
- API responses may vary by each request and Dataverse installation!
Warning
Do not execute the example code on a Dataverse production instance unless you are absolutely sure about what you are doing!
Additional Resources
- CSV templates from src/pyDataverse/templates/ are used (see CSV templates)
- Data from tests/data/user-guide/ is used (GitHub repo)
Adapt CSV template¶
Add metadata to the CSV files¶
After preparing the CSV files, the metadata will need to be collected (manually or programmatically). No matter the origin or the format, each row must contain one entity (Dataverse collection, Dataset or Datafile).
As mentioned under “Additional Resources”, this tutorial uses prepared data, placed in the root directory. You can either use our files or fill in your own metadata and use your own datafiles. Whichever you choose, you must have properly formatted CSV files (datasets.csv and datafiles.csv) before moving on.
Don’t forget: Some columns must be entered in a JSON format!
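To illustrate what a JSON-formatted cell looks like, here is a minimal sketch using only the standard library. The column names (org.dataset_id, title, keywords) are illustrative stand-ins, not the exact headers of the pyDataverse templates:

```python
import csv
import io
import json

# Illustrative column names; the real templates define their own headers.
fieldnames = ["org.dataset_id", "title", "keywords"]
row = {
    "org.dataset_id": "1",
    "title": "My first dataset",
    # A JSON cell: the cell's content must be a valid JSON string.
    "keywords": json.dumps([{"keywordValue": "testing"}]),
}

# Write a one-row CSV to an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

# Reading it back: plain cells stay strings, JSON cells need json.loads().
buf.seek(0)
parsed = next(csv.DictReader(buf))
keywords = json.loads(parsed["keywords"])
print(keywords[0]["keywordValue"])  # testing
```

This is the same parsing work that read_csv_as_dicts() later does for you on the real template columns.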
Add datafiles¶
Collect the files listed in the org.filename cells of datafiles.csv and place them in the root directory (or any other specified directory).
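A quick pre-flight check can save a failed upload later. The helper below is a hypothetical sketch (not part of pyDataverse) that assumes datafiles.csv has an org.filename column, as in the tutorial:

```python
import csv
import os


def missing_datafiles(csv_path, base_dir):
    """Return the org.filename entries that are not present in base_dir.

    Hypothetical pre-flight check; assumes the CSV has an
    "org.filename" column, as in this tutorial's template.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            row["org.filename"]
            for row in csv.DictReader(f)
            if not os.path.isfile(os.path.join(base_dir, row["org.filename"]))
        ]
```

If the returned list is empty, every file referenced in datafiles.csv is in place.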
Import CSV files¶
Import the CSV files with read_csv_as_dicts(). This creates a list of dicts, automatically imports the Dataverse Software’s own metadata attributes (dv. prefix), converts boolean values, and loads JSON cells properly.
>>> import os
>>> from pyDataverse.utils import read_csv_as_dicts
>>> csv_datasets_filename = "datasets.csv"
>>> ds_data = read_csv_as_dicts(csv_datasets_filename)
>>> csv_datafiles_filename = "datafiles.csv"
>>> df_data = read_csv_as_dicts(csv_datafiles_filename)
Once we have the data in Python, we can easily import it into pyDataverse. For this, loop over each Dataset dict to create a Dataset object, set() its metadata, and append the object to a list:
>>> from pyDataverse.models import Dataset
>>> ds_lst = []
>>> for ds in ds_data:
...     ds_obj = Dataset()
...     ds_obj.set(ds)
...     ds_lst.append(ds_obj)
To import the Datafiles, do the same with df_data: create a Datafile object, set() the Datafile metadata, and append it.
>>> from pyDataverse.models import Datafile
>>> df_lst = []
>>> for df in df_data:
...     df_obj = Datafile()
...     df_obj.set(df)
...     df_lst.append(df_obj)
Upload data via API¶
Before we can upload metadata and data, we need to create an instance of NativeApi. Replace the following variables with your own Dataverse installation’s data before executing the lines:
- BASE_URL: Base URL of your Dataverse installation, without a trailing slash (e.g. https://data.aussda.at)
- API_TOKEN: API token of a Dataverse user with the rights to create a Dataset and upload Datafiles
>>> from pyDataverse.api import NativeApi
>>> api = NativeApi(BASE_URL, API_TOKEN)
Loop over the list of Datasets, upload the metadata with create_dataset(), and collect each dataset_id and its pid in dataset_id_2_pid.
Note: The Dataverse collection assigned to dv_alias must be published in order to add a Dataset to it.
>>> dv_alias = ":root:"
>>> dataset_id_2_pid = {}
>>> for ds in ds_lst:
...     resp = api.create_dataset(dv_alias, ds.json())
...     dataset_id_2_pid[ds.get()["org.dataset_id"]] = resp.json()["data"]["persistentId"]
Dataset with pid 'doi:10.5072/FK2/WVMDFE' created.
The API requests always return a requests.Response object, which can then be used to extract the data.
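The usual pattern is to check the status code before extracting data. The sketch below shows that pattern with a stand-in class (StubResponse is invented here so the snippet runs without a live server); in the real workflow, resp comes back from api.create_dataset() and offers the same requests.Response interface (status_code, text, json()):

```python
import json


class StubResponse:
    """Stand-in for requests.Response, for illustration only."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


# In the real workflow this comes from api.create_dataset(...); the payload
# mirrors the shape used in this tutorial.
resp = StubResponse(
    201,
    '{"status": "OK", "data": {"id": 444, "persistentId": "doi:10.5072/FK2/WVMDFE"}}',
)

# Check the status code before touching the payload.
if resp.status_code in (200, 201):
    pid = resp.json()["data"]["persistentId"]
else:
    raise RuntimeError(f"Request failed ({resp.status_code}): {resp.text}")
```
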
Next, we’ll do the same for the list of Datafiles with upload_datafile(). In addition to the metadata, the PID (persistent identifier, usually the DOI) and the filename must be passed.
>>> for df in df_lst:
...     pid = dataset_id_2_pid[df.get()["org.dataset_id"]]
...     filename = os.path.join(os.getcwd(), df.get()["org.filename"])
...     df.set({"pid": pid, "filename": filename})
...     resp = api.upload_datafile(pid, filename, df.json())
Now we have created all the Datasets listed in datasets.csv and uploaded all the Datafiles we placed in the root directory to the Dataverse installation.
Publish Datasets via API¶
Finally, we iterate over all Datasets and publish them with
publish_dataset()
.
>>> for dataset_id, pid in dataset_id_2_pid.items():
...     resp = api.publish_dataset(pid, "major")
...     resp.json()
Dataset doi:10.5072/FK2/WVMDFE published
{'status': 'OK', 'data': {'id': 444, 'identifier': 'FK2/WVMDFE', 'persistentUrl': 'https://doi.org/10.5072/FK2/WVMDFE', 'protocol': 'doi', 'authority': '10.5072', 'publisher': 'Root', 'publicationDate': '2021-01-13', 'storageIdentifier': 'file://10.5072/FK2/WVMDFE'}}
The Advanced Usage tutorial is now finished! If you want to revisit basic examples and use-cases, you can do so in User Guide - Basic Usage and User Guide - Use-Cases.