Getting Started¶
How to get started using the ENTR data stack.
Installing the ENTR Runtime¶
This section contains information on how to install and begin using the ENTR Runtime for analysis. For instructions on how to develop the components of the ENTR runtime itself, please see the Developer Setup section.
Pull and Run the Image¶
Install Docker Desktop on your workstation (see instructions).
We recommend following this guide to install Docker on Windows. After installing the WSL 2 backend and Docker you should be able to run containers using Windows PowerShell.
Pull our image from the GitHub Container Registry:
docker pull ghcr.io/entralliance/entr_runtime:latest
Note: There are numerous security considerations when pulling and running images from the public internet. Users should take the necessary steps to ensure operational security.
Run the entr runtime container, forwarding the necessary ports:
docker run -p 8888:8888 ghcr.io/entralliance/entr_runtime:latest
Open the Jupyter link printed to the terminal in your web browser.
Building your own Image¶
In most cases, we recommend using the pre-built entr_runtime image available from the GitHub Container Registry. If you need to rebuild the image yourself, follow the instructions below:
Install Git and Docker Desktop on your workstation.
Clone the ENTR Runtime repository and check out the dev branch:
git clone git@github.com:entralliance/entr_runtime.git
cd entr_runtime
git checkout dev
From the entr_runtime directory, run the following, replacing yourname with your username:
docker build -t yourname/entr-runtime docker
Note: Use the --no-cache option to force rebuilding of each layer.
Run the image you just built:
docker run -p 8888:8888 yourname/entr-runtime
Developing with the ENTR Runtime¶
This section contains information on how to begin developing the components of the ENTR Runtime environment.
Manually Running ENTR in Dev Mode¶
The ENTR runtime contains the following preinstalled components: OpenOA, entr_warehouse, and py-entr. To develop these components, check out development versions of these packages to your local filesystem, start the entr image with those paths mounted as volumes, and then install the packages from the mounted volumes in editable mode. This allows you to edit the code in these components on your local machine and see the changes immediately reflected in the runtime. If $ENTR_HOME is the directory you'd like to work from:
cd $ENTR_HOME
git clone https://github.com/entralliance/entr_warehouse.git
git clone https://github.com/entralliance/OpenOA.git
git clone https://github.com/entralliance/py_entr.git
git clone https://github.com/entralliance/entr_runtime.git
Optionally, build the entr image. You can also use the dev image from the container registry as discussed in the quickstart guide.
Now, start the entr container in dev mode, mapping the directories you checked out to paths within the container:
docker run -p 8888:8888 -v $ENTR_HOME/OpenOA:/home/jovyan/src/OpenOA -v $ENTR_HOME/entr_warehouse:/home/jovyan/src/entr_warehouse -v $ENTR_HOME/py_entr:/home/jovyan/src/py-entr ghcr.io/entralliance/entr_runtime:latest
Once inside the container, you will need to re-install OpenOA in editable mode, or run dbt run as needed to materialize any changes to the dbt model code in the warehouse.
To install OpenOA in editable mode:
cd /home/jovyan/src/OpenOA
pip install -e .
To re-run dbt:
cd /home/jovyan/src/entr_warehouse
dbt run
Updating the Warehouse¶
Changes to the warehouse may require re-running dbt. To do this:
1. Open a terminal from Jupyter (File > New > Terminal) and navigate to the directory where your dbt project is installed (see the "Assumed Repository Structure" section below):
cd ~/src/entr_warehouse
2. Run dbt debug to test your connection to the Spark warehouse.
3. Once the connection to the warehouse is confirmed, install the dbt packages for your project:
dbt deps
4. Seed the metadata tables contained in the entr_warehouse repo to instantiate them in the Spark warehouse:
dbt seed
5. (Re-)register example or newly added source data files:
dbt run-operation stage_external_sources
6. Build all models in the Spark warehouse:
dbt run
The models can now be consumed by any application connected to the Spark warehouse, such as OpenOA.
Advanced Topics¶
Extra ports:
docker run -p 8888:8888 -p 8080:8080 -p 4040:4040 entr/entr-runtime
Override OpenOA and entr_warehouse with local versions:
docker run -p 8888:8888 -p 8080:8080 -p 4040:4040 -v <path-to-local-clone-of-OpenOA>:/home/jovyan/src/OpenOA -v <path-to-local-clone-of-entr_warehouse>:/home/jovyan/src/entr_warehouse entr/entr-runtime-dev
Note: you will then need to re-install OpenOA in editable mode, or run dbt run as needed to update the container with the new code.
Beeline connect string for ENTR warehouse:
beeline
!connect jdbc:hive2://localhost:10000
Using VSCode Dev Container¶
We provide an example VSCode Dev Container config which can be used to get up and running quickly developing the ENTR platform. This is the recommended method if you use VSCode and are familiar with VSCode Dev Containers.
git clone https://github.com/jordanperr/entr_dev_environment.git
cd entr_dev_environment
git submodule update --init --remote
Then, open the project with VSCode and follow the prompts to initialize the dev container.
Run ENTR on Your Data¶
If you're already using dbt, you should install dbt-openoa in your dbt project and follow the guidelines below for how to build models that feed the ENTR transformation pipeline and leverage OpenOA.
The ENTR Warehouse is an example dbt project available in the ENTR Runtime. If you aren't already using dbt, the easiest way to test the functionality of the ENTR data stack on your own data is to load it as a CSV, as described in the sections below.
Welcome to the ENTR Data Warehouse¶
Background¶
The ENTR Warehouse is a dbt project with the goal of providing common data formats and transformation methods upon which renewable energy industry users can build and share analytical applications. Once industry users integrate their data into the generic fact and dimension tables in the ENTR model, they can use any associated applications built on top of the standard ENTR table schema.
Getting Started¶
The ENTR Runtime Docker image contains all of the dependencies needed for this tutorial, including a standalone Apache Spark warehouse that can be used for running everything contained within the ENTR Warehouse dbt project. See the installation instructions above for how to build and set up the ENTR Runtime.
ENTR Data Model¶
dbt docs for the entr_warehouse dbt project can be found at https://entralliance.github.io/entr_warehouse. This interface is useful for exploring and understanding the ENTR data model.
How to Bring Your Own Data¶
Note: the following steps require at least basic experience with building models in dbt.
Loading New Data from Files¶
The ENTR Runtime image contains pre-built models defined by the ENTR Warehouse based on open-source example data; however, for users wishing to bring their own data, the ENTR Warehouse supports setup of new sources from CSV and other Spark-readable file types by leveraging the dbt-external-tables package from dbt-labs.
1. With a clone of this entr_warehouse project mounted to the ENTR Runtime, drop a copy of the file you'd like to process through the ENTR data model into the data/ directory.
2. Within the models/staging/ directory, write out the source definition for the new file within a YML file in the staging directory, using the dbt-external-tables guides as needed. Note: the new files can be added to any YML file in the models folder but must be mapped under the entr_warehouse source:
sources:
  - name: entr_warehouse
    tables:
      - name: <new table name>
        description: <description of new source table>
        external:
          location: '<path to data file within the container>' # e.g. "/home/jovyan/src/entr_warehouse/data/la_haute_borne_plant_data_sample.csv" - this depends on where you've mounted the entr_warehouse dir in the container
          using: csv # specify for different file types accordingly
          options:
            header: 'true' # optional but used with the ENTR sample data
3. Run dbt run-operation stage_external_sources to make the file available as a table in the ENTR runtime Spark warehouse and as a source relation in dbt, from which you can start building further transformations (see the sketch after this list).
4. See the four files within the data/ folder and their corresponding source definitions within the entr_sample_data.yml file for examples.
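Once registered, the new source can be referenced from dbt models. The snippet below is a minimal sketch only, assuming a source table registered as my_plant_scada (a hypothetical name) under the entr_warehouse source defined above:
-- hypothetical model file, e.g. models/staging/my_plant/int_my_plant_scada__source.sql
-- selects from the external table registered via dbt run-operation stage_external_sources
select *
from {{ source('entr_warehouse', 'my_plant_scada') }}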
Transforming New Data to ENTR Standard Formats¶
Once the new file is set up as a source, you will need to transform the data into the standard ENTR fact table format. To build the dbt transformations, you'll need to define and map the dimensional components of the new data using the standard ENTR dimension table formats.
1. (Optional) Create an Intermediate Model to Facilitate Table Reshaping¶
You'll likely notice that the initial step in the transformation of the example sources (files) is just type casting (see the examples within the models/staging/entr_sample_data/intermediate directory, e.g. the int_entr_scada_sample__cast model, which casts the raw data as a preliminary step); a rough sketch of a similar cast model follows this list. This preliminary step:
- Prepares the data for reshaping/pivoting; we expect this will be a frequently necessary staging step for source files with tags corresponding to data types
- Facilitates the assignment of dimensional keys (see below for further detail)
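As a rough sketch only (not part of the ENTR Warehouse), an intermediate cast model for a hypothetical new SCADA file could look like the following; the source name my_plant_scada and the column names date_time, wind_turbine_name, and p_avg are placeholders for the fields in your own data:
-- hypothetical cast model, e.g. models/staging/my_plant/intermediate/int_my_plant_scada__cast.sql
with source_data as (
    -- raw external table registered in the previous section
    select * from {{ source('entr_warehouse', 'my_plant_scada') }}
)
select
    cast(date_time as timestamp) as date_time,               -- timestamps often arrive as strings in CSVs
    cast(wind_turbine_name as string) as wind_turbine_name,  -- asset identifier from the source file
    cast(p_avg as double) as p_avg                            -- numeric tag value
from source_data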
2. Establish Link Between New Facts and Dimensions¶
For the assignment of dimensional foreign keys, which is required for all ENTR Warehouse fact tables in their current state, the ENTR dimensions must be extended. For example, if the data you’re preparing is not from La Haute Borne, a new plant must be added as a record within the seeds/seed_asset_plant dbt CSV seed file in order for the transformations to run properly, and the same goes for the other dimensional data assignments.
Note: not every field within the dimensions will be useful or used in analyses or transformations, so it may be ok to leave some blank to start depending on your use case
In addition to extending the seeded ENTR dimensions, it may be useful or necessary to seed mapping files that are specific to a source to facilitate the translation of data into ENTR vernacular; we expect this to most commonly be done for mapping identifiers in the data you bring to ENTR tag IDs within the ENTR dimension
For example, mapping seed files in this project map the fields of the four La Haute Borne example data sets to ENTR tags; a sketch of this pattern is shown below. These tables are used to join the ENTR tag IDs onto the appropriate fields once the source table has been reshaped/unpivoted so that the original column names appear in a field.
Note: we don’t yet have standards defined for creating new ENTR tags, but that functionality will be coming soon
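To illustrate the mapping-seed pattern described above, here is a hedged sketch in which an unpivoted intermediate model is joined to a tag-mapping seed; the model name int_my_plant_scada__unpivot, the seed name my_plant_tag_map, and all column names are hypothetical placeholders rather than objects defined in the ENTR Warehouse:
-- hypothetical sketch: attach ENTR tag IDs to unpivoted source data via a mapping seed
select
    src.date_time,
    src.wind_turbine_name,
    tag_map.entr_tag_id,   -- ENTR tag ID looked up from the mapping seed
    src.value
from {{ ref('int_my_plant_scada__unpivot') }} as src
left join {{ ref('my_plant_tag_map') }} as tag_map
    on src.source_field_name = tag_map.source_field_name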
3. Align Staging Model with Associated ENTR Fact Table Schema¶
Once all metadata about the newly loaded file is available, the last staging step is transforming the data into the relevant generic ENTR fact table schema, which can be found in this project's dbt docs. For example, fct_entr_wtg_scada is the generic wind turbine SCADA data fact table, and the staging model stg_entr_scada_sample performs the final transformation on the example SCADA data from La Haute Borne to make it match the fct_entr_wtg_scada schema. The full list of current generic ENTR fact tables is available in the same dbt docs.
Note: the fct_entr_reanalysis_data model is a good example showing 2 staged data sets (the ERA5 and MERRA-2 La Haute Borne reanalysis data) flowing into the same generic ENTR fact table; it provides guidance for the following step.
4. Add Newly Staged Data to ENTR Fact Table¶
Once a staging model has been created for your new source data that matches the associated generic ENTR fact table schema, you just need to union that new staging model into the generic ENTR fact table to make the new data ready for consumption by ENTR-based applications. The fct_entr_reanalysis_data model shows how multiple staging models are combined in the generic ENTR reanalysis fact table.
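As a sketch only, assuming your new staging model is named stg_my_plant_scada (hypothetical) and matches the column order and types of the generic fact table, the union amounts to adding one select to the fact model (stg_entr_scada_sample is the existing example staging model mentioned above):
-- sketch: extend a generic ENTR fact model by unioning in a new staging model
select * from {{ ref('stg_entr_scada_sample') }}
union all
select * from {{ ref('stg_my_plant_scada') }}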
Resources¶
Learn more about dbt from the dbt docs
Check out the dbt Discourse for commonly asked questions and answers
Join the dbt Slack chat for live discussions and support
Find dbt events near you
Check out the dbt blog for the latest news on dbt’s development and best practices
Running OpenOA on ENTR¶
The ENTR Runtime includes example analysis notebooks that demonstrate operational wind plant data analytics use cases using OpenOA with example data stored in the ENTR warehouse. The example notebooks are located at /examples in the ENTR Runtime Docker workspace. All examples use two years of data for the 4-turbine “La Haute Borne” wind plant.
Running the Examples¶
Complete the Installation section and open Jupyter Hub in your web browser.
On the left-hand side of Jupyter Hub, navigate to the examples directory.
Double click on any example notebook to open it.
List of Examples¶
The ENTR runtime contains two example notebooks, which can be found in the examples directory.
OpenOA documentation¶
OpenOA documentation is hosted on ReadTheDocs.
Data are stored and organized in OpenOA using a PlantData object. The PlantData class uses the plantdata.from_entr method from the py-entr package to load data into OpenOA from the ENTR Warehouse.