...
root¶
This scenario uses two CSV data sources:
a dataset with NYC Taxi trips in January 2015
a second dataset of NYC weather data for the same period
The aim is to produce a line chart enabling us to progressively observe the possible relationship between cab activity and precipitation levels. After starting ProgressiVis, create a CSV loader for weather data as follows:
Select "CSV loader" in Loader list
Enter the alias "Weather"
press "Create"
...
Weather¶
In order to create the loader for the weather data you have to:
- Enter and "sniff" the CSV file this way:
- If the URL
https://www.aviz.fr/nyc-taxi/nyc_weather_2015.csvis present in theBookmarkslist, select it, otherwise copy ans paste it in theNew URLsfield. - Open the sniffer by pressing
Sniff - Uncheck
Enable/disable allcheckbox
- If the URL
- Select and configure the necessary columns:
Datecolumn:- Select
Datefrom thePColumnlist. Details are then displayed then in theSelectedPColumnframe - check the
Usecheckbox. The other fields in the frame are then unlocked - select
datetimein theRetypeselect list
- Select
PrecipitationIncolumn- Select
PrecipitationInin thePColumnlist. Details are displayed. - check the
Usecheckbox. The other fields in the frame appear. - select
float64in theRetypeselect list - check the "NA values" checkbox. In the field with the same name, enter "T".
- Select
Explanation¶
The PrecipitationIn column contains numerical values except for cells containing the letter "T" which stands for trace precipitation amount or snow depth (an amount too small to measure, usually < 0.005 inches water equivalent) (appears instead of numeric value) according to the Local Climatological Data (LCD) Dataset Documentation.
With the above transformations, the PrecipitationIn column will be of type float64 and the "T" value will become NaN.
Press Test to view a sample of data.
Select the Hide tab to hide the samples.
At this point, optionally you can save the settings for later. by modifying (or not) the default file name (w202...) then press Save settings
You can start loading with the Start loading button.
To continue building the scenario, we'll need to add a 2nd data source, cab fares.
Select Next stage: CSV loader and enter the name Taxis as optional alias, then press Chain it.
Constructor.widget('Weather', 0)
Taxis¶
Like above:
If the URL
https://www.aviz.fr/nyc-taxi/yellow_tripdata_2015-01.csv.bz2is present in theBookmarkslist, select it, otherwise copy and paste it in theNew URLsfield.Open the sniffer by pressing
SniffUncheck
Enable/disable allcheckboxSelect
tpep_pickup_datetimefrom thePColumnlist (the single column needed here).check the
Usecheckbox.select
datetimein theRetypeselect listOptionally save settings for later use as above.
Now we're going to join the two data sources we've already created.Select
Next stage:Jointhen pressChain it
Constructor.widget('Taxis', 0)
Join¶
ProgressiVis currently only implements 1 to N joins (covering the particular case of 1 to 1 joins).
In addition, ProgressiVis represents the datetime type as a row vector of size 6 of the form YMDhms.
This allows a join to be made on part of two datetime columns, for example on the YMD elements.
In the current case, the Weather table contains one row per day (so the Date column will be of the form [Y, M, D, 0, 0, 0]), while the Taxis table contains one row per trip, with the tpep_pickup_datetime column having all 6 YMDhms elements fully populated.
By making a join between the two tables on the YMD subcolumns of the two columns mentioned, we'll associate all the races of a day with the corresponding weather line.
The join will be an "inner join", as orphan elements in one of the two tables, if they exist, are of no interest for our scenario.
In the interface, we call the table corresponding to part "1" in the "1 to N" relationship "Primary Table" and the table corresponding to part "N" "Relative Table".
In concrete terms, in our case, the Weather table will be primary and the Taxis table will be relative.
In the first frame, on the left (Inputs) select the 2nd table to be joined (Weather) and on the right assign the roles (Weather => Primary, Taxis => Related) and confirm with the "OK" button.
In the next frame, we'll define the join that will be performed on the YMD subcolumns of the Weather.Date and Taxis.tpep_pickup_datetime columns.
The checkboxes in the "Keep" column can be used to exclude certain unnecessary columns from the join (but we don't need them here).
Keep the default value How: innerand press Start
Constructor.widget('Join', 0)
# progressivis-snippets
import numpy as np
@register_function
@np.vectorize
def rain_level(val: float) -> str:
if np.isnan(val) or val < 0.07:
return "No"
if val < 0.19:
return "Light"
return "Rain"
Rich view¶
Alas, the information contained in the columns of the join is not in a form suited to the intended processing. Fortunately, ProgressiVis allows you to create calculated (or virtual) columns in addition to the stored columns of a table.
A calculated column is an object that applies a function to one or more columns in the same table and provides the same interface as a stored column.
To take this a step further, we're now going to create a join view, which we'll enrich with 3 calculated columns required for further processing.
Two of these (hour() and year_day()) are standard datetime functions and are already available in the ProgressiVis library.
The third (rain_level()) is a function specific to our application. It transforms quantitative precipitation data into categorical data ("Rain", "Light" and "No") more suited to the Line chart visualization we're planning.
But before creating the view, we'll define the rain_level() function (see cell above).
A few explanations on the coding conventions used in this cell:
the comment
# progressivis-snippetson the first line of the cell is important because it tells theipyprogressivisloader that this cell must be reexecuted if the scenario is replayed.the
@register_functiondecorator registers the function with the various widgets supposed to provide it to the analyst (via a select list, for example)the
@vectorize decoratoris part ofnumpyand is described hereIn this implementation, the precipitation level "T", (converted in
NaN), will be categorized as "No".To create the computed view, select
Next stage:Computed viewwith the aliasRich viewthen pushChain it.Each computed column is defined by associating a column (the list on the left) and a function (the list in the middle).
The calculated column appears on the right. The default name is intended to avoid naming conflicts, but we recommend changing it to a shorter, more expressive name if possible.
You can also change the column type if the one proposed is not appropriate.
When all is OK, check the
Usecheckbox.The three columns have to be created as follows:
tpep_pickup_datetime+hour()=>pickup_hourtpep_pickup_datetime+year_day()=>pickup_year_dayPrecipitationIn+rain_level()=>rain_levelAmong the stored columns to be kept in the view, we keep
PrecipitationIn(because the view must contain at least one stored columns).When everything is ready, click
Apply.To move on to the next stage, click
Next stage:Group bythenChain it.
Constructor.widget('Rich view', 0)
Group by¶
This step builds an index to group together the rows produced previously that share the same values for the pair (pickup_hour, rain_level).
This index will be used in the next stage to aggregate the values thus selected.
To proceed with aggregation, select Next stage: Aggregate then Chain it.
Constructor.widget('Group by', 0)
Aggregate¶
The displayed matrix maps columns (vertically) to operations (horizontally) with one exception: the first row of the matrix named "RECORD" represents not a particular column but the entire row of the input table.
The only possible operation on rows is their count. We count rows sharing the same pair (pickup_hour, rain_level), which will produce the _count column in the output table.
A second necessary aggregation is the nunique operation on the pickup_year_day column, which will count the distinct values of days sharing the same pair (pickup_hour, rain_level).
The output will be a table containing 24 time intervals * 3 rain levels = 92 rows
Before displaying, a normalization step is required. This will be performed via a new computed view.
To proceed with normalization, select Next stage: Computed view with the alias Norm, then Chain it.
Constructor.widget('Aggregate', 0)
Norm¶
The"_count" column of the aggregation shows the number of pickups performed for each pair (pickup_hour, rain_level).
As it stands, this value cannot be used to measure cab activity, as it depends on the meteorological factor (the number of rainy days etc.).
It therefore needs to be normalized by dividing this value (_count) by the total number of distinct days in each pair (pickup_hour, rain_level).
In the interface, we now select the two columns _count and pickup_year_day_nunique and the function //. We'll associate each function parameter with a column as follows:
numerator
numwith column_countthe denominator
denwith the columnpickup_year_day_nunique.Rename the new column
trips_per_dayand check theUsecheckbox.We also keep the
pickup_hourandrain_levelcolumns in the result.We now have all the elements we need for display.
To do this, select
Next stage:Any vegawith the aliasLine chartthenChain it.
Constructor.widget('Norm', 0)
Line chart¶
If the schema line_chart.json is present in the Schema list, select it, otherwise you can obtain it here and then copy it directly into the JSon editor.
Use the Fetch info button to extract the names of the fields present in the schema and associate them with the columns of the input table as follows:
x=>pickup_houry=>trips_per_daycolor=>rain_levelthen press
Apply.
Constructor.widget('Line chart', 0)