

Getting Started with Azure Databricks - A Census Data Example


This notebook is a simple example of working with data in Azure Databricks.

If you are reading this on the wiki, you can find the working Notebook at the following path: Getting Started with Azure Databricks - A Census Data Example

You can explore filesystems directly through Notebooks by using the dbutils.fs client.

Below we explore the CSVs for the census night population counts:

display(dbutils.fs.ls("/"))
path | name | size
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/ | data/ | 0
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/ | tables/ | 0
display(dbutils.fs.ls("abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/"))
path | name | size
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/ | data/ | 0
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/ | tables/ | 0
display(dbutils.fs.ls("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz"))
path | name | size
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv | Data8317.csv | 857219761
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv | DimenLookupAge8317.csv | 2720
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupArea8317.csv | DimenLookupArea8317.csv | 65400
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupEthnic8317.csv | DimenLookupEthnic8317.csv | 272
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupSex8317.csv | DimenLookupSex8317.csv | 74
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupYear8317.csv | DimenLookupYear8317.csv | 67

As you can see above, the ADLS Gen2 container for the project (abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/) is set as the default filesystem for clusters associated with that project (in this case sandbox).
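
To illustrate (a sketch): because the container is the default filesystem, a relative path and the fully qualified abfss:// form of the same path list the same directory.

# These two calls list the same directory when the container is the default filesystem
display(dbutils.fs.ls("/data"))
display(dbutils.fs.ls("abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/data"))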

We can read the source CSVs into Spark as DataFrames and explore the data.

We use the DataFrame API to transform the DataFrames.

df = spark.read.csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
display(df.limit(10))
_c0 | _c1 | _c2 | _c3 | _c4 | _c5
Year | Age | Ethnic | Sex | Area | count
2018 | 000 | 1 | 1 | 01 | 807
2018 | 000 | 1 | 1 | 02 | 5109
2018 | 000 | 1 | 1 | 03 | 2262
2018 | 000 | 1 | 1 | 04 | 1359
2018 | 000 | 1 | 1 | 05 | 180
2018 | 000 | 1 | 1 | 06 | 741
2018 | 000 | 1 | 1 | 07 | 633
2018 | 000 | 1 | 1 | 08 | 1206
2018 | 000 | 1 | 1 | 09 | 2184

The datatypes and headers look incorrect: every column has been read as a string and the header row has come through as data. We can fix this by passing in reader options.

df = spark.read.option("header", True).option("inferSchema", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
display(df.limit(10))
Year | Age | Ethnic | Sex | Area | count
2018 | 0 | 1 | 1 | 1 | 807
2018 | 0 | 1 | 1 | 2 | 5109
2018 | 0 | 1 | 1 | 3 | 2262
2018 | 0 | 1 | 1 | 4 | 1359
2018 | 0 | 1 | 1 | 5 | 180
2018 | 0 | 1 | 1 | 6 | 741
2018 | 0 | 1 | 1 | 7 | 633
2018 | 0 | 1 | 1 | 8 | 1206
2018 | 0 | 1 | 1 | 9 | 2184
df.count()
Out[11]: 34959673

Let’s take a look at a couple of the dimension tables too.

df = spark.read.option("header", True).option("inferSchema", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv")
display(df.limit(10))
Code | Description | SortOrder
999999 | Total people - age group | 1
888 | Median age | 2
1 | Under 15 years | 3
2 | 15-29 years | 4
3 | 30-64 years | 5
4 | 65 years and over | 6
1 | 0-4 years | 7
2 | 5-9 years | 8
3 | 10-14 years | 9
4 | 15-19 years | 10
df = spark.read.option("header", True).option("inferSchema", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupEthnic8317.csv")
display(df.limit(10))
Code | Description | SortOrder
9999 | Total people | 1
1 | European | 2
2 | Maori | 3
3 | Pacific Peoples | 4
4 | Asian | 5
5 | Middle Eastern/Latin American/African | 6
6 | Other ethnicity | 7
61 | New Zealander | 10
69 | Other ethnicity nec | 11
77 | Total people stated | 8

Observation: the dimension tables seem to have a consistent schema.

Since we are using Apache Spark, we want to limit the number of joins (and the data transferred between nodes) at query time. Filters and aggregations suit the architecture better, and the data will be stored in columnar-compressed files, so it makes sense to denormalise the data.

Let’s join all the data into one large DataFrame:

from pyspark.sql.functions import col
denorm_df = spark.read.option("header", True).option("inferSchema", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
for dim in ["Age", "Area", "Ethnic", "Sex", "Year"]:
    dim_df = spark.read.option("header", True).option("inferSchema", True).csv(f"/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookup{dim}8317.csv")
    denorm_df = (
        denorm_df.join(dim_df, col(dim) == col("Code"))
        .drop("Code", dim)
        .withColumnRenamed("Description", dim)
        .withColumnRenamed("SortOrder", f"{dim}SortOrder")
    )
display(denorm_df.limit(10))
count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
807 | Less than one year | 28 | Northland Region | 4 | European | 2 | Male | 2 | 2018 | 3
5109 | Less than one year | 28 | Auckland Region | 5 | European | 2 | Male | 2 | 2018 | 3
2262 | Less than one year | 28 | Waikato Region | 6 | European | 2 | Male | 2 | 2018 | 3
1359 | Less than one year | 28 | Bay of Plenty Region | 7 | European | 2 | Male | 2 | 2018 | 3
180 | Less than one year | 28 | Gisborne Region | 8 | European | 2 | Male | 2 | 2018 | 3
741 | Less than one year | 28 | Hawke's Bay Region | 9 | European | 2 | Male | 2 | 2018 | 3
633 | Less than one year | 28 | Taranaki Region | 10 | European | 2 | Male | 2 | 2018 | 3
1206 | Less than one year | 28 | Manawatu-Wanganui Region | 11 | European | 2 | Male | 2 | 2018 | 3
2184 | Less than one year | 28 | Wellington Region | 12 | European | 2 | Male | 2 | 2018 | 3
177 | Less than one year | 28 | West Coast Region | 16 | European | 2 | Male | 2 | 2018 | 3
denorm_df.count()
Out[18]: 48585735

Pre-join count: 34959673

Post-join count: 48585735

The counts look incorrect: the dimension joins are only lookups and should not produce additional rows.

Let’s look into why this has happened.

Hypothesis: the code column shouldn’t be inferred as an integer column.

With schema inference:

df = spark.read.option("header", True).option("inferSchema", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv")
display(df.limit(10))
Code | Description | SortOrder
999999 | Total people - age group | 1
888 | Median age | 2
1 | Under 15 years | 3
2 | 15-29 years | 4
3 | 30-64 years | 5
4 | 65 years and over | 6
1 | 0-4 years | 7
2 | 5-9 years | 8
3 | 10-14 years | 9
4 | 15-19 years | 10

Without schema inference:

df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv")
display(df.limit(10))
Code | Description | SortOrder
999999 | Total people - age group | 1
888 | Median age | 2
1 | Under 15 years | 3
2 | 15-29 years | 4
3 | 30-64 years | 5
4 | 65 years and over | 6
01 | 0-4 years | 7
02 | 5-9 years | 8
03 | 10-14 years | 9
04 | 15-19 years | 10
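
To confirm the hypothesis, we can check the inferred schema and look for duplicate join keys (a sketch, reusing the age lookup path from above):

from pyspark.sql.functions import col

age_path = "/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv"
inferred_df = spark.read.option("header", True).option("inferSchema", True).csv(age_path)

print(inferred_df.dtypes)  # Code is inferred as an integer, so "01" collapses to 1

# Duplicate join keys after inference are what multiplied rows in the join
display(inferred_df.groupBy("Code").count().filter(col("count") > 1))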

Schema inference shot us in the foot! 🦶🔫

Try again without inferring any datatypes (we can manually cast later!):

from pyspark.sql.functions import col
denorm_df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
for dim in ["Age", "Area", "Ethnic", "Sex", "Year"]:
    dim_df = spark.read.option("header", True).csv(f"/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookup{dim}8317.csv")
    denorm_df = (
        denorm_df.join(dim_df, col(dim) == col("Code"))
        .drop("Code", dim)
        .withColumnRenamed("Description", dim)
        .withColumnRenamed("SortOrder", f"{dim}SortOrder")
    )
denorm_df.count()
Out[24]: 34885323

Pre-join count: 34959673

Post-join count: 34885323

Closer, but it looks like we lost a few rows this time.

Let’s try a left join:

from pyspark.sql.functions import col
denorm_df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
for dim in ["Age", "Area", "Ethnic", "Sex", "Year"]:
    dim_df = spark.read.option("header", True).csv(f"/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookup{dim}8317.csv")
    denorm_df = (
        denorm_df.join(dim_df, col(dim) == col("Code"), how="left")
        .drop("Code", dim)
        .withColumnRenamed("Description", dim)
        .withColumnRenamed("SortOrder", f"{dim}SortOrder")
    )
denorm_df.count()
Out[26]: 34959672

Bingo! The counts match (almost exactly: we are a single row out). However, needing a left join implies something doesn't join to its dimension lookup.

Let’s hunt for nulls:

display(denorm_df.limit(10))
count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
807 | Less than one year | 28 | Northland Region | 4 | European | 2 | Male | 2 | 2018 | 3
5109 | Less than one year | 28 | Auckland Region | 5 | European | 2 | Male | 2 | 2018 | 3
2262 | Less than one year | 28 | Waikato Region | 6 | European | 2 | Male | 2 | 2018 | 3
1359 | Less than one year | 28 | Bay of Plenty Region | 7 | European | 2 | Male | 2 | 2018 | 3
180 | Less than one year | 28 | Gisborne Region | 8 | European | 2 | Male | 2 | 2018 | 3
741 | Less than one year | 28 | Hawke's Bay Region | 9 | European | 2 | Male | 2 | 2018 | 3
633 | Less than one year | 28 | Taranaki Region | 10 | European | 2 | Male | 2 | 2018 | 3
1206 | Less than one year | 28 | Manawatu-Wanganui Region | 11 | European | 2 | Male | 2 | 2018 | 3
2184 | Less than one year | 28 | Wellington Region | 12 | European | 2 | Male | 2 | 2018 | 3
177 | Less than one year | 28 | West Coast Region | 16 | European | 2 | Male | 2 | 2018 | 3
denorm_df_nulls = denorm_df.filter(col("Age").isNull() | col("Area").isNull() | col("Ethnic").isNull() | col("Sex").isNull() | col("Year").isNull())
denorm_df_nulls.count()
Out[29]: 74349
display(denorm_df_nulls.limit(10))
count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
50 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
48.5 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
49.2 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
29.5 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
42.3 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
36 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
21.2 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
24.9 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
23.3 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
26.6 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
denorm_df_nulls.filter(col("Age") == "Median age").count()
Out[31]: 74349

All the unmatched rows come from the median age category.
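
One quick way to verify that claim (a sketch, reusing denorm_df_nulls from above):

# Every unmatched row should fall under the "Median age" description
display(denorm_df_nulls.groupBy("Age").count())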

We should take some time to understand our data!

display(spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupAge8317.csv").limit(10))
Code | Description | SortOrder
999999 | Total people - age group | 1
888 | Median age | 2
1 | Under 15 years | 3
2 | 15-29 years | 4
3 | 30-64 years | 5
4 | 65 years and over | 6
01 | 0-4 years | 7
02 | 5-9 years | 8
03 | 10-14 years | 9
04 | 15-19 years | 10

Let’s filter the fact table by the top two codes as they look odd:

df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv").filter(col("Age").isin(["999999", "888"]))
display(df.limit(10))
Year | Age | Ethnic | Sex | Area | count
2018 | 999999 | 1 | 1 | 01 | 65307
2018 | 999999 | 1 | 1 | 02 | 418347
2018 | 999999 | 1 | 1 | 03 | 169422
2018 | 999999 | 1 | 1 | 04 | 110733
2018 | 999999 | 1 | 1 | 05 | 13566
2018 | 999999 | 1 | 1 | 06 | 60591
2018 | 999999 | 1 | 1 | 07 | 49086
2018 | 999999 | 1 | 1 | 08 | 92655
2018 | 999999 | 1 | 1 | 09 | 186054
2018 | 999999 | 1 | 1 | 12 | 15735
display(df.filter(col("Age") == "888").limit(10))
Year | Age | Ethnic | Sex | Area | count
2018 | 888 | 1 | 1 | 01 | 46.4
2018 | 888 | 1 | 1 | 02 | 38.5
2018 | 888 | 1 | 1 | 03 | 40
2018 | 888 | 1 | 1 | 04 | 43.4
2018 | 888 | 1 | 1 | 05 | 41.4
2018 | 888 | 1 | 1 | 06 | 43.6
2018 | 888 | 1 | 1 | 07 | 40.6
2018 | 888 | 1 | 1 | 08 | 40.8
2018 | 888 | 1 | 1 | 09 | 38.4
2018 | 888 | 1 | 1 | 12 | 47.7

It’s getting a little hard to trace columns and codes, so let’s denormalise whilst retaining the code columns:

from pyspark.sql.functions import col
denorm_df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
for dim in ["Age", "Area", "Ethnic", "Sex", "Year"]:
    dim_df = spark.read.option("header", True).csv(f"/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookup{dim}8317.csv")
    denorm_df = (
        denorm_df.join(dim_df, col(dim) == col("Code"), how="left")
        .drop("Code")
        .withColumnRenamed(dim, f"{dim}Code")
        .withColumnRenamed("Description", dim)
        .withColumnRenamed("SortOrder", f"{dim}SortOrder")
    )
denorm_df_nulls = denorm_df.filter(col("Age").isNull() | col("Area").isNull() | col("Ethnic").isNull() | col("Sex").isNull() | col("Year").isNull())
display(denorm_df_nulls.limit(10))
YearCode | AgeCode | EthnicCode | SexCode | AreaCode | count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
2018 | 888 | 100100 | 1 | 1 | 50 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 1 | 2 | 48.5 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 1 | 9 | 49.2 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 2 | 1 | 29.5 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 2 | 2 | 42.3 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 2 | 9 | 36 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 3 | 1 | 21.2 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 3 | 2 | 24.9 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 3 | 9 | 23.3 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 4 | 1 | 26.6 | Median age | 2 | null | null | null | null | null | null | 2018 | 3

The area code doesn’t match.

Let’s dig deeper:

display(spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupArea8317.csv").filter(col("Code").isin(["1", "01"])))
Code | Description | SortOrder
01 | Northland Region | 4

It seems some of the fact table's area codes don't look up correctly (the fact table has 1 where the lookup has 01).
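
To enumerate every fact-table area code with no lookup entry, a left anti join works well (a sketch, reusing the same file paths):

from pyspark.sql.functions import col

base = "/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz"
fact_df = spark.read.option("header", True).csv(f"{base}/Data8317.csv")
area_df = spark.read.option("header", True).csv(f"{base}/DimenLookupArea8317.csv")

# left_anti keeps only the left-hand rows that have no match on the right
unmatched_areas = fact_df.select("Area").distinct().join(area_df, col("Area") == col("Code"), how="left_anti")
display(unmatched_areas)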

Let’s also look at sex:

display(denorm_df_nulls.limit(10))
YearCode | AgeCode | EthnicCode | SexCode | AreaCode | count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
2018 | 888 | 100100 | 1 | 1 | 50 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 1 | 2 | 48.5 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 1 | 9 | 49.2 | Median age | 2 | null | null | null | null | Male | 2 | 2018 | 3
2018 | 888 | 100100 | 2 | 1 | 29.5 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 2 | 2 | 42.3 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 2 | 9 | 36 | Median age | 2 | null | null | null | null | Female | 3 | 2018 | 3
2018 | 888 | 100100 | 3 | 1 | 21.2 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 3 | 2 | 24.9 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 3 | 9 | 23.3 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
2018 | 888 | 100100 | 4 | 1 | 26.6 | Median age | 2 | null | null | null | null | null | null | 2018 | 3
display(spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookupSex8317.csv"))
Code | Description | SortOrder
9 | Total people - sex | 1
1 | Male | 2
2 | Female | 3

Conclusion: the data looks a little odd and needs investigating further. For now, let's continue with the denormalisation; since we are keeping the raw CSVs, we can fix these issues later.
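
When we do come back to fix the datatypes, an explicit cast is straightforward (a sketch; note that values that cannot be parsed as numbers, such as suppressed counts, simply become null):

from pyspark.sql.functions import col

# Cast count explicitly instead of relying on schema inference
typed_df = denorm_df.withColumn("count", col("count").cast("double"))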

Creating output files, Hive databases and tables


When writing out DataFrames, we can write them out in various formats, and optionally add a Hive table over these files.

Hive tables are ‘virtual’ SQL tables over stored data (in this case, data stored in ADLS). We can use either:

  • External tables: create tables over existing data
  • Hive-managed tables: create tables and data at the same time

Both options produce the same end result and differ only subtly: dropping a Hive-managed table also deletes the underlying data files, whereas dropping an external table leaves them in place.

When we create Hive tables, we are really writing out the DataFrame to ADLS and adding a schema and file path reference to Hive.

Let’s create a Hive database:

%sql
create database if not exists sandbox;
use sandbox;

Now, let’s create our final DataFrame we would like to write out:

denorm_df = spark.read.option("header", True).csv("/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/Data8317.csv")
for dim in ["Age", "Area", "Ethnic", "Sex", "Year"]:
    dim_df = spark.read.option("header", True).csv(f"/data/raw/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/DimenLookup{dim}8317.csv")
    denorm_df = (
        denorm_df.join(dim_df, col(dim) == col("Code"), how="left")
        .drop("Code")
        .withColumnRenamed(dim, f"{dim}Code")
        .withColumnRenamed("Description", dim)
        .withColumnRenamed("SortOrder", f"{dim}SortOrder")
    )

We can write the files out directly as Parquet (with no Hive table):

denorm_df.write.mode("overwrite").parquet("/data/derived/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz_denorm/")
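
If we later want a Hive table over those existing files, we could add an external table (a sketch; the table name census_denorm_external is hypothetical):

%sql
create table if not exists sandbox.census_denorm_external
using parquet
location '/data/derived/stats_nz_census/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz_denorm/';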

Or we can write the files out (by default in Parquet) and create a Hive table:

denorm_df.write.mode("overwrite").saveAsTable("sandbox.age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz")

We can now query the Hive table using SQL:

%sql
select * from sandbox.age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz
limit 10
YearCode | AgeCode | EthnicCode | SexCode | AreaCode | count | Age | AgeSortOrder | Area | AreaSortOrder | Ethnic | EthnicSortOrder | Sex | SexSortOrder | Year | YearSortOrder
2018 | 110 | 61 | 9 | 120300 | 0 | 110 years | 138 | West Harbour Clearwater Cove | 317 | New Zealander | 10 | Total people - sex | 1 | 2018 | 3
2018 | 110 | 69 | 1 | 120300 | ..C | 110 years | 138 | West Harbour Clearwater Cove | 317 | Other ethnicity nec | 11 | Male | 2 | 2018 | 3
2018 | 110 | 69 | 2 | 120300 | ..C | 110 years | 138 | West Harbour Clearwater Cove | 317 | Other ethnicity nec | 11 | Female | 3 | 2018 | 3
2018 | 110 | 69 | 9 | 120300 | 0 | 110 years | 138 | West Harbour Clearwater Cove | 317 | Other ethnicity nec | 11 | Total people - sex | 1 | 2018 | 3
2018 | 110 | 61 | 1 | 120400 | ..C | 110 years | 138 | Unsworth Heights East | 318 | New Zealander | 10 | Male | 2 | 2018 | 3
2018 | 110 | 61 | 2 | 120400 | ..C | 110 years | 138 | Unsworth Heights East | 318 | New Zealander | 10 | Female | 3 | 2018 | 3
2018 | 110 | 61 | 9 | 120400 | 0 | 110 years | 138 | Unsworth Heights East | 318 | New Zealander | 10 | Total people - sex | 1 | 2018 | 3
2018 | 110 | 69 | 1 | 120400 | ..C | 110 years | 138 | Unsworth Heights East | 318 | Other ethnicity nec | 11 | Male | 2 | 2018 | 3
2018 | 110 | 69 | 2 | 120400 | ..C | 110 years | 138 | Unsworth Heights East | 318 | Other ethnicity nec | 11 | Female | 3 | 2018 | 3
2018 | 110 | 69 | 9 | 120400 | 0 | 110 years | 138 | Unsworth Heights East | 318 | Other ethnicity nec | 11 | Total people - sex | 1 | 2018 | 3

And we have three ways of opening the Hive table as a DataFrame:

df_s = spark.sql("select * from sandbox.age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz")
df_h = spark.read.table("sandbox.age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz")
df_p = spark.read.parquet("/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/")

The underlying Hive-backed table metadata and data files look like the following:

%sql
describe formatted sandbox.age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz;
col_name | data_type | comment
YearCode | string | null
AgeCode | string | null
EthnicCode | string | null
SexCode | string | null
AreaCode | string | null
count | string | null
Age | string | null
AgeSortOrder | string | null
Area | string | null
AreaSortOrder | string | null
Ethnic | string | null
EthnicSortOrder | string | null
Sex | string | null
SexSortOrder | string | null
Year | string | null
YearSortOrder | string | null
# Detailed Table Information
Database | sandbox
Table | age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz
Owner | root
Created Time | Thu Nov 12 02:32:15 UTC 2020
Last Access | UNKNOWN
Created By | Spark 3.0.0
Type | MANAGED
Provider | parquet
Location | abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz
Serde Library | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
display(dbutils.fs.ls("/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/"))
path | name | size
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/_SUCCESS | _SUCCESS | 0
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/_committed_5626976569612752113 | _committed_5626976569612752113 | 724
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/_started_5626976569612752113 | _started_5626976569612752113 | 0
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00000-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-247-1-c000.snappy.parquet | part-00000-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-247-1-c000.snappy.parquet | 5614085
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00001-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-248-1-c000.snappy.parquet | part-00001-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-248-1-c000.snappy.parquet | 5036655
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00002-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-249-1-c000.snappy.parquet | part-00002-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-249-1-c000.snappy.parquet | 9889226
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00003-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-250-1-c000.snappy.parquet | part-00003-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-250-1-c000.snappy.parquet | 8736544
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00004-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-251-1-c000.snappy.parquet | part-00004-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-251-1-c000.snappy.parquet | 9473689
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00005-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-252-1-c000.snappy.parquet | part-00005-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-252-1-c000.snappy.parquet | 9039443
abfss://sandbox@aueprddlsnzlh001.dfs.core.windows.net/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/part-00006-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-253-1-c000.snappy.parquet | part-00006-tid-5626976569612752113-51097d54-ac44-42e0-b210-cf5d6502cbf9-253-1-c000.snappy.parquet | 2995091

We now have a virtual Hive table we can query using SQL, and we can create a DataFrame using an SQL query, a reference to the Hive table or by reading the underlying Parquet files.

And since the underlying files are Snappy-compressed Parquet, the total size has gone from roughly 800 MB of CSV (about 100 MB compressed) to around 50 MB of Parquet files (even though we denormalised!).
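
As a quick sanity check (a sketch), we can total the file sizes reported by dbutils.fs.ls:

files = dbutils.fs.ls("/tables/sandbox.db/age_and_sex_by_ethnic_group_census_night_population_counts_2006_2013_2018_nz/")
print(round(sum(f.size for f in files) / 1024 / 1024, 1), "MB")  # roughly 50 MB of Parquet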