25 April 2024

Graphical Representation: Graphics We Live By (Part III: Exchange Rates)

An exchange rate (XR) is the rate at which one currency will be exchanged for another currency, and thus XRs are used in everything related to trades, several processes in Finance relying on them. There are various sources for the XR like the European Central Bank (ECB) that provide the row data and various analyses including graphical representations varying in complexity. Conversely, XRs' processing offers some opportunities for learning techniques for data visualization. 

On ECB there are monthlyyearly, daily and biannually XRs from EUR to the various currencies which by triangulation allow to create XRs for any of the currencies involved. If N currencies are involved for one time unit in the process (e.g. N-1 XRs) , the triangulation generates NxN values for only one time division, the result being tedious to navigate. A matrix like the one below facilitates identifying the value between any of the currencies:


The table needs to be multiplied by 12, the number of months, respectively by the number of years, and filter allowing to navigate the data as needed. For many operations is just needed to look use the EX for a given time division. There are however operations in which is needed to have a deeper understanding of one or more XR's evolution over time (e.g. GBP to NOK). 

Moreover, for some operations is enough to work with two decimals, while for others one needs to use up to 6 or even more decimals for each XR. Occasionally, one can compromise and use 3 decimals, which should be enough for most of the scenarios. Making sense of such numbers is not easy for most of us, especially when is needed to compare at first sight values across multiple columns. Summary tables can help:

Statistics like Min. (minimum), Max. (maximum), Max. - Min. (range), Avg. (average) or even StdDev. (standard deviation) can provide some basis for further analysis, while sparklines are ideal for showing trends over a time interval (e.g. months).

Usually, a heatmap helps to some degree to navigate the data, especially when there's a plot associated with it:

In this case filtering by column in the heatmap allows to see how an XR changed for the same month over the years, while the trendline allows to identify the overall tendency (which is sensitive to the number of years considered). Showing tendencies or patterns for the same month over several years complements the yearly perspective shown via sparklines.

Fortunately, there are techniques to reduce the representational complexity of such numbers. For example, one can use as basis the XRs for January (see Base Jan), and represent the other XRs only as differences from the respective XR. Thus, in the below table for February is shown the XR difference between February and January (13.32-13.22=0.10). The column for January is zero and could be omitted, though it can still be useful in further calculations (e.g. in the calculation of averages) based on the respective data..

This technique works when the variations are relatively small (e.g. the values vary around 0). The above plots show the respective differences for the whole year, respectively only for four months. Given a bigger sequence (e.g. 24, 28 months) one can attempt to use the same technique, though there's a point beyond which it becomes difficult to make sense of the results. One can also use the year end XR or even the yearly average for the same, though it adds unnecessary complexity to the calculations when the values for the whole year aren't available. 

Usually, it's recommended to show only 3-5 series in a plot, as one can better distinguish the trends. However, plotting all series allows to grasp the overall pattern, if any. Thus, in the first plot is not important to identify the individual series but to see their tendencies. The two perspectives can be aggregated into one plot obtained by applying different filtering. 

Of course, a similar perspective can be obtained by looking at the whole XRs:

The Max.-Min. and StdDev (standard deviation for population) between the last and previous tables must match. 

Certain operations require comparing the trends of two currencies. The first plot shows the evolution NOK and SEK in respect to EUR, while the second shows only the differences between the two XRs:


The first plot will show different values when performed against other currency (e.g. USD), however the second plot will look similarly, even if the points deviate slightly:

Another important difference is the one between monthly and yearly XRs, difference depicted by the below plot:

The value differences between the two XR types can have considerable impact on reporting. Therefore, one must reflect in analyses the rate type used in the actual process. 

Attempting to project data into the future can require complex techniques, however, sometimes is enough to highlight a probable area, which depends also on the confidence interval (e.g. 85%) and the forecast length (e.g. 10 months):

Every perspective into the data tends to provide something new that helps in sense-making. For some users the first table with flexible filtering (e.g. time unit, currency type, currency from/to) is enough, while for others multiple perspectives are needed. When possible, one should  allow users to explore the various perspectives and use the feedback to remove or even add more perspectives. Including a feedback loop in graphical representation is important not only for tailoring the visuals to users' needs but also for managing their expectations,  respectively of learning what works and what doesn't.

Comments:
1) I used GBP to NOK XRs to provide an example based on  triangulation.
2) Some experts advise against using borders or grid lines. Borders, as the name indicates allow to delimitate between various areas, while grid lines allow to make comparisons within a section without needing to sway between broader areas, adding thus precision to our senses-making. Choosing grey as color for the elements from the background minimizes the overhead for coping with more information while allowing to better use the available space.
3) Trend lines are recommended where the number of points is relatively small and only one series is involved, though, as always, there are exceptions too. 
4) In heatmaps one can use a gradient between two colors to show the tendencies of moving toward an extreme or another. One should avoid colors like red or green.
5) Ideally, a color should be used for only one encoding (e.g. one color for the same month across all graphics), though the more elements need to be encoded, the more difficult it becomes to respect this rule. The above graphics might slightly deviate from this as the purpose is to show a representation technique. 
6) In some graphics the XRs are displayed only with two decimals because currently the technique used (visual calculations) doesn't support formatting.
7) All the above graphical elements are based on a Power BI solution. Unfortunately, the tool has its representational limitations, especially when one wants to add additional information into the plots. 
8) Unfortunately, the daily XR values are not easily available from the same source. There are special scenarios for which a daily, hourly or even minute-based analysis is needed.
9) It's a good idea to validate the results against the similar results available on the web (see the ECB website).

Previous Post <<||>> Next Post

20 April 2024

Power BI: Visual Calculations (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 20-Apr-2024

[feature] Visual Calculations (aka Visual Calcs)

  • {definition} a type of DAX calculation that's defined and executed directly on a visual [1]
  • {benefit} make it easier to create calculations that were previously hard to create
    • leads to simpler DAX, easier maintenance, and better performance [1]
    • reuse the results from its components [2]
  • still in public preview
    • ⇒ must be enabled manually
  • are stored on the visual
    • ⇒ can refer to any data in the visual 
      • incl. columns, measures, or other visual calculations [1]
    • ⇒ anything in the model must be added to the visual before the visual calculation can refer to it [1]
    •  ⇒ they can refer to the visual structure
      • ⇒ leads to more flexibility
  • combine the simplicity of context from calculated columns with the on-demand calculation flexibility from measures [1]
    • ⇐ the context is "visible"
  • operate on aggregated data instead of the detail level [1]
    • ⇒ leads to performance benefits
  • introduce a set of functions specific to visual calculations [1]
    • {category} medium-level functions
      • {function} COLLAPSE
        • the calculation is evaluated at a higher level of the axis [1]
      • {function} COLLAPSEALL
        • the calculation is evaluated at the total level of the axis [1]
      • {function} EXPAND
        • the calculation is evaluated at a lower level of the axis [1]
      • {function}EXPANDALL
        • the calculation is evaluated at the leaf level of the axis [1]
      • {function} FIRST
        • refers to the first row of an axis [1]
      • {function} ISATLEVEL
        • reports whether a specified column is present at the current level.
      • {function} LAST
        • refers to the last row of an axis [1]
      • {function} NEXT
        • refers to a next row of an axis [1]
      • {function} PREVIOUS
        • refers to a previous row of an axis [1]
      • {function} RANGE
        • refers to a slice of rows of an axis [1]
    • {category} high-level functions
      • {function} MOVINGAVERAGE
        • adds a moving average on an axis [1]
      • {function} RUNNINGSUM
        • adds a running sum on an axis  [1]
    • {category} low-level functions
      • ⇐ {exception} are available in standard DAX
      • {function} INDEX
      • {function} OFFSET
      • {function} RANK
      • {function} ROWNUMBER
      • {function}WINDOW
      • {function} ORDERBY
      • {function} PARTITIONBY
      • {function} MATCHBY
  • {default} most of them are evaluated row-by-row [1]
    • ⇐ like a calculated column
    • there's no need to add an aggregation function [1]
      •  it's better not to add such aggregates when they're not necessary [1]
  • {operation} create calculation
    • adds the visual calculation to the visual
  • {operation} hide calculation
    • calculations that aren't needed in the visual can be hidden [2]
  • {operation} copy calculation
    • copies the calculation between visuals and if intermediary steps are not there, they will be copied as well [planned] [2]
  • {feature} templates 
    • ready available calculation constructs 
    • {benefit} make it easier to write common calculations [1]
  • {feature|planned} support for Scanner API [2]
  • {parameter} axis 
    • influences how the visual calculation traverses the visual matrix [1]
    • {default} set to the first axis in the visual
    • {value} ROWS
      • the visual calculation is evaluated row-by-row in the visual matrix, from top to bottom. [1]
    • {value} COLUMNS
      • the visual calculation is evaluated row-by-row in the visual matrix, from left to right [1]
    • {value} ROWS COLUMNS
      • calculates vertically across rows from top to bottom, continuing column by column from left to right [1]
    • {value} COLUMNS ROWS
      • calculates horizontally across columns from left to right, continuing row by row from top to bottom [1]
    • {warning} not all visuals provide all axes, and some visuals provide no axes [1]
  • {parameter| reset 
    • influences if and when the function resets its value to 0 or switches to a different scope while traversing the visual matrix [1]
    • expects there to be multiple levels on the axis [1]
      • ⇐ use PARTITIONBY if there's only one level on the axis [1]
    • {value|default} NONE
      • means the visual calculation is never restarted [1]
    • {value} HIGHESTPARENT 
      • resets the calculation when the value of the highest parent on the axis changes [1]
    • {value} LOWESTPARENT 
      • resets the calculations when the value of the lowest parent on the axis changes [1]
    • {value} numerical value
      • refers to the fields on the axis, with the highest field being one [1]
  • {limitation} functions that rely on model relationships  aren't available
    • e.g. USERELATIONSHIP, RELATED or RELATEDTABLE
  • {limitation} not all visual types are supported [1]
    • ⇐ for the full list of limitations see [1]
  • {limitation} one can't filter on visual calculations [1]
  • {limitation} underlying data can't be exported [1]
  • {limitation} don't support conditional formatting

References
[1] Microsoft Learn (2024) Power BI: Using visual calculations [preview] (link)
[2] SSBI Central (2024) Visual Calculations - Making DAX easier, with Jeroen ter Heerdt(link)

19 April 2024

Power BI: Working with Visual Calculations (new feature)

Introduction

I recently watched a webcast with Jeroen (Jay) ter Heerdt (see [2]) in which he introduces visual calculations, a type of DAX calculation that's defined and executed directly on a visual [1]. Visual calculations provide an approach of treating a set of data much like an Excel table, allowing to refer to any field available on a visual and write formulas, which simplifies considerably the solutions used currently for ranking records, running averages and other windowing functions. 

The records behind a visual can be mentally represented as a matrix, while the visual calculations can refer to any column from the matrix, allowing to add new columns and include the respective columns in further calculations. Moreover, if a column is used in a formula, it's not recalculated as is the case of measures, which should improve the performance of DAX formulas considerably. 

Currently, one can copy a formula between visuals and if the formula contains fields not available in the targeted visual, they are added as well. Conversely, it's possible to build such a visual, copy it and then replace the dimension on which the analysis is made (e.g. Customer with Product), without being needed to make further changes. Unfortunately, there are also downsides: (1) the calculations are visible only within the visual in which were defined; (2) currently, the visual's data can't be exported if a visual calculation is added; (3) no formatting is supported, etc.

Ranking and Differences

I started to build a solution based on publicly available sales data, which offers a good basis for testing the use of visual calculations. Based on a Power BI visual table made of [Customer Name], [Sales Amount], [Revenue] and [Total Discount], I've added several calculations:

-- percentages
Sales % = 100*DIVIDE([Sales Amount], COLLAPSE([Sales Amount], ROWS))
Revenue % = 100*DIVIDE([Revenue],[Sales Amount])
Discount % = 100*DIVIDE([Total  Discount], [Total  Discount]+[Sales Amount])

-- rankings 
Rank Sales = Rank(DENSE, ORDERBY([Sales Amount], DESC))
Rank Revenue = Rank(DENSE, ORDERBY([Revenue], DESC))

-- differences between consecutive values
Diff. to Prev. Sales = IF([Rank Sales]>1, INDEX([Rank Sales]-1, , ORDERBY([Sales Amount], DESC)) - [Sales Amount] , BLANK())
Diff. to Prev. Rev. = IF([Rank Revenue]>1, INDEX([Rank Revenue]-1, , ORDERBY([Revenue], DESC)) - [Revenue] , BLANK())

Here's the output considered only for the first 10 records sorted by [Sales Amount]:

Customer Name Sales Amount Sales % Revenue Revenue % Total Discount Discount % Rank Sales Diff. to Prev. Sales. Rank Rev. Diff. to Prev. Rev.
Medline 1058923.78 3.76 307761.99 3.75 126601.02 10.68 1 1
Ei 707663.21 2.51 229866.98 2.8 95124.09 11.85 2 351260.57 2 77895.01
Elorac, Corp 702911.91 2.49 209078.76 2.55 83192.39 10.58 3 4751.3 6 20788.22
Sundial 694918.98 2.47 213362.1 2.6 78401.72 10.14 4 7992.93 4 -4283.34
OUR Ltd 691687.4 2.45 196396.26 2.4 78732.2 10.22 5 3231.58 10 16965.84
Eminence Corp 681612.78 2.42 213002.78 2.6 86904.03 11.31 6 10074.62 5 -16606.52
Apotheca, Ltd 667283.99 2.37 157435.56 1.92 101453.91 13.2 7 14328.79 31 55567.22
Rochester Ltd 662943.9 2.35 224918.2 2.74 81158.11 10.91 8 4340.09 3 -67482.64
ETUDE Ltd 658370.48 2.34 205432.79 2.51 89322.72 11.95 9 4573.42 9 19485.41
Llorens Ltd 646779.31 2.29 206567.4 2.52 82897.59 11.36 10 11591.17 8 -1134.61

Comments:
1) One could use [Total Amount] = [Total  Discount]+[Sales Amount] as a separate column.
2) The [Rank Sales] is different from the [Rank Rev.] because of the discount applied.
3) In the last two formulas a blank was considered for the first item from the ranking.
4) It's not possible to control when the totals should be displayed, however one can change the color for the not needed total to match the background.

Visualizing Differences 

Once the formulas are added, one can hide the basis columns and visualize the data as needed. To obtain the below chart I copied the visual and changed the column as follows:

Diff. to Prev. Rev. = IF([Rank Revenue]>1, [Revenue]- INDEX([Rank Revenue]-1, , ORDERBY([Revenue], DESC)) , [Revenue]) -- modified column

Differences Revenue between Customers

Comments:
1) Instead of showing the full revenue, the chart shows only the differences from the highest revenue, where the column in green is the highest revenue, while the columns in red are the differences of the current customer's revenue to the previous customer, as the data are sorted by the highest revenue. At least in this case it results in a lower data-ink ratio (see Tufte).
2) The values are sorted by the [Revenue] descending. 
3) Unfortunately, it's not possible to change the names from the legend.

Simple Moving Averages (SMAs)

Based on the [Sales Amount], [Revenue] and [Month] one can add the following DAX formulas to the table for calculating the SMA:

Sales Amount (SMA) = MOVINGAVERAGE([Sales Amount],6)
Revenue (SMA) = MOVINGAVERAGE([Revenue],6)

The chart becomes:


Comments:
1) Unfortunately, the formula can't project the values into the feature, at least not without the proper dates.
2) "Show items with not data" feature seems to be disabled when visual calculations are used.
3) The SMA was created via a template formula. Similarly, calculating a running sum is reduced to applying a formula:
Running Sales Amount = RUNNINGSUM([Sales Amount])

Wrap Up

It's easier to start with a table for the visual, construct the needed formulas and then use the proper visual while eliminating the not needed fields. 

The feature is still in public preview and changes can still occur. Unfortunately, there's still no information available on the general availability date. From the first tests, it provides considerable power with a minimum of effort, which is great! I don't want to think how long I would have needed to obtain the same results without it!

Happy coding!

References
[1] Microsoft Learn (2024) Power BI: Using visual calculations [preview] (link)
[2] SSBI Central (2024) Visual Calculations - Making DAX easier, with Jeroen ter Heerdt (link)

Power BI: Preparatory Steps for Creating a Power BI Report

When creating a Power BI report consider the following steps when starting the actual work. The first five steps can be saved to a "template" that can be reused as starting point for each report.

Step 0: Check Power BI Desktop's version

Check whether you have the latest version, otherwise you can download it from the Microsoft website.
Given that most of the documentation, books and other resources are in English, it might be a good idea to install the English version.

Step 1: Enable the recommended options

File >> Options and settings >> Options >> Global >> Data Load:
.>> Time intelligence >> Auto date/time for new files >> (uncheck)
.>> Regional settings >> Application language >> set to English (United States)
.>> Regional settings >> Model language >> set to English (United States)

You can consider upon case also the following options (e.g. when the relationships are more complex than the feature can handle):
File >> Options and settings >> Options >> Current >> Data load:
.>> Relationship >> Import relationships from data sources on first load >> (uncheck)
.>> Relationship >> Autodetect new relationships after data is loaded >> (uncheck)

Step 2: Enable the options needed by the report

For example, you can enable visual calculations:
File >> Options and settings >> Options >> Preview features >> Visual calculations >> (check)

Comment:
Given that not all preview features are stable enough, instead of activating several features at once, it might be a good idea to do it individually and test first whether they work as expected. 

Step 3: Add a table for managing the measures

Add a new table (e.g. "dummy" with one column "OK"):
Results = ROW("dummy", "OK")

Add a dummy measure:
Test = ""

Hide the "OK" column and with this the table is moved to the top. The measures can be further organized within folders for easier maintenance. 

Step 4: Add the Calendar if time analysis is needed

Add a new table (e.g. "Calendar" with a "Date" column):
Calendar = Calendar(Date(Year(Today()-3*365),1,1),Date(Year(Today()+1*365),12,31))

Add the columns:
Year = Year('Calendar'[Date])
YearQuarter = 'Calendar'[Year] & "-Q" & 'Calendar'[Quarter]
Quarter = Quarter('Calendar'[Date])
QuarterName = "Q" & Quarter('Calendar'[Date])
Month = Month('Calendar'[Date])
MonthName = FORMAT('Calendar'[Date], "mmm")

Format the Date in a standard format (e.g. dd-mmm-yy) including for Date/Time for which the Time is not needed.

To get the values in the visual sorted by the MonthName:
Table view >> (select MonthName) >> Column tools >> Sort by column >> (select Month)

To get the values in the visual sorted by the QuarterName:
Table view >> (select QuarterName) >> Column tools >> Sort by column >> (select Quarter)

With these changes the filter could look like this:


Step 5: Add the corporate/personal theme

Consider using a corporate/personal theme at this stage. Without this the volume of work that needs to be done later can increase considerably. 

There are also themes generators, e.g. see powerbitips.com, a tool that simplifies the process of creating complex theme files. The tool is free however, users can save their theme files via a subscription service.

Set canvas settings (e.g. 1080 x 1920 pixels).

Step 6: Get the data

Consider the appropriate connectors for getting the data into the report. 

Step 7: Set/Validate the relationships

Check whether the relationships between tables set by default are correct, respectively set the relationships accordingly.

Step 8: Optimize the data model

Look for ways to optimize the data model.

Step 9: Apply the formatting

Format numeric values to represent their precision accordingly.
Format the dates in a standard format (e.g. "dd-mmm-yy") including for Date/Time for which the Time is not needed.

Step 10: Define the filters

Identify the filters that will be used more likely in pages and use the Sync slicers to synchronize the filters between pages, when appropriate:
View >> Sync slicers >> (select Page name) >> (check Synch) >> (check Visible)

Step 11: Add the visuals

At least for report's validation, consider using a visual that holds the detail data as represented in the other visuals on the page. Besides the fact that it allows users to validate the report, it also provides transparence, which facilitates report's adoption. 

18 April 2024

Data Warehousing: Data(base) Mirroring in Microsoft Fabric (New feature)

Data Warehousing
Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon. 

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work. 

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed. 

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done. 

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]. 

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

Microsoft Fabric: Mirroring (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 18-Apr-2024

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

Mirroring

  • low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]
    • supported in Azure SQL Database, Azure Cosmos DB, Snowflake [1]
    • ⇒ replaces ETL pipelines
    • ⇐ manages the replication of data into OneLake and conversion to Parquet [1]
    • changes have a near real-time latency [3]
    • uses the source database’s CDC feature [4]
  • {benefit} allows users to use a highly integrated, end-to-end, and easy-to-use service [1]
    • the mirrored data can be used in Power BI
      • ⇐ since tables are all v-ordered delta tables [3]
  • {benefit} the delta tables can then be used in every Fabric experience, allowing users to accelerate their journey into Fabric [1]
    • {restriction} replication of views is currently not supported [3]
  • {benefit} allows to write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of Lakehouses in a single T-SQL query [1]
  • landing zone in OneLake 
    • stores both the snapshot and change data [3]
      • ⇐ improves performance when converting files into delta verti-parquet [3]
  • {operation} enable the mirroring 
    • by creating a secure connection to the operational data source
    • an entire database or individual tables can be replicated 
    • creates three items in the targeted workspace:
      • the mirrored database
      • a SQL analytics endpoint
      • a Default semantic model
    • the same source database can be mirrored multiple times
      • ⇐ though usually not needed
        • scenario: replicating data to different types of environments
  • {operation} stopping the mirroring
    • stops the replication in the source database, but a copy of the tables is kept in OneLake [3]
  • {operation} restarting the mirroring 
    • results in all data being replicated from the start [3]
  • {operation} remove a table from mirroring
    • the table is no longer replicated and its data is deleted from OneLake [3]
  • {operation} sharing
    • users grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]
      • access is also granted to the SQL analytics endpoint and associated default semantic model [1]
      • shared mirrored databases can be found through either
        • Data Hub
        • Shared with Me section in Microsoft Fabric
    • triggers an initial replication
  • {requirement} [licensing] requires Power BI Premium, Fabric Capacity, or Trial Capacity
  • {feature} monitoring
    • {benefit} allows to gain insights into mirroring operations and when the replica in Fabric OneLake was last refreshed [4]
  • [Azure SQL Database] 
    • {restriction} doesn't support Azure SQL Database logical servers behind an AVN or private networking [2]
      • {requirement} update the Azure SQL logical server firewall rules to Allow public network access [2]
      • {requirement} enable the Allow Azure services option to connect to your Azure SQL Database logical server [2]
    • {restriction}
      •  access through the Power BI Gateway or behind a firewall is unsupported [3]
    • authentication
      • SQL authentication with user name and password
      • Microsoft Entra ID
      • Service Principal
    • {troubleshooting} check if the changes properly flow
      • via sys.dm_change_feed_log_scan_sessions DMV
    • {troubleshooting} check if there are any problems reported
      • via sys.dm_change_feed_errors DMV
    • {troubleshooting} check if the mirroring was properly enabled
      • via sp_help_change_feed stored procedure 
    • {troubleshooting} disable mirroring
      • via sp_change_feed_disable_db stored procedure
Acronyms:
AVN - Azure Virtual Network
CDC - Change Data Capture
ETL - Extract, Transfer, Load
DMV - Dynamic Management View

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2023) (link)

Resources:
[1] Tutorial: Configure Microsoft Fabric mirrored databases from Azure SQL Database [Preview] (link)
[2] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

13 April 2024

R Language: Using the lessR Package in Microsoft Fabric's Notebooks (Test Drive)

I've started to use again the R languages for data visualizations. Discovering the lessR package, which simplifies considerably the verbose syntax of the R language by encapsulating the functionality behind simple functions, I wondered whether it can be installed in Microsoft Fabric and used from notebooks. Besides the available documentation, for learning I used also David W Berging's book on R visualizations [3].

Into a new notebook, I used one cell for each installation or package retrieval (see [1], [2]):

#installing packages
#install.packages("tidyverse") #is in Microsoft Fabric preinstalled
install.packages("lessR")

#retrieve packages from library
library("tidyverse")
library("lessR")

I attempted to read the data from a http location via the lessR Read function and it worked 

d <- Read("http://lessRstats.com/data/employee.xlsx")

head(d)

However, attempting to use any of the lessR functions used for visualization displayed only the text output and not the visualizations. No matter what I did - suppressing the text, suppressing the generation of PDF files, the result was the same. It seems to be a problem with the output device, though I'm not sure how to solve this yet. 

# supressing the text
style(quiet=TRUE)

# reenabling the text
style(quiet=FALSE)

# supressing PDF  generation
pdf(NULL)

# retrieving current device used 
options()$device

I was able to run the ggplot2 scripts from [3] though only when the lessR was also installed (each script should be run in its own cell, otherwise only the last plot is shown):

# bard charts 
ggplot(d) + geom_bar(aes(Dept)) 

# histogram
ggplot(d, aes(Salary)) + geom_histogram(binwidth=10000) 

# integrated violin/box/scatterplot
ggplot(d, aes(x="", y=Salary)) +
geom_violin(fill="gray90", bw=9500, alpha=.3) +
geom_boxplot(fill="gray75", outlier.color="black", width=0.25) +
geom_jitter(shape=16, position=position_jitter(0.05)) +
theme(axis.title.y=element_blank()) +
coord_flip()

# enhanced scatterplot 
ggplot(d, aes(Years, Salary)) + geom_point() +
geom_smooth(method=lm, color="black") +
stat_ellipse(type="norm") +
geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") +
geom_hline(aes(yintercept=mean(Salary), na.rm=TRUE), color="gray70")

Similar results could be obtained by using the following lessR syntax in RStudio:

# bard charts 
BarChart(Dept)

# histogram
Histogram(Salary) 

# integrated violin/box/scatterplot
Plot(Salary)

# enhanced scatterplot 
Plot(Years, Salary, enhance=TRUE)

Trying to see whether I can access the data from a lakehouse via SparkR, I've downloaded the file from the support website [3], loaded the data into an available lakehouse (e.g. UAT), respectiveley loaded the data to a new table:

-- creating the table
CREATE TABLE [dbo].[employee](
	[Name] [varchar](8000) NULL,
	[Years] [int] NULL,
	[Gender] [varchar](8000) NULL,
	[Dept] [varchar](8000) NULL,
	[Salary] [float] NULL,
	[JobSat] [varchar](8000) NULL,
	[Plan] [int] NULL,
	[Pre] [int] NULL,
	[Post] [int] NULL
) ON [PRIMARY]
GO

-- checking the data
SELECT *
FROM [dbo].[employee]

I was able to access the content of the imported file via the following script:

#access the file from lakehouse
#csv_file <- "https://onelake.dfs.fabric.microsoft.com/<file_system>/<account_name>/Files/OpenSource/employee.csv"
#csv_file <- "abfss://<file_system>.dfs.fabric.microsoft.com/<account_name>/Files/OpenSource/employee.csv"

csv_file <- "Files/OpenSource/employee.csv"

df <- read.df(csv_file, source= "csv", header = "true", inferSchema = "true")

display(df)

Initially, I wasn't able to access the table directly, though in the end I was able to retrieve the data (without and with the catalog's name):

# creating a data frame via SparkSQL
dfEmp <- sql("SELECT * FROM Employee")

head(dfEmp)

Comments:
1) Once the sessions timeout, it seems that one needs to rerun the scripts, which proves to be time-consuming as the installation takes about 5 minutes. 
2) Being able to use lessR directly in Microsoft Fabric could be a real win given its simple syntax. I run most of the tests from the book [3] plus some of the recommended scripts and the results are satisfactory. 
3) The connection via the ABFS path to the lakehouse works as well, but not via URL. 

References:
[1] Microsoft Learn - Microsoft Fabric (2023) R library management(link)
[2] lessR (2024) Data (link)
[3] David W Gerbing (2020) R Visualizations: Derive Meaning from Data
[4] CRAN=R (2024) Package lassR (link)

10 April 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Import data from CSV Files into Delta Table

Microsoft provides a set of labs and exercises that can be used to learn working with data in Fabric, however the real learning comes when one considers an example that introduces something new. As I've downloaded some time ago an archive with several datasets on Sales forecast from the Kaggle website, I tried to import the Features dataset in different ways and see how it goes.

Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
1,2010-02-05,42.31,2.572,NA,NA,NA,NA,NA,211.0963582,8.106,FALSE
1,2010-02-12,38.51,2.548,NA,NA,NA,NA,NA,211.2421698,8.106,TRUE
1,2010-02-19,39.93,2.514,NA,NA,NA,NA,NA,211.2891429,8.106,FALSE
1,2010-02-26,46.63,2.561,NA,NA,NA,NA,NA,211.3196429,8.106,FALSE
1,2010-03-05,46.5,2.625,NA,NA,NA,NA,NA,211.3501429,8.106,FALSE

Within an existing lakehouse, one can import the CSV as it is via 'Files/Upload' and once the data is imported, once can navigate to the file and use 'Load to Tables/New Table' to important the data into a managed table. Unfortunately, because some of the numeric fields include also literal values "NA" for the columns with NULLs, their data type is considered as varchar(8000), which is not ideal for calculations:

-- table created via Load to Tables
CREATE TABLE [dbo].[walmart_features](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [float] NULL,
	[Fuel_Price] [float] NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]

This could be fixed by replacing the NA values with an empty value, which I did and used this version for the next steps. 

I tried then using Spark to import the data, though then all the fields are defined as varchar(8000).

-- table created via Spark
CREATE TABLE [dbo].[walmart_features2](
	[Store] [varchar](8000) NULL,
	[Date] [varchar](8000) NULL,
	[Temperature] [varchar](8000) NULL,
	[Fuel_Price] [varchar](8000) NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [varchar](8000) NULL
) ON [PRIMARY]
GO

So, is needed to define the schema explicitly, however I had to import the IsHoliday as string and cast the value explicitly to a Boolean using a second data frame (see alternatives):

from pyspark.sql.types import *
from pyspark.sql.functions import *

#define schema
featuresSchema = StructType([
      StructField("Store", IntegerType())
    , StructField("Date", DateType())
    , StructField("Temperature",  DecimalType(13,2))
    , StructField("Fuel_Price", DecimalType(13,2))
    , StructField("MarkDown1", DecimalType(13,2))
    , StructField("MarkDown2", DecimalType(13,2))
    , StructField("MarkDown3", DecimalType(13,2))
    , StructField("MarkDown4", DecimalType(13,2))
    , StructField("MarkDown5", DecimalType(13,2))
    , StructField("CPI", DecimalType(18,6))
    , StructField("Unemployment", DecimalType(13,2))
    , StructField("IsHoliday", StringType())
])

# Load a file into a dataframe
df = spark.read.load('Files/OpenSource/features2.csv'
    , format='csv'
    , schema = featuresSchema
    , header=True)

# do the conversion for isHoliday
df2 = df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType())) # Save the dataframe as a delta table df2.write.format("delta").saveAsTable("walmart_features3")

Now, table's definition looks much better:

-- table created via Spark with explicit schema
CREATE TABLE [dbo].[walmart_features3](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [decimal](13, 2) NULL,
	[Fuel_Price] [decimal](13, 2) NULL,
	[MarkDown1] [decimal](13, 2) NULL,
	[MarkDown2] [decimal](13, 2) NULL,
	[MarkDown3] [decimal](13, 2) NULL,
	[MarkDown4] [decimal](13, 2) NULL,
	[MarkDown5] [decimal](13, 2) NULL,
	[CPI] [decimal](18, 6) NULL,
	[Unemployment] [decimal](13, 2) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]
GO

Comments:
(1) I tried to apply the schema change directly on the initial data frame, though the schema didn't change:

df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType()))

(2) For the third method one could have left the NA in because by the conversion a NULL will be considered. Conversely, it might be needed to check if there are other values that fail the conversion. 

(3) The following warning in the Notebook when running the above code is a hint that something went wrong during the conversion (e.g. decimals were cut): 

"Your file(s) might include corrupted records"

(4) Especially for the transformed values it makes sense to look at the values (at last when the dataset isn't too big):

-- validating the values for the Boolean data field
SELECT IsHoliday
, count(*) NoRecords
FROM dbo.walmart_features3
GROUP BY IsHoliday
ORDER BY 1

(5) The tables can be deleted directly in the lakehouse or via PySpark (observe the catalog.table_name):

#dropping the table
spark.sql('DROP TABLE IF EXISTS UAT.walmart_features3') 

At the beginning probably it makes sense to remove "IF EXISTS" to make sure that the table is available.

(6) For those who run into a similar issue, the SQL Endpoint for the lakehouse is read only, therefore attempting to delete a table via SSMS will result in such an error:

Drop failed for Table 'dbo.walmart_features2'.  (Microsoft.SqlServer.Smo)
The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Drop' was denied on the requested resource.
Cannot drop the table 'walmart_features2', because it does not exist or you do not have permission.

Happy coding!

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.