Quick data profiling solution

Data profiling brings a lot of benefits and provides a first step toward better understanding your data structure and quality. There are a few dedicated tools, but they come with associated costs. What if I told you that you might already have access to a data profiling solution? Well, if you are using Excel 2010 or higher, you are in luck. Installing the free Power Query add-in for the 2010 and 2013 versions will give you the two simple functions I cover below. Excel 2016 and higher, including the Microsoft 365 version, already come with Power Query built in.

You can apply the following data profiling solution to almost any data source. This includes MySQL, SQL Server, Oracle, SAP HANA, Azure, SharePoint lists, Salesforce reports or objects, ODBC, OData feeds, XML, JSON, simple Excel or CSV files, and others.

To make it clear, this is NOT a paid advertisement and the views on this solution are my own.

Without further ado, here is a quick and inexpensive way to perform data profiling on a particular data set:

Table.Profile function

Description: This Power Query function returns a table with a total of 8 columns. The first lists the columns in your data source and the remaining 7 describe each of them: min, max, average, standard deviation, count, null count, and distinct count.

How to use it: Create a new “Blank Query” and enter the following formula in the formula bar:

=Table.Profile(#"name of the query you are profiling")

[Screenshot: the Table.Profile function in the formula bar]
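If you prefer pasting a complete query into the Advanced Editor, here is a minimal sketch that loads a CSV file and profiles it end to end. The file path, delimiter, and step names are my own placeholders, so substitute your own source:

let
    // Load the CSV file; this path is a placeholder
    Source = Csv.Document(File.Contents("C:\data\cost-of-living.csv"), [Delimiter = ",", Encoding = 65001]),
    // Promote the first row to column headers
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // One profile row per column: min, max, average, standard deviation,
    // count, null count, and distinct count
    Profiled = Table.Profile(Promoted)
in
    Profiled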

Example:

I’m using the Cost of Living Index for Selected US Cities data set, and after separating the urban area into city and state, I’m running a quick profile to better understand the state of this data. Even at first glance, I’m spotting the following:

  1. In the city field, there is an “Akron OH” value, which obviously includes the state as well
  2. The “Grocery Items” average is well below the rest, which might be correct, but is worth analyzing further
  3. There is a value missing in both the “Grocery Items” and “State” columns (the one-line filter below pulls up those rows)
  4. There are 58 unique “State” values, clearly indicating a data quality issue

[Screenshot: Table.Profile results for the cost of living data set]
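As a quick follow-up on finding #3, a one-line filter surfaces the rows behind those missing values so you can inspect them. This sketch assumes the profiled query is named “Cost of Living” and uses the column names from this example:

=Table.SelectRows(#"Cost of Living", each [State] = null or [Grocery Items] = null)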

To keep track of your data quality efforts, here is a free template of a data quality issues log.

Table.Schema function

Description: This function returns a table describing the columns of the table you are profiling. Each row describes the properties of one column, including its type name, type kind, nullability, numeric precision, numeric scale, maximum length, and a few others. Feel free to read about each of them in detail in Microsoft’s Power Query documentation.

How to use it: Create a new “Blank Query” and enter the following formula in the formula bar:

=Table.Schema(#"name of the query you are profiling")

[Screenshot: the Table.Schema function in the formula bar]
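And here is a minimal end-to-end sketch for an OData source; the feed URL is a placeholder, and I’m trimming the output to the schema properties discussed in this example:

let
    // Connect to the feed; this URL is a placeholder
    Source = OData.Feed("https://example.com/odata/PayStubs"),
    // One schema row per column of the feed
    Schema = Table.Schema(Source),
    // Keep just the properties referenced in the findings below
    Trimmed = Table.SelectColumns(Schema, {"Name", "TypeName", "Kind", "IsNullable"})
in
    Trimmed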

Example:

I’m using a free PayStub OData feed, which describes information on pay stubs, such as their ID, pay period and date, amount, deducted amount, employee ID, and so forth. Just glancing at the results of the schema function, I’m spotting the following:

  1. “Amount”, “GrossPay” and “Deduction” are of the Decimal type and maybe they need to be Currency
  2. The Decimal Number columns might need to be changed to the Fixed Decimal Number type in order to improve compression and query performance
  3. “Period”, which denotes the payment time period the pay stub is for, is currently of the Text type, so I’ll have to convert it to DateTime (a sketch of these type fixes follows the screenshot below)
  4. “PersonID” is also Text which, depending on the rules for generating this unique employee ID, might have to be an Integer
  5. “PersonID” allows null values, which should not be allowed if you are only providing pay stubs to employees

[Screenshot: Table.Schema results for the PayStub OData feed]
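Findings #1 through #3 boil down to a single Table.TransformColumnTypes step. This is only a sketch: the query name PayStubs is a placeholder, and Currency.Type is Power Query’s fixed decimal number type, which covers both the Currency and the compression suggestions above:

=Table.TransformColumnTypes(PayStubs, {
    {"Amount", Currency.Type},
    {"GrossPay", Currency.Type},
    {"Deduction", Currency.Type},
    {"Period", type datetime}
})
// PersonID is left as Text here; add {"PersonID", Int64.Type} only if the
// ID generation rules guarantee integer values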

Within five minutes, I’m spotting potential data architecture issues that could have an impact on data quality, data usage, or data integration efforts.

Conclusion

For professional data profiling techniques, I recommend dedicated tools, but I wanted to share this alternative as it provides a quick and easy solution when you don’t have access to such tools. The schema and profile functions give you a fast read on the state of the data set you are planning to work with. They can point you in the right direction when assessing data quality issues, help you compare schemas for data integration tasks, or check whether the data types comply with the business requirements.
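On the schema comparison point, here is a minimal sketch, assuming two queries named QueryA and QueryB (both placeholders), that lines up their column types side by side and keeps only the differences:

let
    // Reduce each schema to column name and type
    SchemaA = Table.SelectColumns(Table.Schema(QueryA), {"Name", "TypeName"}),
    SchemaB = Table.RenameColumns(
        Table.SelectColumns(Table.Schema(QueryB), {"Name", "TypeName"}),
        {{"Name", "Name.B"}, {"TypeName", "TypeName.B"}}
    ),
    // Full outer join on column name, so columns missing on either side still show up
    Compared = Table.Join(SchemaA, "Name", SchemaB, "Name.B", JoinKind.FullOuter),
    // Keep columns that are missing on one side or typed differently
    Mismatches = Table.SelectRows(Compared, each not Value.Equals([TypeName], [TypeName.B]))
in
    Mismatches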

Have you used this before? Was it sufficient to get you started?

 

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

About the author 

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
