Inter Rater Reliability Study with Cohen’s Kappa and Fleiss’ Kappa

April 23, 2018 at 2:43 PM | Steven Wang

Posted in: NLP


Magic Quadrant for Business Intelligence and Analytics Platforms 3 Years in a Row

February 5, 2016 at 9:11 PM | Steven Wang

Compared with the past two years, the latest published Gartner Magic Quadrant for Business Intelligence and Analytics Platforms shows some very interesting changes.

Only three big players stay in the first group: Tableau, Qlik and Microsoft. Given that Microsoft did a fantastic job on the new Power BI platform, I'm not surprised that Microsoft has woken up and is rising in this space. I was surprised that MicroStrategy, IBM, SAP and Information Builders have fallen out of the first group. In particular, Oracle has dropped off the radar.

This is getting more and more interesting, and I look forward to seeing how all these players do this year.


Using Shiny Server to Deploy Interactive Visualizations, Part 1: Exploratory Data Analysis of Red Wine Data

February 3, 2016 at 4:11 AM | Steven Wang



Tackling the Windows Installation Issue of R packages: slidify, slidifyLibraries, and rCharts

January 8, 2016 at 11:55 PM | Steven Wang

Slidify is an R package that helps create and publish beautiful HTML5 presentations from R Markdown. It is fantastic for creating reproducible presentations.

slidifyLibraries contains the external resources used by Slidify.

rCharts is an R package for creating, customizing, and publishing interactive JavaScript visualizations from R using a familiar lattice-style plotting interface.

All three packages are created and maintained by Ramnath Vaidyanathan of McGill University. None of them is on CRAN, so they need to be installed from GitHub using Hadley Wickham's devtools package, along these lines:

library(devtools)
install_github("ramnathv/slidify")
install_github("ramnathv/slidifyLibraries")
install_github("ramnathv/rCharts")

As Ramnath mentioned in one of his Slidify workshops, while installation from GitHub is relatively painless for Mac and Linux users, Windows users are not so lucky. I have to confess that when I used the above code to install these packages it caused chaos for me: not only were the packages not installed properly, but some of my other installed packages were broken as well!

Thiago Silva’s post, Installing Slidify on a Windows Machine, describes how he successfully worked around the problem. Building on his post, I will give a step-by-step guide that I have used to install these three packages on both Windows 7 and Windows 10 machines.

1. Download the package zip files:

a. Download the slidify zip file from GitHub

b. Download the slidifyLibraries zip file from GitHub

c. Download the rCharts zip file from GitHub

2. Put the zip files into a folder, say, c:/packages

3. Unzip the packages. You will notice that each folder name has a “-master” suffix, which needs to be removed as below:

4. Now we will use the build utility from the devtools package to convert each package source directory into a single bundled file. With binary = FALSE, this creates a tar.gz package that can be installed on any platform, provided a full development environment is available. Build the packages with the code below:


build("C:/packages/slidify/", binary=FALSE) # Change C:/packages/ to your folder name

build("C:/packages/slidifyLibraries/", binary=FALSE) # Change C:/packages/ to your folder name

build("C:/packages/rCharts/", binary=FALSE) # Change C:/packages/ to your folder name

5. Once the packages are built successfully, use the code below, or alternatively RStudio, to install them:

a. R script code:

# Change C:/packages/ to your folder name

install.packages("C:/packages/slidify_0.5.tar.gz", repos = NULL, type="source")

# Change C:/packages/ to your folder name

install.packages("C:/packages/slidifyLibraries_0.3.1.tar.gz", repos = NULL, type="source")

# Change C:/packages/ to your folder name

install.packages("C:/packages/rCharts_0.4.5.tar.gz", repos = NULL, type="source")

b. Install from RStudio: go to the menu Tools > Install Packages… as below:

This successfully installs all three packages (slidify, slidifyLibraries, and rCharts) in a Windows environment.

I hope this helps and thanks for reading.

Posted in: R


A Simple Way of Quantiling in T-SQL Using a Window Function and the CEILING Function

December 11, 2015 at 6:35 AM | Steven Wang

In business analysis, it is very common to break sales into quantile groups, which can then be used for segmentation purposes. The most commonly used quantiles are:

  • Quintiles, which create 5 equal-size groups
  • Deciles, which create 10 equal-size groups

Occasionally, ventiles (20 groups) and percentiles (100 groups) will be used.

Quantiling is particularly powerful when two dimensions of sales data are available, such as product sales and market sales. In business it is very common to use a quantile matrix for account segmentation. An example segmentation definition can be based on a 2 x 2 deciling matrix like the one below:

Even though this is a very common process, getting quantile values in SQL Server is not a straightforward calculation. There are many ways to do it, but I found that using a window function in conjunction with the CEILING function is the easiest. Let's try it; I will start with a deciling process.

The first step in calculating a decile value for a specific object (account, product, etc.) is to calculate the cumulative values and then the cumulative percentage.

This generates the cumulative product sales value and its cumulative percentage. The top 10 records are shown below:

Now the question is how to assign a decile value based on the cumulative percentage. There are many ways to do it, but I found a very simple one using the SQL CEILING function:

You can spot-check the results:

Based on this deciling calculation, we can generalize the quantiling calculation to any quantile number between 1 and 100.
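The actual T-SQL lives in the attached script; as a quick illustration of the arithmetic, here is a small Python sketch of the same logic (the function name and sample figures are my own, not from the original script): sort the values in descending order, accumulate, and take the ceiling of the cumulative percentage times the number of groups.

```python
import math

def quantile_groups(values, n_groups=10):
    """Assign each value a quantile group (1..n_groups) by taking the
    ceiling of its cumulative percentage times the number of groups,
    mirroring the CEILING-based T-SQL approach."""
    total = sum(values)
    running = 0
    result = []
    for v in sorted(values, reverse=True):  # biggest sellers first
        running += v
        cum_pct = running / total
        result.append((v, math.ceil(cum_pct * n_groups)))
    return result

# Quartiling three accounts: the top account covers half of total sales,
# so it already falls into group 2 of 4.
print(quantile_groups([4, 2, 2], n_groups=4))  # → [(4, 2), (2, 3), (2, 4)]
```

Swapping n_groups between 5, 10, 20 and 100 gives quintiles, deciles, ventiles and percentiles, which is exactly the generalization described above.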

I hope this tip is useful; please see the attached script if you want to give it a try.


Quantile_Calculation_Script.sql (1KB)

Posted in: T-SQL | Window Functions


Prediction of Correctness on Barbell Lifting Exercise Based on Accelerometer Data

November 24, 2015 at 2:45 AM | Steven Wang

Posted in: Data Cleaning | Big Data | Data Load | Machine Learning | Random Forest | RPART | Predictive Analytics | R


Embedded Power View Report Rendering Issue in Different Browsers

March 18, 2015 at 2:19 AM | Steven Wang

The Power View embedded report view is indeed a very nice feature and opens up big opportunities for ISVs, companies, and individuals to share data and visualizations. Please check the blog post Embed Power View Interactive Reports in Your Blogs and Websites for more details on how to embed a Power View report.

While this is great news, I also found some frustrations. The embedded view renders nicely in IE 9, but there are issues with IE 11 and the other main browsers.

I have put together the main issues I found. If you have the same issues, or different ones, please share your experience with me. Thanks.

Posted in: Data Explorer | Power BI | Power View


SSRS 2012: Report Server cannot load the TERADATA/SQLPDW Extension Error

May 8, 2013 at 3:49 PM | Steven Wang


In the SQL Server 2012 Reporting Services server event log, you might see these two errors:

1. Report Server cannot load the SQLPDW Extension

2. Report Server cannot load the TERADATA Extension

If you are not using the TERADATA/SQLPDW functionality in your reports, you can basically ignore these errors. They occur because the TERADATA and SQLPDW data extensions are turned on by default in the report server configuration file.

Even if you are not using the TERADATA/SQLPDW functionality, it is still a good idea to suppress the errors being logged to the event log.

Open the Reporting Services configuration file, rsreportserver.config, under your SSRS installation folder. If you installed SSRS 2012 with the defaults, the file will be under:

C:\Program Files\Microsoft SQL Server\MSRS11.MSSQLSERVER\Reporting Services\ReportServer\

If you locate the Data section, you will see the TERADATA and SQLPDW data extensions (highlighted below) turned on.

If you comment out these two extensions as below and save the file, those error messages will no longer be generated.
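As a sketch of what that edit looks like (the Type attribute values are elided here; leave whatever your file already contains untouched), the Data section ends up with the two entries wrapped in an XML comment:

```xml
<Data>
  <!-- ... other data extension entries stay as they are ... -->
  <!--
  <Extension Name="TERADATA" Type="..." />
  <Extension Name="SQLPDW" Type="..." />
  -->
</Data>
```

Restart the Reporting Services service for the change to take effect.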



SQL Server 2012 MDS Minimal Windows Server Requirements

April 23, 2013 at 4:49 AM | Steven Wang

In order for SQL Server 2012 MDS to function properly on Windows Server 2008 R2, the following minimum Windows features must be present.

1. Add the Application Server role

The following role services for the Application Server role need to be included:

• .NET Framework 3.5.1

• Web Server (IIS) support

• HTTP Activation

As illustrated in the diagram below (ticked services):


2. Add the Web Server (IIS) role

Make sure that the yellow highlighted services are ticked.

3. Check additional Windows features

Make sure the Features listed below (yellow highlighted) are installed:
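For reference, the same roles and features can also be added from an elevated PowerShell prompt instead of the Server Manager UI. This is only a sketch assuming the standard Windows Server 2008 R2 feature names; verify them with Get-WindowsFeature before running:

```powershell
# Load the Server Manager cmdlets (Windows Server 2008 R2)
Import-Module ServerManager

# 1. Application Server role with .NET 3.5.1, IIS support and HTTP Activation
Add-WindowsFeature Application-Server, AS-NET-Framework, AS-Web-Support, AS-HTTP-Activation

# 2. Web Server (IIS) role
Add-WindowsFeature Web-Server

# 3. Review which features are now installed
Get-WindowsFeature | Where-Object { $_.Installed }
```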

Posted in: Master Data Service


Script out Indexes and Index Option Settings for a SQL Server Database

December 13, 2012 at 3:58 PM | Steven Wang

Sometimes we need to script out index definitions and their option settings. If you have source control, you are lucky, as all index definitions should be kept in your source control system. If you only want to script out the indexes for one table, you can also do so by selecting all its indexes in the Object Explorer Details pane. But a query that can easily script out the index definitions and settings makes testing and index-creation automation much easier.

In SQL Server, the index definition and option settings are not stored in a single table; you have to join a few system tables and DMVs together to get all the information. Here I will share the script I use to script out all the indexes.

My consideration:

1. It only applies to user-defined indexes;

2. Clustered indexes created by a primary key constraint are not included, though it can be done;

3. Spatial indexes and XML indexes are not considered;

4. The data compression setting for partitioned indexes is simplified, as handling it fully in a single SELECT statement is pretty complicated;

5. The DROP_EXISTING and SORT_IN_TEMPDB settings are hard-coded in the script; change them to suit your situation.

The script is as below:

Select		A.[object_id]
			, OBJECT_NAME(A.[object_id]) AS Table_Name
			, A.Index_ID
			, A.[Name] As Index_Name
			, CAST(
					Case	When A.type = 1 AND is_unique = 1 Then 'Create Unique Clustered Index '
							When A.type = 1 AND is_unique = 0 Then 'Create Clustered Index '
							When A.type = 2 AND is_unique = 1 Then 'Create Unique NonClustered Index '
							When A.type = 2 AND is_unique = 0 Then 'Create NonClustered Index '
					End
					+ quotename(A.[Name]) + ' On ' + quotename(S.[Name]) + '.' + quotename(OBJECT_NAME(A.[object_id])) + ' ('
					+ Stuff((
								Select	',[' + COL_NAME(A.[object_id],C.column_id)
										+ Case When C.is_descending_key = 1 Then '] Desc' Else '] Asc' End
								From	sys.index_columns C WITH (NOLOCK)
								Where	A.[Object_ID] = C.object_id
										And A.Index_ID = C.Index_ID
										And C.is_included_column = 0
								Order by C.key_Ordinal Asc
								For XML Path('')
							),1,1,'') + ') '

					+ CASE WHEN A.type = 1 THEN ''
							ELSE Coalesce('Include ('
									+ Stuff((
												Select	',' + QuoteName(COL_NAME(A.[object_id],C.column_id))
												From	sys.index_columns C WITH (NOLOCK)
												Where	A.[Object_ID] = C.object_id
														And A.Index_ID = C.Index_ID
														And C.is_included_column = 1
												Order by C.index_column_id Asc
												For XML Path('')
											),1,1,'') + ') '
						,'') End
					+ Case When A.has_filter = 1 Then 'Where ' + A.filter_definition Else '' End
					+ ' With (Drop_Existing = OFF, SORT_IN_TEMPDB = ON'
					--when the same index already exists you'd better set Drop_Existing = ON
					--SORT_IN_TEMPDB = ON is recommended, but base it on your own environment
					+ ', Fillfactor = ' + Cast(Case When fill_factor = 0 Then 100 Else fill_factor End As varchar(3))
					+ Case When A.[is_padded] = 1 Then ', PAD_INDEX = ON' Else ', PAD_INDEX = OFF' END
					+ Case When D.[no_recompute] = 1 Then ', STATISTICS_NORECOMPUTE = ON' Else ', STATISTICS_NORECOMPUTE = OFF' End
					+ Case When A.[ignore_dup_key] = 1 Then ', IGNORE_DUP_KEY = ON' Else ', IGNORE_DUP_KEY = OFF' End
					+ Case When A.[ALLOW_ROW_LOCKS] = 1 Then ', ALLOW_ROW_LOCKS = ON' Else ', ALLOW_ROW_LOCKS = OFF' END
					+ Case When A.[ALLOW_PAGE_LOCKS] = 1 Then ', ALLOW_PAGE_LOCKS = ON' Else ', ALLOW_PAGE_LOCKS = OFF' End
					+ Case When P.[data_compression] = 0 Then ', DATA_COMPRESSION = NONE'
							When P.[data_compression] = 1 Then ', DATA_COMPRESSION = ROW'
							Else ', DATA_COMPRESSION = PAGE' End
					+ ') On '
					+ Case when C.type = 'FG' THEN quotename(C.[Name])
							ELSE quotename(C.[Name]) + '(' + F.Partition_Column + ')' END + ';' --if it uses a partition scheme then the partition column is needed
			As nvarchar(Max)) As Index_Create_Statement
			, C.[Name] AS FileGroupName
			, 'DROP INDEX ' + quotename(A.[Name]) + ' On ' + quotename(S.[Name]) + '.' + quotename(OBJECT_NAME(A.[object_id])) + ';' AS Index_Drop_Statement
From		SYS.Indexes A WITH (NOLOCK)
				Inner Join
			sys.objects B WITH (NOLOCK)
				ON	A.object_id = B.object_id
				Inner Join
			SYS.schemas S
				ON	B.schema_id = S.schema_id
				Inner Join
			SYS.data_spaces C WITH (NOLOCK)
				ON	A.data_space_id = C.data_space_id
				Inner Join
			SYS.stats D WITH (NOLOCK)
				ON	A.object_id = D.object_id
					AND A.index_id = D.stats_id
				Inner Join
			--The code below finds out which data compression type is used by the index. If an index is not partitioned, this is easy, as only one data compression
			--type can be used. If the index is partitioned, then each partition can be configured to use a different data compression. This is hard to generalize;
			--for simplicity, I just use the data compression type used by most of the index partitions for all partitions. You can later rebuild an index partition to
			--the appropriate data compression type you want to use.
			(
				select object_id, index_id, Data_Compression, ROW_NUMBER() Over(Partition By object_id, index_id Order by COUNT(*) Desc) As Main_Compression
				From sys.partitions WITH (NOLOCK)
				Group BY object_id, index_id, Data_Compression
			)	P
				ON	A.object_id = P.object_id
					AND A.index_id = P.index_id
					AND P.Main_Compression = 1
				Outer APPLY
			(
				SELECT	COL_NAME(A.object_id, E.column_id) AS Partition_Column
				From	sys.index_columns E WITH (NOLOCK)
				WHERE	E.object_id = A.object_id
						AND E.index_id = A.index_id
						AND E.partition_ordinal = 1
			) F
Where		A.type IN (1,2) --clustered and nonclustered
			AND B.Type != 'S'
			AND is_primary_key = 0 --not for primary key constraints
			AND OBJECT_NAME(A.[object_id]) not like 'queue_messages_%'
			AND OBJECT_NAME(A.[object_id]) not like 'filestream_tombstone_%'
			AND OBJECT_NAME(A.[object_id]) not like 'sys%' --if you have indexes starting with sys then remove this filter
OPTION (Recompile);


I have also attached a SQL script file here:

Scriptout_Indexes.sql (4.95 kb)