SSIS Merge Join – Inconsistent results

SSIS merge join can produce inconsistent results when joined on varchar columns.
This is primarily because of different collation settings. Because we have set the IsSorted as “TRUE” but it rows are not sorted as per SSIS, so it gives inconsistent results.

SSIS uses windows collation (Latin1_General_CS_AS_WS) where as SQL (Case-Insensitive) can use either of windows or SQL specific collation. In other words, the sort method using ORDER BY clause is different from using SORT transform with SSIS because SSIS uses case-sensitive where as SQL uses case-insensitive. To compare apples-to-apples, SSIS Sort and SQL sort be COLLATED to same.

So when a join is done on the columns with different collations the results can be quite unpredictable.

It mostly works and quite quirky when dealing with varchar columns,

Troubleshooting steps :

1) check issorted and sortkeyposition in the connection manager

2) check the datatypes, length and case sensitivity of the columns

3) RTRIM & LTRIM the source columns to avoid any additional space issues

4) if the sort is done in SQL, use collate statement in the query (isSorted=true need not be set in this case, however sorykeyposition need to be set for the column)

[cc lang=”sql”]

SELECT name, value from names
ORDER BY name
COLLATE Latin1_General_CS_AS_WS

[/cc]

5) if the sort is done in SSIS, (SET issorted = TRUE, sortkeyposition = 1),

set comparison flags = ignorecase, in the sort transformation editor, set the Comparison Flags property to Ignore case.

Historically when I had these problems, the best solution was avoid merge join, use “UNION ALL” transform, then use a conditional split to look for the NULL rows and then continue with the needed transforms.

Observations

1) Setting the IsSorted and SortKeyPosition properties is not sufficient as these properties dont set the sort, so be mindful of the data itself.

2) Cannot collate GUID, so cast them to nvarchar (SSIS with uniqueidentifiers is known to have problems)

Link : The link has connect bug from MS and isnt fixed yet.

 

Merge Join with non-equality operator in SSIS

In TSQL, it is common to encounter scenarios where a join is done between 2-3 data sources and usually done using CTE’s, sub-queries or temptables to do a date range lookups using BETWEEN operator when a baseline date is between min and max dates from different source.

[cc LANG=”TSQL”]

SELECT c.id, o.value
FROM customer c inner join offer o
on c.registereddate BETWEEN o.begindate AND o.enddate

[/cc]

Now it becomes another challenge if both the sources are NOT on the same server or cannot be joined by the linked server because of security restrictions in the environment.

Now if the same problem has to be solved in SSIS, here are few options :

1) sort the sources by dates, merge join, conditional split for date ranges.

2) sort the sources by dates, merge join, script transform for date ranges.

3) use lookup with partial cache (it will have performance impact for a larger dataset) – also called as range lookup.

Setting IsSorted true

IsSorted SSIS

SortKey Position 1

SortKey Position 1

 

Limitations :

1) Please remember that merge join is a blocking transformation (in memory activity) because it has SORT attached to it.

2) MaxBuffersPerInput needs to be tweaked depending on the number of sources that are inputs for the join.

Advantages :

1) The output of merge join also comes out as sorted so subsequent operations can use another merge join in the downstream workflow.

Output Flat File using BCP (without output messages )

To extract CSV file from SQL server, BCP commands canbe used but they will output a line for every 1k rows. This might cause a delay in generating file for a million record table.

At the same time it can be handled using xp_cmdshell but no every envoronment would like to enable for security reasons.

So redirecting to NUL helps in our case.

bcp “select * from [calendar].[Auxiliary].[Calendar]” queryout “C:\temp\cal.csv” -T -t”,” -c 1>NUL

[cc LANG=”SQL]

bcp “select * from [calendar].[Auxiliary].[Calendar]” queryout “C:\temp\sales.csv” -T -t”,” -c 1>NUL
[/cc]

BCP_output

Badass : Making users awesome

Recently I was reading the book “Badass : making users awesome”, here is an excerpt of the book.

Screenshot_2

Screenshot_2 Screenshot_3 Screenshot_4 Screenshot_5 Screenshot_6 Screenshot_7 Screenshot_8 Screenshot_9 Screenshot_10 Screenshot_11 Screenshot_12 Screenshot_13 Screenshot_14 Screenshot_15 Screenshot_16 Screenshot_17 Screenshot_18 Screenshot_19 Screenshot_20 Screenshot_21 Screenshot_22 Screenshot_23 Screenshot_24 Screenshot_25 Screenshot_26Screenshot_1

In-memory(IMDB) Databases – Overview

An in-memory database (IMDB) relies on storing the data in main memory rather than regular storage models like disk system.

Database cache mechanisms might help but it will benefit the read of data and write still has to go to disk.

Another solution to host an entire database in RAM disk might not be that helpful as it still involves multiple transfers of data to/from copying to cache and thus increasing CPU cycles.

Advantages :

  • working with the data in memory is much faster than writing the data to and reading from a disk (fastest data retrieval speeds)
  • Overall reduction in CPU and memory requirements because they execute fewer CPU instructions (eliminates the seek times when querying the data)
  • Avoids the burden on FILE IO
  • performance gains of ~100 times observed with IMDBs as per various TPC benchmarks
  • Scalable up to 1TB range
  • It can be either embedded or client/server DBMS

Industry Scenarios:

  • Embedded applications, set-top boxes that have small memory & CPU footprint
  • algorithmic trading, e-commerce websites

Usage Scenarios:

  • Online betting company bwin was able to support 12000 bets/sec to 150000 bets/sec.
  • ConAgra use of SAP HANA helped deepter insights into analyssi

Durability property of ACID is maintained by usage of Snapshots, checkpoints, Transaction logs, HR solutions (replication, automatic fail over)

Some of popular in-memory databases are dashDB from IBM, MemSQL, SQLite, REDIS, SAP HANA, SQL Server Hekaton, Oracle EXASystems.

Interesting screenshot that captures Oracle in-memory technology with SAP HANA below :

Oracle vs SAP HANA

 

 

 

Traditional BI architecture and data latency issues

In BI world, the traditional analytics model take more effort and cost to implement as compared to recent trends in the analytics world.

Traditional BI Architecture

Now the issues with this architecture is that :

  • Data has to be sourced
  • ETL has to be performed to create DW or datamaet
  • Create OLAP/Cubes
  • Then able to pull the data for reporting purposes.
  • No real time analytics and clearly there is data latency

Hence the recent trend is shifting to give the power to business owners to have the ability to slice and dice the data using power query as information workers ETL tool. Then surface the reports using power view or power maps.

Traditional BI Architecture 2

SSRS – Sys.WebForms.PageRequestManagerServerErrorException: An unknown error occurred while processing the request on the server. The status code returned from the server was: 400

Recently stumbled upon this error when viewing SSRS reports in native mode.

Sys.WebForms.PageRequestManagerServerErrorException: An unknown error occurred while processing the request on the server. The status code returned from the server was: 400

Fix is to modify web.config in report manager and report server folders.

</system.web>
<appSettings>
<add key=”aspnet:IgnoreFormActionAttribute” value=”true” />
<add key=”aspnet:MaxHttpCollectionKeys” value=”100000″ />
</appSettings>
<runtime>

 

 

Quickly find out the service accounts SQL services

TSQL query to find out the if the SQL services are started or stopped, accounts running them etc.,

(Tested in SQL 2012 and above)

[cc lang=”SQL]

select servicename, startup_type_desc, status_desc, last_startup_time, service_account, filename from sys.dm_server_services

[/cc]

Check SQL Server and SharePoint Version Build Dates

Whenever you would like to know the version of SQL server, there are couple of ways :

  • TSQL Command

[cc lang=”SQL”]SELECT @@VERSION[/cc]

  • SQL_version_4
  • SSMS GUI

SQL_version

  • TSQL command
  • [cc lang=”SQL”]
    XP_READERRORLOG
    [/cc]

SQL_version_2

  • TSQL command

[cc lang=”SQL”]
SELECT SERVERPROPERTY(‘productversion’), SERVERPROPERTY (‘productlevel’), SERVERPROPERTY (‘edition’)
xp_msver

[/cc]

  • SQL_version_3

Then you can check if the SQL server needs any SP updates using below table (source in hyperlink). Feel free to check for SQL & Sharepoint updated build numbers.

SQL_Server_build_numbers

 

SQL_Server_2016_build_numbers

CREATE FILE encountered operating system error 5 (failed to retrieve text for this error. Reason 15105) while attempting to open or create the physical file

After rebuilding the system databases, I was trying to restore the MDF files and received this error. It tells that it isnt able to access the MDF file. To confirm this I have verified and re-verified all the permissions running the SSMS to have full control on the DATA folder containing MDF files. Still it continued to raise error.

SSMS_as_Admin_1

It is a windows permission issue. If it is an instance, give permissions to the MSSQL$ instance account full permissions on the data folder.

Or the brute force method is to run SSMS as administrator.