When designing highly scalable architectures for modern machines, you will often need to do some form of manual parallelism control. Managing this is not always easy, but in this blog I will give you one piece of my toolbox to help you.
Let us walk through an example together, a tiny case study. This is a problem which many of you will be familiar with.
Let us say you have 16 files that you want to load into the same table in your database in an automated manner. The naïve approach will do something like this:
BULK INSERT MyTarget FROM ‘C:tempMyFile1′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘n’)
BULK INSERT MyTarget FROM ‘C:tempMyFile2′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘n’)… etc…
BULK INSERT MyTarget FROM ‘C:tempMyFile16′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘n’)
Now here is the problem with this approach: it executes one statement at a time. Sequential execution is BAD, you need to stop thinking about the world like that if you want to scale on a modern architecture.
Lets assume we have enough hardware resources (in this case, it would take a blade server and a decent I/O system). What we really want is to run every one of these statements in parallel. Unfortunately, SQL server does not have a command to start up new connection from inside T-SQL… what to do?
Getting to the Command Line
Because you cannot execute more than one command on a single connection at a time, we will need multiple connections to SQL Server and this mean we have to go back to the command line. Let us start by creating a little batch file Worker.Cmd with this content:
REM Worker.Cmd File
CALL SQLCMD –S.MyServer –q”BULK INSERT MyTarget FROM ‘C:TempMyFile%1’ …EXIT
This allows us to invoke a bulk load for the first file by executing: Worker.Cmd 1
Unfortunately, we still cannot start multiple connections without manually firing up a lot of command prompts. The coders in the audience may at this point reach for their favorites programming language to write a little utility that can spawn multiple copies of an executable.
However, there is a problem with such a home made executable: you cannot generally rely on a server having the necessary runtime libraries. Typical comments might be:
“No, we don’t have .NET 4.0 here, this is not yet certified by our infrastructure department. Could you recompile it for 1.1 please?”.
“Power Shell is much too fancy for us, what is wrong with running this on Windows 2000?”
Perhaps this customer is just skeptical about letting you run your executable on a server. This may sound silly, but I have seen this happen too many times to make assumptions.
Start to the Rescue
There is a very nice little utility for the good old command prompt that allows you to fire up new processes: START.EXE. This comes with all versions of Windows and it takes any command line executable as input, fires it up in a new thread and returns control back to the caller.
Using start.exe, we can write batch script that fire up multiple copies of the same executable. It looks like this:
REM SpawnMany.cmd
REM Author: Thomas Kejser
REM Purpose: Spawns many copies of the same executable. Useful for running many things in parallel@ECHO OFF
ECHO Spawning %2 copies of %1FOR /L %%i IN (1, 1, %2) DO (
ECHO Spawning thread %%i
START "Worker%%i" /Min %1 %%i %2
)
Each new process is started in a minimized window and we pass the thread number and the total number of threads to it. Using this little batch script, we can now do this:
SpawnMany.exe Worker.Exe 16
This starts 16 workers, each with their own thread number assigned. Very useful for running stuff in parallel in a quick and dirty way. For example, I use this to run the TPC-H data generator dbgen.exe highly parallelized.
Notice that I added the EXIT command at the end of the worker.cmd batch. This makes sure that the window closes itself when done executing.
Summary
In this blog, I have shown you how to write a little batch script to fire up multiple threads, from the command line. each doing their own work in parallel. The script is “zero dependency” which makes it ideal for server use and for hacking together quick and dirty parallelism for test scenarios.
I mentioned that SQL server does not have a way to start up new connections from T-SQL. This is not strictly true. Sorry for leading you astray, but I wanted you to see how to do this from the command line first (and go through the pain ). There is a way to hack SQL Server and implement a stored procedure I like to call sp_executesql_async. This will be the subject of a future blog, but since I am heading into a lab for a few weeks, you just have to wait for it.
More about Warehousing at: DW and Big Data
The post Running Many Batch Statements in Parallel appeared first on Fighting Bad Data Modeling.