Azure Data Factory – The basics

Exploring how to use and deploy Azure Data Factory

Nguyễn Hoàng Vũ, a senior, passionate developer on the Rangers team, recently published an excellent series of blog posts about Azure Data Factory. We are pleased to cross-post them here. His own blog has plenty of good posts that are worth following.

Azure Data Factory is a service that helps copy or transform data from one data source to another. Some use cases where we can use Azure Data Factory are:

  • Restore data from a production database to a development database every day.
  • Transfer data from an on-premises SQL Server database to an Azure SQL database.
  • Back up data from a production SQL Server database to CSV files every night at 1 AM.

There are many use cases for Data Factory, but I will introduce the first one, which I worked on last week.

Three basic things

To use Data Factory, we have to prepare three basic things that it needs in order to work.

1. Linked services

Linked services are much like connection strings; they define the information that Data Factory needs to connect to external resources. For example, the connectionString in the JSON below describes the connection information for an Azure SQL database.

{
    "name": "ProdSQLLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString":
                "Data Source=production.database.windows.net;Initial Catalog=ProdDB;Integrated Security=False;User ID=Username;Password=Password;"
        }
    }
}

So in this use case, we will have two linked services: one for the production database and one for the development database (a sketch of the development one follows the list below).

  • ProdSQLLinkedService
  • DevSQLLinkedService
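
For reference, the development linked service could look like the sketch below. This assumes the development database is also an Azure SQL database; the server name, database name, and credentials are placeholders you would replace with your own values.

{
    "name": "DevSQLLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString":
                "Data Source=development.database.windows.net;Initial Catalog=DevDB;Integrated Security=False;User ID=Username;Password=Password;"
        }
    }
}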

2. Datasets

Datasets represent data structures within the data stores, such as the table we want to restore. For example, the dataset below describes the Users table in the production database.

{
    "name": "Input_ProductionUsersTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "ProdSQLLinkedService",
        "structure": [
            {
                "name": "UserId",
                "type": "Guid"
            },
            {
                "name": "Username",
                "type": "String"
            },
            {
                "name": "DateOfBirth",
                "type": "Datetime"
            }
        ],
        "typeProperties": {
            "tableName": "Users"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

So in this use case, we will also have two datasets: one for the table in the production database and one for the table in the development database (a sketch of the output dataset follows the list below).

  • Input_ProdUsersTable
  • Output_DevUsersTable
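
The output dataset could be sketched as below. It assumes the development database has a Users table with the same structure, and it points to DevSQLLinkedService instead of the production linked service; because this data is produced by the pipeline itself, the external flag is left out.

{
    "name": "Output_DevUsersTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "DevSQLLinkedService",
        "structure": [
            {
                "name": "UserId",
                "type": "Guid"
            },
            {
                "name": "Username",
                "type": "String"
            },
            {
                "name": "DateOfBirth",
                "type": "Datetime"
            }
        ],
        "typeProperties": {
            "tableName": "Users"
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}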

3. Pipeline

A pipeline contains one or more activities that together perform a task. An activity could be a copy activity, a transformation activity, or a custom activity that does anything we want with our data. So in my use case, I will put a copy activity in the pipeline.

{
    "name": "CopyPipeline",
    "properties":
    {
        "description": "Copy data in table User from Prod database to Dev database for checking",
        "activities": [
            {
                "name": "CopyFromSQLToSQL",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "Input_ProdUsersTable"
                    }
                ],
                "outputs": [
                    {
                        "name": "Output_DevUsersTable"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60:00:00"
                    }
                },
                "policy": {
                    "concurrency": 1,
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ],
        "start": "2016-07-12T00:00:00Z",
        "end": "2099-12-31T00:00:00Z"
    }
}

Finally, we will have two linked services, two datasets, and one pipeline with a copy activity inside.

(Diagram: the three basic things of Azure Data Factory – linked services, datasets, and a pipeline)

In the next blog post, I will continue with how to deploy them to Data Factory through the Azure portal.

Summary

Azure Data Factory supports many ways to work with your data, but the three things above are always the first ones you need to understand when you walk into this world.