Azure Upload a large file via chucking ~ dev

If you can set a couple of properties and upload a file in blocks easily, why would you want to do it programmatically? The case that immediately comes to mind is if you have files that are less than 1 MB and you want to send them up in 256kb blocks. The minimum value forSingleBlobUploadThresholdInBytes is 1 MB, so you can not use the method above.

Another case is if you want to let the user pause the upload process, then come back later and restart it. I’ll talk about this after the code for uploading a file in blocks.

To programmatically upload a file in blocks, you first open a file stream for the file. Then repeatedly read a block of the file, set a block ID, calculate the MD5 hash of the block and write the block to blob storage. Keep a list of the block ID’s as you go. When you’re done, you call PutBlockList and pass it the list of block ID’s. Azure will put the blocks together in the order specified in the list, and then commit them. If you get the Block List out of order, or you don’t put all of the blocks before committing the list, your file will be corrupted.

The block id’s must all be the same size for all of the blocks, or your upload/commit will fail. I usually just number them from 1 to whatever, using a block ID that is formatted to a 7-character string. So for 1, I’ll get “0000001”. Note that block id’s have to be a base 64 string.

Here’s the code for uploading a file in blocks. I’ve put comments in to explain what’s going on.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

CloudBlockBlob blob = cloudBlobContainer.GetBlockBlobReference(Path.GetFileName(fileName));

int blockSize = 256 * 1024; //256 kb

using (FileStream fileStream = 

  new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))

{

  long fileSize = fileStream.Length;

  //block count is the number of blocks + 1 for the last one

  int blockCount = (int)((float)fileSize / (float)blockSize) + 1;

  //List of block ids; the blocks will be committed in the order of this list 

  List<string> blockIDs = new List<string>();

  //starting block number - 1

  int blockNumber = 0;

  try

  {

    int bytesRead = 0; //number of bytes read so far

    long bytesLeft = fileSize; //number of bytes left to read and upload

    //do until all of the bytes are uploaded

    while (bytesLeft > 0)

    {

      blockNumber++;

      int bytesToRead;

      if (bytesLeft >= blockSize)

      {

        //more than one block left, so put up another whole block

        bytesToRead = blockSize;

      }

      else

      {

        //less than one block left, read the rest of it

        bytesToRead = (int)bytesLeft;

      }

      //create a blockID from the block number, add it to the block ID list

      //the block ID is a base64 string

      string blockId = 

        Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}",

          blockNumber.ToString("0000000"))));

      blockIDs.Add(blockId);

      //set up new buffer with the right size, and read that many bytes into it 

      byte[] bytes = new byte[bytesToRead];

      fileStream.Read(bytes, 0, bytesToRead);

      //calculate the MD5 hash of the byte array

      string blockHash = GetMD5HashFromStream(bytes);

      //upload the block, provide the hash so Azure can verify it

      blob.PutBlock(blockId, new MemoryStream(bytes), blockHash);

      //increment/decrement counters

      bytesRead += bytesToRead;

      bytesLeft -= bytesToRead;

    }

    //commit the blocks

    blob.PutBlockList(blockIDs);

  }

  catch (Exception ex)

  {

    System.Diagnostics.Debug.Print("Exception thrown = {0}", ex);

  }

}

You can actually split the file up and upload it in multiple parallel threads. For my use case (customer has insufficient internet speed), that wouldn’t make sense. If he can’t upload chunks bigger than 256MB, then he can’t upload 2 or 3 or 4 of those at the same time. But if you have decent upload speed, you could definitely upload multiple blocks in parallel.

What if you what to give the customer the ability to start an upload, stop it, and resume it later? The customer is uploading a file with your application, and he hits pause and goes off to do something else for a while. When he hits pause, you just stop uploading the file. When he comes back and asks to resume the upload, call to get a list of the uncommitted blocks that have been uploaded and put each blockListItem.Name in a List. Start reading the file from the beginning. Read each block in and create the blockID the same way you created it before. Add this to the list of blockIDs that you are going to use to commit all the blocks at the end. See if the blockID is in the list of uncommitted blocks. If it is, remove it from the list of uncommitted blocks because you’ve found it, and won’t find it again, so why bother leaving it in the search list? If the blockID is not in the list of uncommitted blocks, call PutBlock to write the block to Blob Storage.

After reading the whole file and putting all of the missing blocks, call PutBlockList with the list of blockIDs to commit the file.

This is pretty close to the same code as above, except it calls to get the list of uncommitted blocks, and does the check to see if the block is already committed before writing the block.

Instead of requesting a list of committed blocks from blob storage, you could keep track of the list on your own and store it somewhere on the customer’s computer. I’d rather query blob storage, it feels safer somehow because the list can’t be accessed by the customer. (It is, after all, his computer).

Another consideration you might think about is if the file the customer is uploading can be changed between the time he starts the upload and the time it finishes. When I used this upload method, I was taking a bunch of images and an mp3 file and creating a zip file with a unique name and uploading the zip file. The customer could find the zip file on the computer and mess with it, but it was extremely unlikely. Also, if the customer created another zip file, it would be queued after the first one, and start uploading after the first upload finished.

You can upload some blocks, wait a couple of days, upload some more blocks, wait another couple of days, etc. Uncommitted blocks will be cleared automatically after a week unless you add more blocks to the same blob or commit the blocks for the blob. Here’s the code you can use to retrieve the list of blocks; the print statement shows you the members you can access for each block, and you can see the blockListItem.Name and the property telling if it’s a committed block.

1

2

3

4

5

6

7

8

IEnumerable<ListBlockItem> blockList = blockBlob.DownloadBlockList(BlockListingFilter.All);

index = 0;

foreach (ListBlockItem blockListItem in blockList)

{

  index++;

  System.Diagnostics.Debug.Print("Block# {0}, BlockID: {1}, Size: {2}, Committed: {3}",

      index, blockListItem.Name, blockListItem.Length, blockListItem.Committed);

}

To test your code, you can run the regular upload in debug and stop it when it gets past a handful of blocks, then run the routine that checks for the block status, uploads the rest of the blocks, and commits all of the blocks.

One thing to note: You can add the code to get the blocks and do the check to see if they are committed before uploading them to your Upload routine. I chose not to do this, but to use an almost-identical copy with those code bits added in, because retrieving the list of blocks will take some small usage of the performance, so I only want to incur that hit when I know there is a possibility that the file has been stopped and needs to be restarted.

dev

stolen from around the world

Wednesday, November 19, 2014

Azure Upload a large file via chucking