hawk.ro / stories / PowerShell, MgGraph and files uploading

PowerShell, MgGraph and files uploading

The why

Probably because I have an appetite for pain, but also because I had to get a bunch of files (around 400,000; totalling around 800GB) onto SharePoint online (MS365), I started investigating how I could upload them using a PowerShell script. My naive first assumption was that I could simply rely on the OneDrive sync agent, but that one takes its sweet time and might, eventually, at some point before the heat death of the Universe, get around to it. If it doesn't crash first.

Another minor requirement was that I quite wanted to preserve the lastModifiedDateTime (aka mtime) of the files, and the whole thing had to be done without tenant admin rights.

The first steps

A word about terminology. In this context, drive refers to a SharePoint document library. Many cmdlets have drive in their name and take a DriveId parameter (good thing I didn't have to type -DocumentLibraryId every time *shudder*), so drive it is.

The first steps were relatively easy. After using a convoluted way to figure out the SiteId, Get-MgSiteDrive with that SiteId yields the DriveId that's needed for all further interactions.
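For illustration, a minimal sketch of that first step (not necessarily the convoluted route mentioned above; the site URL and library name below are placeholders, and the hostname:/sites/path form of the SiteId is just one of the ways Graph will accept it):

Connect-MgGraph -Scopes 'Sites.ReadWrite.All','Files.ReadWrite.All'
$site    = Get-MgSite -SiteId 'contoso.sharepoint.com:/sites/MySite'
$drive   = Get-MgSiteDrive -SiteId $site.Id | Where-Object Name -eq 'Documents'
$driveId = $drive.Id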

Then the fun begins: there are two functions, er, cmdlets1 that list drive contents, depending on whether one is listing the root of the drive or a folder2. Smart.

As opposed to... I don't know, probably all hierarchical filesystem APIs in the past half a century3, there are separate calls depending on whether one deals with the root of the drive or any other folder. Yes, I know that in actual fact the entire thing is flat and folders are much less important than they would be in a normal filesystem, but we're still operating on filesystem abstractions here. Also, a flat FS doesn't really scale (good luck browsing a flat Document Library with thousands of files in it), but let's not get into that right now.4

Anyway, let's pretend root is forbidden territory and only work within existing folders, just for simplicity's sake. Decide on a folder and move on. Just as a proof of concept, I try to recursively dig into an (existing) drive and list the files found there, and that seems to go fine. Cautious optimism ensues.
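A rough sketch of that proof-of-concept listing (assuming the two cmdlets in question are Get-MgDriveRootChild and Get-MgDriveItemChild, and that the Folder facet is $null on plain files - treat the exact names as assumptions, not gospel):

function Get-DriveTree {
    param([string]$DriveId, [string]$ItemId, [string]$Prefix = '')
    # children of a folder; for the root itself one would call Get-MgDriveRootChild instead
    $children = Get-MgDriveItemChild -DriveId $DriveId -DriveItemId $ItemId -All
    foreach ($c in $children) {
        "$Prefix$($c.Name)"
        if ($c.Folder) {    # folder facet present => recurse
            Get-DriveTree -DriveId $DriveId -ItemId $c.Id -Prefix "$Prefix$($c.Name)/"
        }
    }
}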

So far I can list items (files and folders), how about creating files or folders? I found the New-MgDriveItem (and its sibling New-MgDriveItemChild) and it was at that point, when reading about them on the official Microsoft documentation site (learn.microsoft.com), that the seed of this rant fell on fertile ground. And the ground was made more and more fertile as expletives addressed to Microsoft and its documentation started piling up. The next paragraphs are actually the first ones that I wrote for this story.

In the beginning was the rant

I went to Microsoft learn, and ended up on the page describing the New-MgDriveItem cmdlet (same goes for New-MgDriveItemChild): https://learn.microsoft.com/en-us/powershell/module/microsoft.graph.files/new-mgdriveitem?view=graph-powershell-1.0
I was interested in "-File"; the relevant section indicated that "To construct, see NOTES section for FILE properties and create a hash table."
...mkay. Scrolling down in the (huge) NOTES section... the page abruptly ends. Specifically, it ends with

[WebHtml ]: For embed links, this property contains the HTML code for an
and that's it. for an. End of page. Nothing more.

Eventually I ended up on GitHub. Did you know that GH is unable to show this file at all?

Screenshot from GitHub interface, trying to look at New-MgDriveItem.md of 2.1MB and GitHub reporting "(Sorry about that, but we can’t show files that are this big right now.)"
New-MgDriveItem.md by Microsoft being too large for GitHub to display

I agree that there should be a limit somewhere, but drawing that limit at 2MB of text in the year 2024, because otherwise their poor little servers will be too overloaded5, strikes me as a bit on the short side. Not to mention that it's their own [EXPLETIVE] documentation! But at least I can get the raw file, which, while a bit on the large side (just over 2MB), is still manageable. I'm going to use this example the next time someone praises web this and cloud that.

In ye olden days of yore, first, if there was documentation (and there usually was), it didn't end suddenly after around 2900 lines (without even a sign that there is something more there), and second, the various user interfaces in use back then were even able to display formatted text documents of (shrug) millions of characters!

In the year 2024, one of the largest software companies in the world is unable to show -on their own site- the full documentation page for one of their own commands. It doesn't even indicate that the page is incomplete. The same company, having acquired and using a different platform (GitHub), is unable to render on said platform the contents of a markdown file that is 2.1MB in size.

Photo of an HP 200LX Palmtop PC with 2MB of RAM. The screen shows WordPerfect view of MGDRVI.MD, 2,199,941 bytes, at 7%. The screen is full of text, first two lines containing - `[WebHtml <String>]`: For embed links, this property contains the HTML code for an <iframe> element that will embed the item in a webpage. [...]
The above-mentioned file being displayed on the screen of a palmtop (with an 80186 16-bit CPU), more than a quarter of a century old.

Get on with it!

Ok, back to the story. -File should be a hash table that contains some information about the file to be created. Same thing goes for folders. Except that... it doesn't work. I keep receiving error messages from the back-end that the file (or folder) facet is mandatory, despite being provided.

Eventually, I find another example for creating a folder, on the API page rather than on the PowerShell page. That one constructs a bigger hash table containing some additional info and, instead of passing -Folder, passes the constructed table as -BodyParameter. In their example, the Folder section also has a property (i.e., it's not empty), so it seems that despite none of the properties being mandatory, at least one must exist.

So, to summarize: to create a file or a folder, one must build the entire BodyParameter hash table and pass that as -BodyParameter to New-MgDriveItemChild. Said hash table must contain either a file or a folder sub-table, and that one must contain at least one key-value pair, e.g. ChildCount=0 for folders, or MimeType='application/octet-stream' for files.
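In code, that ends up looking roughly like this (a sketch: the names and the parent item are placeholders, and the facet contents are the minimal ones described above):

$parentId = '...'    # DriveItemId of the folder to create things in

# a folder: the folder facet must contain at least one property
New-MgDriveItemChild -DriveId $driveId -DriveItemId $parentId -BodyParameter @{
    name   = 'NewFolder'
    folder = @{ childCount = 0 }
}

# an (empty) file: same idea, with the file facet instead
New-MgDriveItemChild -DriveId $driveId -DriveItemId $parentId -BodyParameter @{
    name = 'report.bin'
    file = @{ mimeType = 'application/octet-stream' }
}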

Good, I now have folders and (empty) files. Fill the files with bytes!

Sending the bytes to the Cloud. Slicing, slice size and poking fun at MS.

First approach: Set-MgDriveItemContent seemed promising, even though it sort of mentioned that maybe it's just for small-ish files. Having tried it, I will add: And if one doesn't mind that it can randomly fail for these as well, for no discernible reason whatsoever.
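For reference, the simple path looks something like this (parameter names assumed from the SDK's usual -InFile convention for *-Content cmdlets; small files only, and see the caveat above about random failures):

Set-MgDriveItemContent -DriveId $driveId -DriveItemId $itemId -InFile 'C:\data\small-file.bin'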

The hard way it is, then. Create an upload session (New-MgDriveItemUploadSession) and pour bytes there. As with everything else, this is harder than it looks.

First, while this cmdlet exists, there is no corresponding "pour bytes into the upload session" cmdlet - one has to construct one's own request headers and then Invoke-MgGraphRequest with -Method PUT in order to upload the bytes there. <Sarcasm>As an aside, I think Microsoft is a bit salty about always being called on the whole 640K ought to be enough for everybody thing and decided, you know what would be best? If we were to enforce the size of the upload slices to be exactly half of that.</Sarcasm> You think I'm joking? Quoth the raven, er, documentation:

Note: If your app splits a file into multiple byte ranges, the size of each byte range MUST be a multiple of 320 KiB (327,680 bytes).
From: https://learn.microsoft.com/en-us/graph/api/driveitem-createuploadsession?view=graph-rest-1.0
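Pieced together, the session setup and the per-slice headers look roughly like this (a sketch: the conflictBehavior choice and the 20 MiB slice size are mine, $itemId is the empty file created earlier, and $Path is the local file):

$csize = 64 * 327680      # slice size: 20 MiB, i.e. a multiple of 320 KiB as demanded above
$us = New-MgDriveItemUploadSession -DriveId $driveId -DriveItemId $itemId -BodyParameter @{
    item = @{ '@microsoft.graph.conflictBehavior' = 'replace' }
}

$fsize  = (Get-Item $Path).Length
$offset = 0               # advanced by $csize after each slice
$hdr = @{
    'Content-Length' = "$csize"
    'Content-Range'  = "bytes $offset-$($offset + $csize - 1)/$fsize"
}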

Wait. How to get the bytes to send?

With the proper magic in place... wait. How do I actually get the bytes out of the file? Get-Content seemed like a good candidate for my approach but of course, no such luck. Get-Content -Raw -Encoding Byte seems like it might do something related, but I can't get just a slice; it's all or nothing. Luckily (and in all seriousness, this is one of the things I really like about PowerShell), I can reach down into .NET and just do:

# $csize is the slice size (a multiple of 320 KiB), $Path the local file
$buf = New-Object byte[] $csize
$stream = [System.IO.File]::OpenRead($Path)
# error checking, getting things in place, and then...
$rc = $stream.Read($buf, 0, $csize)    # $rc = number of bytes actually read
# some more stuff before finally sending the bytes "up in the cloud"
# ($us is the upload session, $hdr holds the Content-Length/Content-Range headers):
$urr = Invoke-MgGraphRequest -Uri $us.UploadUrl -Method PUT -Headers $hdr -Body $ubuf -SkipHeaderValidation
A couple of notes: despite the header being constructed exactly as recommended in the documentation, Invoke-MgGraphRequest isn't happy with it, and -SkipHeaderValidation is needed to get around that. Also, keen eyes might have noticed that I read into $buf and send $ubuf. Yeah, about that...

More slicing woes

I lost two [EXPLETIVE] hours on that! You see, there are two things at play here. First, file sizes have a habit of not being an exact multiple of half of 640KB (pardon, 320KB; surprising, I know), so, by necessity, the last slice has to be smaller. Secondly, of course Invoke-MgGraphRequest doesn't have an option to send just part of the buffer; no, that would be too easy. PowerShell supposedly allows one to extract just part of an array using the syntax $buf[0..($len-1)], and indeed that seems to give the right result, except that it's not, and the back-end complains about a size mismatch. Thus the need to create a separate buffer of just the right size and copy the range into it, so that I can finally finalize the file. Who cares about one more memcpy at this point?
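For completeness, the workaround ends up looking something like this (variable names match the earlier snippet; [Array]::Copy is simply one way to do the copy):

if ($rc -lt $csize) {
    # final slice: copy only the bytes actually read into a buffer of exactly that size
    $ubuf = New-Object byte[] $rc
    [Array]::Copy($buf, $ubuf, $rc)
} else {
    $ubuf = $buf
}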

By default, the upload session is closed automatically once the final part is uploaded, so that only leaves the small task of modifying (once more) the lastModifiedDateTime so that the file has the original timestamp, and... done.
Almost.

The two surprises

The mtime

For files: do it after the upload. For folders: do it after uploading all the files in that folder (otherwise the subsequent writes bump the timestamp right back).

Build a hash table containing a fileSystemInfo hash table, which in turn contains createdDateTime and lastModifiedDateTime. Those should be set to the ISO 8601 timestamp string (for comfort I used the same timestamp for both). MS documentation being what it is, I don't really know where the 'O' comes from, but, for a given datetime, the following incantation yields the ISO string:

$ISO_Timestamp=$de.LastWriteTimeUtc.GetDateTimeFormats('O')[0]
In this example, $de is one of the elements returned by Get-ChildItem for a local path
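Put together, the timestamp fix-up comes out something like this (Update-MgDriveItem and the facet layout follow the driveItem documentation, but treat the exact names as assumptions; for what it's worth, 'O' is .NET's "round-trip" standard format specifier, which happens to produce an ISO 8601 string):

$ts = $de.LastWriteTimeUtc.GetDateTimeFormats('O')[0]
Update-MgDriveItem -DriveId $driveId -DriveItemId $itemId -BodyParameter @{
    fileSystemInfo = @{
        createdDateTime      = $ts
        lastModifiedDateTime = $ts
    }
}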

Conclusion

It is possible to use this approach for uploading large quantities of files. The process is single-threaded (on purpose!) and not very fast, but seems quite robust. Currently the script is designed to be used on empty locations: it doesn't do a "merge" of existing folders, and it doesn't try to deal with naming conflicts (the API has a provision for this). It also records the failures into a global variable (available once the script finishes), for further analysis. So far I've only encountered a few errors (mainly caused by naming conflicts).

The script is partly an exercise for myself, partly a scratch for a particular itch - the above-mentioned need to upload a bunch of files. I think the largest single file uploaded was around 30GB. There are no provisions to deal with rate-limiting responses from the back-end; these might come in a future revision, or never, given that the itch is mostly gone.

Just to play the irony to the end, the script is on GitHub https://github.com/Hawkuletz/MgExp.


1 I suppose these cmdlets implement API calls, so it might be an API rather than function/cmdlet issue

2 I used to insist on calling them directories. Everyone else is now calling these folders. I gave up. Folders. In its favor, it has fewer letters, and I already think I will develop RSI from typing PowerShell commands.

3 IIRC, FILES-11 ODS-2 on VMS (and, supposedly on RSX-11 as well) had a root directory called [000000], but nevertheless, a root directory

4 I don't care what's in the back-end of a SP "drive", whenever I think about that, I remember https://www.jwz.org/blog/2004/03/when-the-database-worms-eat-into-your-brain/. Yes I know they've been wanting to pile these worms on us since Longhorn.

5 Of course, Microsoft can't really afford server capacity nowadays, what with everything being devoted to AI, who needs documentation in this day and age?!


Published 2024-11-07 by Mihai Gaitos - contact@hawk.ro