SharePoint External Storage API – Crushing My Dream
When I read several months back that Microsoft was going to supply an API for external storage in MOSS/WSS, I sprang from my desk, danced around the room and babbled incoherently as if I’d been touched by Benny Hinn.
Okay, well maybe I didn’t quite do that. But what I did do was forward the KB article to a colleague whose company is the leading reseller for EMC Documentum in my town. We’d previously had one of those conversations over a few beers where we questioned SharePoint’s potentially unwise, yet unsurprising, use of SQL Server as the storage engine.
…so what’s wrong with SQL?
I am going to briefly dump on SQL here for a minute. But first, let me tell you, I actually like SQL Server! Always have. I hated other Office Server products like Exchange until the 2003 version and SharePoint until the 2007 version. But on the whole, I found SQL to be pretty good. So hopefully that will stop the SQL fanboys from flaming me!
Readers who appreciate capacity planning issues will understand the challenges SQL-based storage brings to the table. Additionally, those who have used Enterprise Information Management products like Documentum or Hummingbird (now OpenText) will nod as if Microsoft has finally realised the error of its ways with this updated API.
All of the SharePoint goodies like version control, full text indexing and records management come at a price: disk consumption and a performance drain. Microsoft says to plan for 1.5 times your previous growth in disk usage; in my own real-world results it is more like 2.5 times. Disk I/O also increases markedly.
“So what? Disk is cheap”, you reply. Perhaps so, but the disk itself was never the major cost. Given that this is a SQL database we are talking about, a backup of a 100 gigabyte SQL database could take hours and a restore possibly longer. A differential backup at the file level would still grab the entire 100 gig, since it is generally one giant database file that changes constantly! So the whole idea of differential backups during the week and full backups on weekends suddenly has to be re-examined. Imagine a disk partition gets corrupted, rendering the data useless. On a file server, this might mean 20% of shared files are unavailable while a restore takes place. In a SQL world, you have likely toasted the whole thing and a full restore is required. Organisations often overlook the fact that their existing backup infrastructure simply isn’t scalable enough to deal with SQL databases of this size.
“100 gig?”, you scoff, “don’t be ridiculous”. Sorry, but I have news for you. At an application level, there is a scalability issue in that the lowest logical SharePoint object that can have its own database is a site collection, not an individual site. For many reasons, it is usually better to use a single site collection where possible. But if one particular SharePoint site has a library with a lot of files, the entire content database for the site collection is penalised.
Now the above reasons may be the big ticket items that vendors use to sell SANs and complex backup/storage solutions, but they’re not the real issue.
The real issue (drumroll…)
It may come as a complete shock to you, but documents are not all created equal. No! Really? 🙂 If they were, those crazy cardigan wearing, file-plan obsessed document controllers and records managers wouldn’t exist. But as it happens, different content is handled and treated completely differently, based on its characteristics.
Case in point: Kentucky Fried Chicken would have some interesting governance around the recipe for its 11 herbs and spices, as would the object of Steve Ballmer’s chair throwing around its search engine page-ranking algorithms.
I picked those two obvious examples to show the extreme of documents with high intrinsic value to an organisation. The reality is much more mundane. For example, you may be required by law to store all financial records for seven years. In this day and age, invoices can be received electronically, via fax or by email. Once processed by accounts payable, most invoices have little real value.
By using SQL Server, Microsoft is in effect assigning an identical infrastructure cost to every document. Since all documents of a site collection reside inside a single SQL content database, you have limited flexibility to shift lower value documents to lower cost storage infrastructure.
How do the big boys do it then?
Documentum, as an example, stores the content itself in traditional file shares and then stores the name and location of that document (and any additional metadata) in the SQL database. Those of you who have only seen SharePoint may think this is a crazy idea that introduces much more complex disaster recovery issues. But the reality is the opposite.
Consider the sorts of things you can do with this set-up. You can have many file shares on many servers or SANs. Documentum, for example, will happily allow an administrator to automatically move all documents not accessed in three months to older, slower file storage. It moves the files and then updates the file location in SQL, so the new location is hidden from the user and they don’t even know the documents have been moved. Conversely, documents on older, slower storage that have been accessed recently can be moved back to the faster storage automatically.
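To make the mechanics concrete, here is a minimal sketch of that pattern in Python. It is not Documentum’s actual API; the table, column and share names are invented for illustration. The point is simply that the tiering job moves the bytes and updates the location pointer in the same operation, so users resolving a document through its metadata never notice.

```python
import shutil
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path

FAST_TIER = Path("/mnt/fast_share")     # new and frequently used documents
SLOW_TIER = Path("/mnt/archive_share")  # documents untouched for three months


def demote_stale_documents(db: sqlite3.Connection, max_age_days: int = 90) -> None:
    """Move documents not accessed recently to the slow tier, then update
    their location pointer so callers never notice the move."""
    cutoff = (datetime.utcnow() - timedelta(days=max_age_days)).isoformat()
    rows = db.execute(
        "SELECT doc_id, path FROM documents WHERE last_accessed < ? AND tier = 'fast'",
        (cutoff,),
    ).fetchall()
    for doc_id, path in rows:
        destination = SLOW_TIER / Path(path).name
        shutil.move(path, destination)          # move the bytes...
        db.execute(                             # ...and repoint the metadata
            "UPDATE documents SET path = ?, tier = 'slow' WHERE doc_id = ?",
            (str(destination), doc_id),
        )
    db.commit()


def open_document(db: sqlite3.Connection, doc_id: int) -> bytes:
    """Callers resolve a document through its metadata row, so they stay
    oblivious to which tier currently holds the bytes."""
    (path,) = db.execute(
        "SELECT path FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return Path(path).read_bytes()
```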
It also allows geographically dispersed organisations to have a central SQL repository for the document metadata, while each remote site keeps a local file store, so retrieval works at LAN speeds for most documents. This is a far simpler approach to geographic dispersion than anything SharePoint can do right now.
Restores from backup are quite simple. If a file server corrupts, it only affects the documents stored on that file server. Individual file restores are easy to perform, and you don’t have to do a major 100 gig database restore just to recover a few files.
Furthermore, documents that have a compliance requirement, but do not need to be immediately available, can easily be archived off to read-only media, thus reducing disk space consumption. The metadata detail of the file can still be retrieved from the SQL database, but location information in the SQL database can now refer to a DVD or tape number.
For this reason, it is clear that SharePoint’s architecture has some cost and scalability limitations when it comes to disk usage and management, largely due to SQL Databases and the limitation of Site Collections for content databases.
So how can we move less valuable documents onto less expensive disk hardware? Multiple databases? Possibly, but that requires multiple site collections and this complicates your architecture significantly. (Doing that is the Active Directory equivalent of using separate forests for each of your departments).
Note to SharePoint fanboys: I am well aware that you can ‘sort of’ do some of this stuff via farm design, site design and 3rd party tools. But until you have seen a high-end enterprise content management system, there is no contest.
So you might wonder why SharePoint is all the rage then – even for organisations that already have high end ECM systems? Well the short answer is other ECM vendors’ GUIs suck balls and users like SharePoint’s front end better. (And I am not going to provide a long answer 🙂 )
Utopia then?
As I said at the start of this post, I was very happy to hear about Microsoft’s external storage API. In my mind’s eye, I envisaged a system where you create two types of document libraries: ‘standard’ document libraries that use SQL as the store, and ‘enhanced’ document libraries that look and feel identical to a regular document library but store the data outside of SQL. Each ‘enhanced’ document library would be able to point to various different file stores, configured from within the properties of the document library itself.
Utopia my butt!
Then a few weeks back some more detail emerged in the SDK documentation and my dream was shattered. This really smells like a “just get version 1 out there and we will fix it properly in version 2” release. I know all software companies partake in this sales technique, but it is Microsoft we are talking about here. Therefore it is my god-given right… no… my god-given privilege to whine about it as much as I see fit.
Essentially this new feature defines an external BLOB store (EBS). The EBS runs parallel to the SQL Server content database, which stores the site’s structured data. You will note that this is pretty much the Documentum method.
In SharePoint, you must implement a COM interface (called the EBS Provider) to keep these two stores in sync. The COM interface recognizes file Save and Open commands and invokes redirection calls to the EBS. The EBS Provider also ensures that the SQL Server content database contains metadata references to their associated BLOB streams in the external BLOB store.
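To illustrate what the provider contract boils down to, here is a conceptual sketch in Python. The real EBS Provider is an unmanaged COM interface described in the SDK; the class and method names below are hypothetical and exist purely to show the shape of the deal: store the bytes externally, hand back an opaque ID that gets recorded in the content database, and return the bytes again when asked for that ID.

```python
import hashlib
from pathlib import Path


class FileShareBlobProvider:
    """Hypothetical external BLOB store backed by a plain file share."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def store_binary(self, content: bytes) -> str:
        """Persist the BLOB externally and return the opaque ID that the
        content database would keep instead of the bytes themselves."""
        blob_id = hashlib.sha256(content).hexdigest()
        (self.root / blob_id).write_bytes(content)
        return blob_id

    def retrieve_binary(self, blob_id: str) -> bytes:
        """Resolve an ID recorded in the content database back into bytes."""
        return (self.root / blob_id).read_bytes()


# On Save, the platform would call store_binary() and keep only the returned ID;
# on Open, it would call retrieve_binary() with that ID.
provider = FileShareBlobProvider(Path("/tmp/ebs_store"))
doc_ref = provider.store_binary(b"quarterly invoice")
assert provider.retrieve_binary(doc_ref) == b"quarterly invoice"
```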
You install and configure the EBS Provider on each Web front end server in your farm. In its current version, external BLOB storage is supported only at the scope of the farm (SPFarm).
Your point being?
If you haven’t realised why I called out that previous sentence, it is this: since the EBS Provider is only supported at farm scope, every document library on every site in every site collection in your farm now saves its data via the EBS Provider.
So there is utterly nil granularity with this approach. It’s an all-or-nothing deal. (There goes my utopian dream of two different types of document libraries.) All of the documents in your farm are going to be stored via this EBS provider!
But it gets worse!
The external BLOB storage feature in the present release will not remain syntactically consistent with external BLOB storage technology to be released with the next full-version release of Microsoft Office and Windows SharePoint Services. Such compatibility was not a design goal, so you cannot assume that your implementation using the present version will be compatible with future versions of Microsoft Office or Windows SharePoint Services.
So basically, if you invest time and resources into implementing an EBS provider, you’ll pretty much have to rewrite it all for the next version. (At least you find this out up front.)
No utility is available for moving BLOB data from the content database into the external BLOB store. Therefore, when you install and enable the EBS Provider for the first time, you must manually move existing BLOBs that are currently stored in the content database to your external BLOB store.
Okay, that makes sense. It is annoying, but I can forgive it. Basically, if you implement an EBS provider and enable it, your choices for migrating existing content into it are a backup and restore/overwrite operation, or simply waiting it out and letting the natural process of file updates do the job for you.
When using an external BLOB store with the EBS Provider, you must re-engineer your backup and restore procedures, as well as your provisions for disaster recovery, because some backup and restore functions in Windows SharePoint Services operate on the content database but not on the external BLOB store. You must handle the external BLOB store separately.
I would have preferred Microsoft to flesh this statement out, as it will potentially cause much grief for people who are not aware of it. It implies that STSADM isn’t going to give you the sort of full-fidelity backup that you expect. Yeouch! I feel I might get a few late night call-outs on that one!
Ah, but wait a minute there, sunshine, is that any different to now? STSADM backup and restore is not exactly rock solid now!
Any error conditions, resource drag, or system latency that is introduced by using the EBS Provider, or in the external BLOB store itself, are reflected in the performance of the SharePoint site generally.
Yeah whatever, this is code for Microsoft tech support’s way of getting out of helping you. “I’m sorry sir, but call your EBS vendor. Thank you, come again!”
Conclusion
I can’t say I am surprised at this version 1 implementation, but I am disappointed. If only the granularity extended to a site collection or better still an individual site, I could forgo the requirement to extend it to individual document libraries or content types.
So it will be interesting to see if this API gets any real uptake and if it does, who would actually use it!
later
Paul
MOSSuMS, gazing misty-eyed into the future…
Ideally, you should be able to assign content types and (depending on user permissions) individual content, a storage policy.
That storage policy then determines the content’s location and where it gets moved to over time.
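To picture it, something as simple as a per-content-type schedule of locations would do; here is a rough sketch (all names, tiers and ages invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class PolicyStep:
    after_days: int  # content age at which this step applies
    location: str    # where content of that age should live


# Hypothetical policies: each content type gets an age-based schedule of tiers.
STORAGE_POLICIES = {
    "Invoice": [
        PolicyStep(0, "fast-san"),         # live, frequently accessed
        PolicyStep(90, "cheap-nas"),       # processed, rarely touched
        PolicyStep(2555, "worm-archive"),  # seven-year compliance copy, read-only
    ],
    "Contract": [
        PolicyStep(0, "fast-san"),
        PolicyStep(730, "worm-archive"),
    ],
}


def target_location(content_type: str, age_days: int) -> str:
    """Return the storage location the policy dictates for content of this age."""
    steps = STORAGE_POLICIES.get(content_type, [PolicyStep(0, "fast-san")])
    current = steps[0].location
    for step in steps:
        if age_days >= step.after_days:
            current = step.location
    return current


print(target_location("Invoice", 120))  # -> cheap-nas
```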
The primary store (currently SQL Server, but I imagine LINQ will facilitate this expanding to XML etc. over time) then only needs limited identification, version, provider/location and content metadata info.
Those alternative provider/locations could be files, ECM, or, of course, other SQL server content databases.
The blurred dividing line between these stores, on what ‘lives with’ content and what is ‘defined as’ content, will broaden over time. My personal view is that all data (including code, apps etc.) is just a different type of content. Therefore catalogues/directories, storage policies, security, backup, multiple store integration and even source control integration will become truly huge issues.
I’m no futurist, so where do you see things moving?
Mike MOSSuMS Stringfellow
Great stuff Mike
I started out as an idealist but as I get older I get more pragmatic 🙂 Your storage policy concept is excellent and lends itself to granular replication scenarios as well, since replication can be viewed as another element of storage policy. It might be worth a post fleshing this idea out further.
On the blurring of dividing lines between stores and content issues, I think it would be a mistake to be an idealist here. There are simply too many forms of ‘content’ with differing requirements to be able to broaden the scope out to this level. So for now I still think in terms of ‘document libraries’ as Microsoft have currently defined them. (for now anyway)
But a ‘document library’ as it has been implemented in SharePoint is quite limiting in many run-of-the-mill document management scenarios as it is.
If I were to make a prediction on where I see things moving in relation to storage and SharePoint: since document libraries are templatable, I see your storage policy idea, in combination with a few other concepts (and the existing templating system), resulting in a raft of new, customised document libraries as a new ‘library type’.
Thanks for taking time to write this up. External storage is an interesting topic and you’re definitely on the cutting edge with this post.
It clearly isn’t a prime time solution, but it does seem that MS is heading in a useful direction. Hopefully they look at posts like this and plan to do the right thing in the near future.
Nice one Mike,
I had a run through of AvePoint’s tools last week; apparently, on the horizon is the kind of functionality you wrote about, where SQL serves as a great big list of stubs pointing to the actual files, their versions etc. that reside on file shares. So it could be not too far away. Actually, I did see this demonstrated for version history, as they have it working in that capacity.
I have no idea on pricing, but if you are a medium-sized organisation I’d reckon you could sell off a couple of company cars, or trade in old filing cabinets for scrap metal, to see if you could raise enough cash for licenses.
Cheers
AJ
Thanks for going deep:
Something EBS-like is to be built into SQL Server 2008 as the native FILESTREAM type, and combined with a DFS storage location these files would have only their changes (deltas) propagated to other sites in geographically dispersed locations (yet another shortcoming of SharePoint). Once SharePoint/SQL/DFS all work together, things can be good.
Excellent, eye-opening piece! I’m now a convert to the external-vs.-SQL storage philosophy. I used to work in the high-end DM world but drifted into SharePoint consulting full-time shortly after SPS 2001 was released, so I’ve seen both sides of this industry. Anyway, another reason why SharePoint is “all the rage” is – simply – price. For many SMBs, cost trumps functionality no matter how well we techies extol the virtues of any product.
Great read….I am working with a company that brings ECM functionality to the file server using XML objects. No need for SQL and no capacity planning issues. XML sits beside the file folders…..ping me if you want to learn more.
Mike,
Thanks for the insight. I want to implement MOSS across sites around the world, where some data is site specific, some data replicates between sites D & E, and other data replicates between sites A & G, etc. (preferably using DFS?). Without this I wonder how useful SharePoint is.
I am currently having trouble even installing SharePoint onto Server 2008, even following the convoluted process MS recommends, so I am going cold on SharePoint altogether!
Thanks again.
The FILESTREAM datatype in SQL 2008 and the next versions of SharePoint will take advantage of storing large files (over 2GB and more) out of SQL and on the file system.
That will be a good one, but…
Recently investigated DMS integration with SharePoint a little deeper, and I’ve discovered just how painfully bad all the current solutions out there are: RBS, EBS, DMS web parts, ISV providers…
None provide two-way integration of the BLOB AND its metadata, and none provide it for the TWO enterprise DMS systems required.
So a custom SharePoint solution (with all the inherent risks that involves) is the only option, and it may be irrelevant by SP2010.
Come on MS – SharePoint is meant to enable collaboration between users AND other systems!
Despite all the warnings issued with MOSS 2007 SP1, EBS lives in 2010. We have a version of our StoragePoint product that works with the 2010 Tech Preview.
So CMIS support isn’t there OOTB in SP2010? That’s what I suspected, but had hoped for more…
Let us know what you find, as EBS really doesn’t fit my current needs. Looks like I’ll be doing my own cut-down CMIS implementation at the moment.
From the MS ECM blog last year:
http://blogs.msdn.com/ecm/archive/2008/09/09/announcing-the-content-management-interoperability-services-cmis-specification.aspx
“When will Microsoft include support for CMIS into SharePoint (or other products)?
Of course, Microsoft’s goal (which is shared by all of the companies participating in the CMIS effort) is for the CMIS specification to become the interoperability standard that we can incorporate into our products to reduce the complexity of managing & integrating multiple ECM systems… and today’s announcement is an important step in that process.
As the specification goes through the OASIS Technical Committee process and approaches a final 1.0 version, we’ll provide more information on when and how you’ll see support for CMIS for SharePoint and other Microsoft products.
Ethan Gur-esh, Program Manager.”
Looks like the OASIS CMIS will be in SP2010: http://blogs.msdn.com/williamcornwill/archive/2009/10/19/microsoft-to-add-support-for-cmis-to-sharepoint-2010.aspx
Hopefully the SP2009 conference will give details of the ‘how’, so we know which path they are taking on this…
The need for “special” backup/restore procedures for external content storage as described above is not special or new. This has been the case for > 10 yrs with the big boy products (FileNet, Documentum, OpenText, etc.).
The key has always been that metadata is king. Therefore always back up metadata before content, and always restore content before metadata. Nothing new. It also brings with it the potential of having orphaned files, for which the big boys have always had cleanup utilities.
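Spelled out as a sketch (names invented, the ordering is the point): metadata is captured first on backup, content is restored first on restore, and anything in the content store that no metadata row references is an orphan for the cleanup utility.

```python
from pathlib import Path
from typing import Callable, Iterable, Set


def backup(snapshot_metadata: Callable[[], None],
           snapshot_content: Callable[[], None]) -> None:
    snapshot_metadata()  # metadata first: every reference it contains...
    snapshot_content()   # ...will exist in the (newer) content snapshot


def restore(restore_content: Callable[[], None],
            restore_metadata: Callable[[], None]) -> None:
    restore_content()    # content first, so that no restored metadata row...
    restore_metadata()   # ...points at a file that has not arrived yet


def find_orphans(content_root: Path, referenced_ids: Set[str]) -> Iterable[Path]:
    """Content files that no metadata row references; candidates for cleanup."""
    return [p for p in content_root.iterdir() if p.name not in referenced_ids]
```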
One thing that will be useful for success with EBS and RBS is support for interfaces on top of SharePoint, which makes itself the interface for content storage when EBS and RBS are in use. The system associating the metadata with the content has to be the “entry point” for integration. Standards like CMIS will be helpful, but integration with big ERP systems (like SAP) requires support for ArchiveLink, and I suppose other interfaces/integrations from other ERP vendors will also be necessary.
On the topic of high-end (higher-end) document management, I’d be very interested in any information on throughput and capacity. The latest I heard was that SP 2010 was designed for repositories of up to 50 million documents, which is a little “low” – particularly for use for records management. I later heard that 100 million was the top end. I’d be interested in some trustworthy hard data on the topic. I would also like to understand document import/export rates. Generally, the systems we have are able to deliver about 20-30 documents/second out and 10-20 documents/second in. You’d really need at least 5-10/second as a minimum in either direction.
As per the criteria demanded for certification by European firms, especially in Germany by the TÜV, all archived documents are to be stored on media that cannot be modified later on. In other words, you have to store archived documents on WORM storage.
SharePoint stores documents in BLOBs, which is not accepted. In other words, while other document management systems are easily certified, since they store their data on file systems and can thus archive data on WORM media, SharePoint cannot be used as a complete document management system.
It is a pity that Microsoft does not seem to be interested in what the real world needs but lives in its own ivory tower.
I just wanted to add to this discussion that we have developed software called STEALTH Software Content Store, which uses both the EBS and RBS APIs to connect SharePoint with an external storage environment. This external storage platform can be based on private (cloud) storage software from Caringo CAStor, ParaScale or Bizanga Store, or if you prefer a public storage solution you can connect with Windows Azure, Amazon S3 and EMC Atmos. Opting for this storage infrastructure, you liberate your SharePoint platform from its SQL Server chains. When you upload content in SharePoint, the STEALTH software separates the metadata from the content (BLOB). The metadata is stored in SQL; the content goes into the external storage.
This simple action has a major impact on SharePoint: performance is back at the level it was when you started with SharePoint, and you can scale SharePoint as much as you want to. If you want to put all your company content in SharePoint, there is no barrier anymore that will hold you back. We have presented this ‘revolutionary’ infrastructure to companies who were struggling with their 200-300GB SQL storage. When management found out that they could finally make full use of SharePoint, they decided to put all their content in SharePoint, creating a store of several TBs. Which of course is not a big thing for external storage environments; for SQL it would have been a big issue.
One of the other results that our customers liked was the very fast backup and restore times. With only metadata in SQL, backup and restore times are measured in minutes rather than hours, to the great relief of the people responsible for backups and SLAs. Backup of the content is not needed, as the external storage environment automatically duplicates the content.
With the arrival of SharePoint 2010 things become a little more fun, because with RBS the STEALTH Software Content Store can actually direct content to different storage environments. Say you want some content fast: then use a ParaScale or Bizanga Store environment, as these guys have fast I/O. If your colleague in archiving is more interested in retention periods, then he or she can store the content in a CAStor storage environment, where CAStor takes care that the item is stored safely for 7 years. And if you have content that is not business critical but valuable enough to keep, why not store it in Windows Azure? As we encrypt all SharePoint content with AES (Advanced Encryption Standard, 256-bit), you don’t have to worry about the security of public cloud storage; your content is already secured.
The bottom line of my story is that Microsoft indeed has not addressed the SQL storage issue with the arrival of SharePoint 2010; Microsoft, being a software company, considers connecting to all kinds of hardware a task for the hardware companies or for software developers. STEALTH Software, being a software developer, has created a solution for it: a solution that will enable organizations to use SharePoint for the purpose they got it for in the first place. SharePoint (2010) is a worthy Enterprise Content Management system that users all over the world love. It is actually bizarre that at the back end SharePoint is ‘being held hostage’ by limited storage capabilities, where the main options are usually to buy more expensive storage or to prevent the upload of more content into SharePoint. I even heard of a situation at a big enterprise company where a mail was sent to users asking them to delete content in order to lower the storage! Well, that is completely in line with the spirit of SharePoint… not! Using another (very expensive) content management system as an add-on to SharePoint is also quite extraordinary, laying an unnecessary burden on the users (two learning curves) and operational managers, not to mention the huge costs.
There is a solution, and it is actually not complex, but it requires guts to look at storage infrastructure in a different way and leave the well-known traditional path. I have also noticed disbelief: how can a storage infrastructure cost only, let’s say, 30% of what we usually pay? Well, what can I say: times have changed. SharePoint is taking over the content management world, and the ‘Watercurtain’ storage infrastructure will eventually replace the file server silos. Life will be less complex and cheaper… what else do you want?
If you would like more information, feel free to go to our site: http://www.synergy4.eu/resources/information-kit
PS Pat: by using external storage that is WORM-compliant like Caringo CAStor, EMC Centera, DELL DX Object Store and Hitachi HCAP you can use SharePoint as means to store your archived documents.