I’ve seen many companies try to migrate from their legacy SAS platforms to modern cloud architectures like Databricks – and they all hit similar issues: legacy code that few people properly understand.
So, here’s my take on how you can migrate from SAS to Databricks – though it could easily be adapted to other platforms. There are essentially 4 options:
1. Leave the legacy alone and just migrate the SAS datasets.
This can easily be achieved by exporting the sas7bdat files to your storage account and then using either pandas or a Spark library to load the data into a DataFrame. I would recommend the Spark route, as pandas will take up a lot of memory to create the DataFrame, whereas the distributed nature of Spark can spread the work across the cluster.
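As a minimal sketch of both routes – assuming the open-source spark-sas7bdat package is installed on the cluster, and with purely hypothetical paths and table names:

```python
import pandas as pd


def load_sas_pandas(path: str):
    # Simple, but the whole dataset lands in driver memory.
    return pd.read_sas(path, format="sas7bdat")


def load_sas_spark(spark, path: str):
    # Distributed read via the open-source spark-sas7bdat package,
    # so the cluster shares the work instead of one machine.
    return (
        spark.read
        .format("com.github.saurfang.sas.spark")
        .load(path)
    )
```

On Databricks you would call something like `load_sas_spark(spark, "/mnt/landing/customers.sas7bdat")` and save the result as a table for users to query.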
Pros: Very quick and easy to do.
Cons: This isn’t really a migration as the legacy SAS remains.
This can be a useful step in getting users accustomed to the cloud environment, using data they’re familiar with, before tackling one of the options below.
2. Use a SAS code conversion tool to convert to Python
There are a few of these out there. Putting it simply, yes, they can convert the code. From experience, a tool usually covers 90% of it, with the remaining 10% requiring manual intervention. Sounds like a great idea, right? Well, yes and no. If your code is properly commented and your engineers/analysts know what the code does, then it could be really useful. However, in my experience, many of the core datasets people use were written by someone who is no longer at the company. The code may include hard-coded values in transformations with no record of why they’re there. It may calculate a measure in a slightly different way than expected. So it really comes down to this: how well do you know your code?
On top of that, converting SAS straight into Python is not going to be an optimal route either. SAS is procedural: you write your programs in logical steps. With Spark and SQL, things are more set-based and more efficient. I have seen programs where steps are converted to cells in a notebook whose only job is to rename or sort the data. Think how easy that is to do in SQL and you get the idea. That’s not to say the conversion won’t work, but there is still the penalty of inefficient code and technical debt.
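To make that concrete, here’s an illustrative sketch of the rename-then-sort pattern collapsing into a single set-based statement. The table and column names are made up, and sqlite3 simply stands in for Spark SQL so the snippet is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (cust TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("b", 20.0), ("a", 10.0), ("c", 30.0)],
)

# A line-by-line SAS conversion might produce three notebook cells:
#   1. a DATA step copying the table,
#   2. a step that only renames columns,
#   3. a PROC SORT ordering the rows.
# In SQL, all three collapse into one statement:
rows = conn.execute(
    "SELECT cust AS customer_id, amt AS sale_amount "
    "FROM raw_sales ORDER BY customer_id"
).fetchall()

print(rows)  # [('a', 10.0), ('b', 20.0), ('c', 30.0)]
```

The same SELECT would run unchanged as a Spark SQL cell – which is the point: one declarative statement instead of a chain of procedural steps.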
Automatic conversions I have seen still require additional human effort. And if you have SAS reading from other SAS datasets and transformations, that can get quite complicated.
Pros: Code can be converted reasonably fast
Cons: Some refactoring is still required. There is also technical debt and inefficiency – and possibly errors, depending on how well the code is known.
3. Same as 2 but using LLMs
This is where things get a little interesting. With the appropriate training and skills, an LLM can convert that SAS code. But you could take it one step further and ask it to explain the code, and maybe make it more efficient too. It’s the next logical step.
I’ve had a quick go at converting SAS using Genie Code as is, on its own – just to see what happens – and it does an OK job. But it isn’t perfect and hallucinates easily. So building the skills with the appropriate prompts would be the next step.
There are better examples out there. I have seen one that creates a Databricks app using Streamlit and claims a 70% to 90% successful conversion rate.
However, I fully expect this approach to become the main route for conversions in the future, as these skills start to be distributed through open-source solutions.
Pros: Assuming the skills are in place, the code can be converted reasonably fast. Cheaper than option 2.
Cons: The skills need to be created first. You still have the technical debt, even if you do have a decent explanation of what is happening.
4. Refactor the whole thing
So even if you have done option 1, and option 2 or 3, above – that technical debt is still there. The code was written many years ago, when markets and products were different. So does that SAS dataset still give the insight that you need – or does it contain errors? Assuming you know what it does, of course! And if the answer is yes – then happy days.
But if the answer is no – this is where the inevitable happens: start again and refactor! Well, it’s probably not as bad as it sounds. With agentic AI already handling a lot of basic engineering, most of the effort comes down to modelling the questions that people ask. We can also build up the business semantics so that we give consistent answers and optimize the data for the world of AI as well.
The downside, of course, is that this could take a lot of effort if you have thousands of SAS scripts to convert. But put in the time now and you get the advantages in the future. Create a project and a roadmap – and try to avoid distractions. The biggest challenge is explaining to the business why this is important; the savings from decommissioning SAS should only be part of the story.
Pros: You know what the data is doing, with decent semantics and no technical debt.
Cons: This can take a lot of time and effort to do.
Conclusion
At the end of the day, no business is the same. It’s all about choosing the right option for you. But the effort to convert is getting smaller as the data world is getting smarter.
That is just my experience of it all. I haven’t seen anyone want to manually convert a SAS program – but it is of course possible.
If you have a different experience, or even another solution, I’d love to hear it. Let me know. And if you're considering your options and want some expert input from our team, get in touch.
Author
Paul McComish