phs318u 4 hours ago

Wow. Everything old is new again. I built a business state machine for a bespoke application using Oracle 8i and their stateful queues back in 2005. I had re-architected a batch-driven application (which couldn't scale temporally, i.e. we had a bunch of CPU sitting near idle a lot of the time) and turned it into an event-driven solution. CPU usage became almost a horizontal line, saving us lots of money as we scaled (for the record, "scale" for this solution was writing 5M records a day into a partitioned table where we kept 13 months of data online, and then billed on it). Durable execution was just one of the many benefits we got out of this architecture. Love it.

  • the_mitsuhiko 3 hours ago

    It's quite funny in a way for me because even back in the Cadence days I thought it was the hottest shit ever, but it was just too complex to run for a small company, and Cadence was not the first (SWF and others came before). It felt like unless you had really large workflows you would ignore these systems entirely. And now, due to the problems that agents pose, we're all in need of that.

    I'm happy it's slowly moving towards mass appeal, but I hope we find some simple solutions like Absurd too.

mfrye0 11 hours ago

I've been keeping an eye on this space for a while as it matures a bit further. A number of startups have popped up around this; apart from Temporal and DBOS, Hatchet.run looked interesting.

I've been using BullMQ for a while with distributed workers across K8s and have hacked together what I need, but a lightweight DAG of some sort on Postgres would be great.

I took a brief look at your docs. What would you say is the main difference of yours vs. some of the other options? Just the simplicity of it being a single SQL file and an SDK wrapper? Sorry if the docs answer this already; trying to take a quick look between work.

  • the_mitsuhiko 11 hours ago

    > I took a brief look at your docs. What would you say is the main difference of yours vs. some of the other options? Just the simplicity of it being a single SQL file and an SDK wrapper? Sorry if the docs answer this already; trying to take a quick look between work.

    It's really just trying to be as simple as possible. I was motivated to do the simplest thing I could come up with, after not really finding the other solutions to be something I wanted to build on.

    I'm sure they are great, but I want to leave the window open to having people self-host what we are building / enable us to deploy a cellular architecture later, and thus I want to stick to a manageable number of services until I can no longer. Postgres is a known quantity in my stack, and the only Postgres-only solution was DBOS, which unfortunately did not look ready for prime time yet when I tried it. That said, I noticed that DBOS is making quite some progress, so I'm somewhat confident that it will eventually get there.

    • jedberg 9 hours ago

      Could you provide some more specifics as to why DBOS isn’t “ready for prime time”? Would love to know what you think is missing!

      FWIW DBOS is already in production at multiple Fortune 500 companies.

      • the_mitsuhiko 3 hours ago

        > Could you provide some more specifics as to why DBOS isn’t “ready for prime time”? Would love to know what you think is missing!

        Some time in September I was on a call with Qian Li and Peter Kraft and gave them direct feedback. The initial reason this call happened was a complaint of mine [1] about excessive dependencies in the Python client, which was partially remedied [2], but I felt it should be possible to offload complexity away from the clients even further. My main experiments were pretty frustrating when I originally tried it, because everything in the client is global state (you attach to the DBOS object), which did not work with how I was setting up my app. I also ran into challenges with passing through async, and I found the step-based retry not to work for me.

        (There are also some other oddities in the Python client in particular: why does the client need to know about Flask?)

        In the end, I just felt it was a bit too early and I did not want to fight that part of the infrastructure too much. I'm sure it will get there, and I'm sure it has happy users. I was just not one of them.

        [1]: https://x.com/mitsuhiko/status/1958504241032511786

        [2]: https://x.com/qianl_cs/status/1971242911888281608

      • biasafe_belm 8 hours ago

        I'd love to hear both of your thoughts! I'm considering durable execution and DBOS in particular and was pretty happy to see Armin's shot at this.

        I'm building/architecting a system which will have to manage many business-critical operations on various schedules. Some will be daily, some bi-weekly, some quarterly, etc. Mostly batch operations and ETL, but they can't fail. I have already designed a semblance of persistent workflow in that any data ingestion and transformation is split into atomic operations whose results are persisted to blob storage and indexed in a database for cataloguing. This means that, for example, network requests can be "replayed", and data transformation can be resumed at any intermediate step. But this is enforced at the design stage, not runtime like other solutions.

        My system also needs to be easily auditable and written in Python. There are many, many ways to build this (especially if you include cloud offerings) but, like Armin, I'm trying to find the simplest architecture possible so our very small team can focus on building and not maintaining.
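        The design-stage persistence described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual system: `fetchOrReplay`, `BlobStore`, and the keys are made-up names, and an in-memory map stands in for blob storage plus the catalog database.

```typescript
// Hypothetical sketch: raw responses are persisted under a stable key, so a
// failed pipeline can be re-run and "replay" network requests from storage
// instead of hitting the network again.
type BlobStore = Map<string, string>;

async function fetchOrReplay(
  store: BlobStore,                        // stands in for blob storage + catalog
  key: string,                             // stable key, e.g. "ingest/2024-06-01/orders"
  fetchRaw: () => Promise<string>,         // the real network call
): Promise<string> {
  const cached = store.get(key);
  if (cached !== undefined) return cached; // replay from storage
  const raw = await fetchRaw();
  store.set(key, raw);                     // persist before any transformation
  return raw;
}

// Transformations read only persisted inputs, so they can be resumed at any
// intermediate step without repeating the network request.
function transform(raw: string): number[] {
  return raw.split(",").map(Number);
}
```

        Because the raw payload is checkpointed before any transformation runs, a crash mid-transform resumes from storage rather than the network, which is the same property the runtime-enforced solutions provide.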

stevefan1999 2 hours ago

Does anyone have a new approach to this kind of transactional workflow? I've heard that the Saga pattern also defines invertibility, but I want a more general framework that does all of this in one.

Also, I noticed that durable execution actually has a lot to do with continuation-passing style. Is my intuition correct?

saadatq 12 hours ago

Somebody said this the other day on HN, but we really are living in the golden age of Postgres.

eximius 5 hours ago

This is pretty great! The two main things you need for durable execution are 1) retries (Absurd does this) and 2) idempotency (Absurd does this via steps, though it would be better handled by the APIs themselves being idempotent, with no steps needed; Absurd would certainly _help_ mitigate some APIs not being idempotent, but not completely).

  • the_mitsuhiko 3 hours ago

    > idempotency (absurd does this via steps - but would be better handled with the APIs themselves being idempotent, then not using steps

    That is very hard to do with agents, which are probabilistic through and through. However, if you do have an API that is idempotent or uses idempotency keys, you can derive an idempotency key from the task: const idempotencyKey = `${ctx.taskID}:payment`;

    That said: many APIs that support the idempotency-key header only support replays for an hour to 24 hours, so for long-running workflows you need to capture the state output anyway.
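    Combining both ideas from above might look like the following sketch. The `chargeOnce` helper and the in-memory `StepState` map are hypothetical stand-ins (the map represents the task's persisted step results); only the `${ctx.taskID}:payment` key derivation comes from the comment above.

```typescript
// Hypothetical sketch: derive the provider idempotency key from the durable
// task ID, and capture the API response as step state, since many providers
// only honor Idempotency-Key replays for roughly 1-24 hours.
type StepState = Map<string, unknown>;

async function chargeOnce(
  taskId: string,
  state: StepState, // stands in for the task's persisted step results
  charge: (idempotencyKey: string) => Promise<{ id: string }>,
): Promise<{ id: string }> {
  const stepKey = `${taskId}:payment`;
  // Replay within our own history: return the captured output, never re-call
  // the API (this works regardless of the provider's replay window).
  if (state.has(stepKey)) return state.get(stepKey) as { id: string };
  // First execution: the provider-side idempotency key guards against double
  // charges if we crash between the call and the checkpoint below.
  const receipt = await charge(stepKey);
  state.set(stepKey, receipt); // checkpoint the output for long-horizon replays
  return receipt;
}
```

    The provider's idempotency key covers the short crash window around the call itself; the captured step output covers replays weeks later, after the provider has forgotten the key.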

rodmena 10 hours ago

Armin, I managed to review absurd.sql and the migrations. I am so impressed that I am rewriting the state management of my workflow engine with Absurd. Just wanted to thank you for sharing it with us. I'll keep you posted on the outcome.

motoboi 11 hours ago

Restate was built for agents before agents were cool.

Surprisingly, it hasn't taken off yet, when agents are all we are looking for now.

oulipo2 3 days ago

Really cool! How does it compare to DBOS? https://docs.dbos.dev/architecture

  • the_mitsuhiko 3 days ago

    I'm sure with time DBOS will be great; I just did not have a lot of success with it when I tried it. It's quite complex, the quality of the SDKs was not overly amazing (when I initially used it, it had a ton of dependencies), and it just felt early.

oulipo2 3 days ago

Other question: why reimplement your own framework, rather than use an existing agent framework like Claude + MCP, or OpenAI + tool calling? Is it because you're using your own LM models, or just because you wanted more control over retries, etc.?

  • the_mitsuhiko 3 days ago

    There are not that many agent frameworks around at the moment. If you want to be provider-independent, my guess is you most likely use either Pydantic AI or the Vercel AI SDK. Neither one has a built-in solution for durable execution, so you end up driving the loop yourself. So it's not that I don't use these SDKs; it's just that I need to drive the loop myself.

    • oulipo2 3 days ago

      Okay, very clear! I was saying that because your post example is just a kind of basic "tool use" example, which is already implemented by MCP/OpenAI tool use, but obviously I guess your code can be suited to more complex scenarios.

      Two small questions:

      1. in your README you give this example for durable execution:

      const shipment = await ctx.awaitEvent(`shipment.packed:${params.orderId}`);

      I was just wondering, how does it work? I was more expecting a generator with a `yield` statement to run "long-running tasks" in the background... otherwise, is the Node runtime keeping the thread alive on the await? Doesn't this "pile up"?

      2. would your framework be suited to long-running jobs with multiple steps? I have sometimes big jobs running in the background on all of my IoT devices, eg:

      for each d in devices: doSomeWork(d)

      and I'd like to run the big outer loop each hour (say), but only if the previous one is complete (e.g. max number of workers per task = 1), and have the inner loop be some "steps" that can be cached, but retried if they fail

      would your framework be suited for that? or is that just a simpler use-case for pgmq and I don't need the Absurd framework?

      • the_mitsuhiko 3 days ago

        > Okay very clear! I was saying that because your post example is just a kind of basic "tool use" example which is already implemented by MCP/OpenAI tool use, but obviously I guess your code can be suited to more complex scenarios

        That's mostly just because I found that to be the easiest way to get any existing AI API to work. There are things like Vercel's AI SDK, which internally runs the agentic loop in generateText, but then there is no way to checkpoint that.

        > I was just wondering, how does it work? I was more expecting a generator with a `yield` statement to run "long-running tasks" in the background... otherwise is the node runtime keeping the thread running with the await? doesn't this "pile up"?

        When you `awaitEvent` or `sleepUntil`/`sleepFor`, it sets a wake point or a re-schedule in the database. Then it raises `SuspendTask` and ends the execution of the task temporarily, until it's rescheduled.
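        A rough sketch of that mechanism (hypothetical names, with in-memory maps standing in for the database rows):

```typescript
// Hypothetical sketch of suspend-based event waiting: awaiting an event that
// has not arrived registers a wake point and throws, which ends the current
// execution. When the event is delivered, the task is rescheduled and re-run,
// and this time awaitEvent returns the recorded payload.
class SuspendTask extends Error {}

class TaskRun {
  private events = new Map<string, unknown>(); // delivered event payloads
  private wakePoints = new Set<string>();      // stands in for rows in Postgres

  awaitEvent<T>(name: string): T {
    if (this.events.has(name)) return this.events.get(name) as T; // replay path
    this.wakePoints.add(name);  // persist the wake point
    throw new SuspendTask();    // end this execution; no thread keeps waiting
  }

  deliver(name: string, payload: unknown): boolean {
    this.events.set(name, payload);
    return this.wakePoints.delete(name); // true => reschedule the task
  }
}
```

        So nothing piles up: between the throw and the redelivery there is no running code or held thread, only rows describing where to resume.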

        As for your IoT case: yes, you should be able to do that.
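        For the IoT loop, the shape could look something like the following sketch. The `step` helper and in-memory `Checkpoints` map are hypothetical stand-ins for persisted step results, and single-flight (max one concurrent run of the outer task) is assumed to be enforced by the queue/scheduler, not by this code.

```typescript
// Hypothetical sketch: an hourly job whose per-device inner steps are
// checkpointed, so a retry of the outer task skips devices already processed.
type Checkpoints = Map<string, unknown>;

async function step<T>(cp: Checkpoints, key: string, fn: () => Promise<T>): Promise<T> {
  if (cp.has(key)) return cp.get(key) as T; // done in a previous attempt: reuse
  const out = await fn();
  cp.set(key, out); // checkpoint before moving to the next device
  return out;
}

async function hourlySweep(
  cp: Checkpoints,
  devices: string[],
  doSomeWork: (d: string) => Promise<string>,
): Promise<string[]> {
  const results: string[] = [];
  for (const d of devices) {
    // Each device is its own step: a failure fails the task, and the retry
    // resumes at the first device without a checkpoint.
    results.push(await step(cp, `device:${d}`, () => doSomeWork(d)));
  }
  return results;
}
```

        If a device mid-loop fails, the retried run replays the completed devices from their checkpoints and only re-executes the failed one onward.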

    • jedberg 3 days ago

      > If you want to be provider independent you most likely either use pydantic AI ... Neither one have built-in solution for durable execution

      PydanticAI has DBOS built in [0].

      [0] https://ai.pydantic.dev/durable_execution/dbos/

      • the_mitsuhiko 3 days ago

        Oh interesting, maybe this makes for a better example then. If it has DBOS and Temporal it must be exposing some way to drive the loop. I'll investigate.

andrewstuart 9 hours ago

Reminder that Postgres does not have a monopoly on SKIP LOCKED

You can do that in Oracle, SQL server and MySQL too.

In fact, you might be able to replicate what Armin is doing with SQLite, because it too works just fine as a queue, though not via SKIP LOCKED.

SrslyJosh 12 hours ago

Durable execution paired with an unpredictable text generator? Sign me up! /s