Polars Cloud: The Distributed Cloud Architecture to Run Polars Anywhere (pola.rs)
noworriesnate 2 days ago [-]
Every time I build something complex with dataframes in either R or Python (Pandas, I haven't used Polars yet), I end up really wishing I could have statically typed dataframes. I miss the security of knowing that when I change common code, the compiler will catch if I break a totally different part of the dashboard for instance.

I'm aware of Pandera[1], which supports Polars as well, but while nice, it doesn't cause the code to fail to compile; it only fails at runtime. To me this is the Achilles heel of analysis in both Python and R.

Does anybody have ideas on how this situation could be improved?

[1] https://pandera.readthedocs.io/en/stable/
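
For illustration, the runtime-only failure mode described above looks roughly like this with Pandera's Polars integration (a sketch; the model and column names are made up):

    # Passes any static type checker, but fails when executed --
    # which is exactly the limitation described above.
    import pandera.polars as pa
    import polars as pl

    class Quotes(pa.DataFrameModel):
        bid: float
        ask: float

    df = pl.DataFrame({"bid": [1.0, 2.0], "ask": ["a", "b"]})  # wrong dtype for "ask"
    Quotes.validate(df)  # raises a SchemaError at runtime, not at compile time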

chrisaycock 2 days ago [-]
Statically typed dataframes are exactly why I created the Empirical programming language:

https://www.empirical-soft.com

It can infer the column names and types from a CSV file at compile time.

Here's an example that misspells the "ask" column as if it were plural:

  let quotes = load("quotes.csv")
  sort quotes by (asks - bid) / bid
The error is caught before the script is run:

  Error: symbol asks was not found
I had to use a lot of computer-science techniques to get this working, like type providers and compile-time function evaluation. I'm really proud of the novelty of it and even won Y Combinator's Startup School grant for it.

Unfortunately, it didn't go anywhere as a project. Turns out that static typing isn't enough of a selling point for people to drop Python. I haven't touched Empirical in four years, but my code and my notes are still publicly available on the website.

noworriesnate 1 day ago [-]
Wow this is amazing!! Thanks for sharing!

I love how you really expanded on the idea of executing code at compile time. You should be proud.

You probably already know this, but for people like me to switch, "all" it would take would be:

1. A plotting library like ggplot2 or plotnine

2. A machine learning library, like scikit

3. A dashboard framework like streamlit or shiny

4. Support for Empirical in my cloud workspace environment, which is Jupyter-based and where I have to execute all the code, because that's where the data is and has to stay for security reasons

Just like how Polars is written in Rust and has Python bindings, I wonder if there's a market for 1 and 2 written in Rust and then having bindings to Python, Empirical, R, Julia etc. I feel like 4 is just a matter of time if Empirical becomes popular, but I think 3 would have to be implemented specifically for Empirical.

I think the idea of statically typed dataframes is really useful and you were ahead of your time. Maybe one day the time will be right.

theLiminator 2 days ago [-]
Does this require that the file is available locally or does it do network io at compile time?
chrisaycock 2 days ago [-]
The inferencing logic needs to sample the file, so (1) the file path must be determined at compile time and (2) the file must be available to be read at compile time. If either condition doesn't hold---for example, if the filename is a runtime parameter---then the user must supply the type in advance.

There is no magic here. No language can guess the type of anything without seeing what the thing is.

briankelly 2 days ago [-]
Scala Spark - a bit absurd if you don't need the parallelism, though. Most of the development can be done simply in quick compilation iterations or copied from the sbt REPL. Python/pandas feels Stone Age in comparison - you absolutely waste a lot of time iterating with run-time testing.
Centigonal 2 days ago [-]
Why Scala Spark over PySpark?
smu3l 2 days ago [-]
Scala (and Java) has a typed Dataset API.[0] PySpark only provides the DataFrame API, which is not typed.

[0] https://spark.apache.org/docs/latest/sql-programming-guide.h...

Centigonal 1 day ago [-]
thanks!
akdor1154 2 days ago [-]
The pandas mypy stubs attempt to address this to some extent, but to be honest... it's really painful. It's not helped by pandas' hodgepodge API design, to be fair, but I think even a perfect API would still be annoying to statically type. Imagine needing to annotate every function that takes a dataframe with 20 columns...

A tantalising idea I have not explored is to try to hook up Polars' lazy query planner to a static typing plugin. The planner already has basically complete knowledge of the schema at every point, right?

So in theory this could provide the strong inference abilities that a static typing system needs to be nice to use.
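
As a rough sketch of that planner-knows-the-schema point (assuming a recent Polars with LazyFrame.collect_schema(); the file name is hypothetical), a misspelled column can already be caught before the query runs:

    import polars as pl

    lf = pl.scan_csv("quotes.csv").with_columns(
        spread=(pl.col("asks") - pl.col("bid")) / pl.col("bid")  # "asks" is misspelled
    )
    lf.collect_schema()  # raises ColumnNotFoundError while resolving the plan's schema

A static-typing plugin would need to surface that same schema resolution to the type checker instead of at runtime.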

theLiminator 2 days ago [-]
It depends; the schema is resolved at runtime, so there's no way to have a truly "compile-time" static schema (unless you specify a schema upfront).
TheTaytay 2 days ago [-]
TheTaytay 2 days ago [-]
I agree with this so much! I recently started using Patito, which is a typesafe, Pydantic-based library for Polars. I'm not really deep into it yet, but I prefer Polars syntax and the extra functions that Patito adds to the dataframes. (https://patito.readthedocs.io/en/latest/)

Otherwise, it feels so broken to just pass a dataframe around. It’s like typing everything as a “dict” and hoping for the best. It’s awful.
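
A minimal sketch of the Patito pattern described above (model and column names made up; the exact exception class depends on the Patito version):

    import patito as pt
    import polars as pl

    class Order(pt.Model):
        order_id: int = pt.Field(unique=True)
        amount: float

    df = pl.DataFrame({"order_id": [1, 2], "amount": [9.99, 5.00]})
    Order.validate(df)  # runtime validation against the model; raises on mismatch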

enugu 2 days ago [-]
Polars is also usable as a Rust library, so one could use that for static typing. I wonder what the downsides are - maybe losing access to the Python data science libraries.
antonvs 1 day ago [-]
Polars dataframes in Rust are still dynamically typed. For example:

    let df = df![
        "name" => ["Alice", "Bob", "Charlie"],
        "age" => [25, 30, 35]
    ]?;

    let ages = df.column("age")?;
There’s no Rust type-level knowledge of what type the “age” or “name” column is, for example. The result of df.column is a Series, which has to be cast to a Rust type based on the developer’s knowledge of what the column is expected to contain.

You can do things like this:

    let oldies = df.filter(&df.column("age")?.gt(30)?)?;
So the casting can be automatic, but this will fail at runtime if the age column doesn’t contain numeric values.

One type-related feature that Polars does have: because the contents of a Series are represented as Rust values, all values in a Series must have the same type. This is a constraint compared to traditional dataframes, but it provides a performance benefit when processing large series: you can cast an entire Series to a typed Rust value efficiently, and then operate on the result in a typed fashion.

But as you said, you can’t use Python libraries directly with Polars dataframes. You’d need conversion and foreign function interfaces. If you need that, you’d probably be better off just using Python.

lmeyerov 1 day ago [-]
Pandas, Dask, etc. also have runtime-typed columns (dtypes), which is even stronger in pandas 2 and when used with Arrow for data-representation typing for interop/IO. (That's half of the performance trick of Polars.)

And yeah, my issue with all of these is that, lacking dependent typing or an equivalent for row types, it's hard for mypy and friends to statically track that individual columns exist and have specific types. And even if we are willing to be explicit about wrapping each DF with a manual definition, basically an Arrow schema, I don't think any of these libraries make that convenient? (And is that natively supported by any?)
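
To the "manual definition" point: Polars does let you declare an explicit schema and compare against it, but it is still a runtime check rather than something mypy can see (a sketch, assuming a recent Polars with pl.Schema; names are illustrative):

    import polars as pl

    EXPECTED = pl.Schema({"user_id": pl.Int64, "score": pl.Float64})

    df = pl.DataFrame({"user_id": [1, 2], "score": [0.5, 0.9]})
    assert df.schema == EXPECTED  # runtime assertion, invisible to static type checkers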

In louie.ai, we generate python for users, so we can have it generate the types as well... But we haven't found a satisfactory library for that so far...

enugu 1 day ago [-]
Thanks, I am in the process of choosing a dataframe library and just naively assumed that the Rust interface would be statically typed.
dharmatech 2 days ago [-]
Frames is a type safe dataframe library for Haskell:

https://hackage.haskell.org/package/Frames

ants_everywhere 2 days ago [-]
I agree, and I suspect there are large numbers of unknown bugs in a lot of data frame based applications.

But to do it right you'd need a pretty good type system because these applications implicitly use a lot of isomorphisms between different mathematical objects. The current solution is just to ignore types and treat everything as a bag of floats with some shape. If you start tracking types you need a way to handle these isomorphisms.

jamesblonde 1 day ago [-]
If you use a feature store to store your DataFrames (most provide APIs for storing Polars, Pandas, PySpark DataFrames in backing Lakehouse/real-time DBs), then you get type checks when writing data to the DataFrame's backing Feature Group (Lakehouse + real-time tables).

Many also add an additional layer of data validation on top of schema validation, using frameworks like Great Expectations. For example, it's not enough to know that 'age' is an integer; it should be an integer in the range 0..150.
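
The kind of range rule described above can be expressed directly as a runtime check (a plain-Polars sketch for illustration; a feature store or Great Expectations suite would wrap the same rule in its own API):

    import polars as pl

    df = pl.DataFrame({"age": [25, 42, 200]})
    bad = df.filter(~pl.col("age").is_between(0, 150))
    assert bad.height == 0, f"{bad.height} rows have out-of-range ages"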

Disclaimer: I work for Hopsworks.

Centigonal 2 days ago [-]
It's really not the same as inbuilt strict typing, but we addressed this issue by running all of our "final" data products through a Great Expectations[1] suite that was autogenerated from a YAML schema.

[1] https://docs.greatexpectations.io/docs/core/introduction/gx_...

Starlord2048 2 days ago [-]
I can appreciate the pain points you guys are addressing.

The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.

I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.

The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.

Anyway, I would love to apply for the early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.

scrlk 2 days ago [-]
Ibis also solves this problem by providing a portable dataframe API that works across multiple backends (DuckDB by default): https://ibis-project.org/
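
A minimal sketch of that portability idea (table and column names made up; executed by the default DuckDB backend):

    import ibis

    t = ibis.memtable({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
    expr = t.filter(t.age > 28).select("name")
    print(expr.to_pandas())  # the same expression could target another backend

Swapping the backend changes where the query runs, not how it is written.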
ritchie46 1 day ago [-]
Disclosure: I am the author of Polars and this post. The difference with Ibis is that Polars Cloud will also manage hardware. It is similar to Modal in that sense. You don't need a running cluster to fire a remote query.

The other difference is that we focus only on Polars and honor the Polars semantics and data model. Switching backends via Ibis doesn't honor this, as many architectures have different semantics regarding NaNs, missing data, their ordering, decimal arithmetic behavior, regex engines, type upcasting, overflow, etc.

And lastly, we will ensure it works seamlessly with the Polars landscape; that means Polars plugins and IO plugins will also be first-class citizens.

TheTaytay 1 day ago [-]
It’s funny you mention Modal. I use Modal to do fan-out processing of large-ish datasets. Right now I store the transient data in DuckDB on Modal, using Polars (and sometimes Ibis) as my API of choice.

I did this, rather than using Snowflake, because our custom Python "user-defined functions" that process the data are not deployable on Snowflake out of the gate, and the ergonomics of shipping custom code to Modal are great, so I'm willing to pay a bit more complexity to ship data to Modal in exchange for those great dev ergonomics.

All of that is to say: what does it look like to have custom Python code running on my Polars Cloud in a distributed fashion? Is that a solved problem?

ritchie46 6 hours ago [-]
Yes, you can run

`pc.remote(my_udf, schema)`

Where

`def my_udf() -> DataFrame`

We link the appropriate Python version at cluster startup.

ZeroTalent 2 days ago [-]
I've played around a bit with Ibis for some internal analytics stuff, and honestly it's pretty nice to have one unified API for DuckDB, Postgres, etc. It saves you from a ton of headaches switching context between different query languages and syntax quirks. But like you said, performance totally depends on the underlying backend, and sometimes that's a mixed bag: DuckDB flies, but certain others can get sluggish with more complex joins and aggregations.

Polars Cloud might have an advantage here since they're optimizing directly around Polars' own Rust-based engine. I've done a fair bit of work lately using Polars locally (huge fan of the lazy API), and if they can translate that speed and ergonomics smoothly into the cloud, it could be a real winner. The downside is obviously potential lock-in, but if it makes my day-to-day data wrangling faster, it might be worth the tradeoff.

Curious to see benchmarks soon against Dask, Ray, and Spark for some heavy analytics workloads.

theLiminator 2 days ago [-]
My experience with it is that it's decent, but a "lowest-common denominator" solution. So you can write a few things agnostically, but once you need to write anything moderately complex, it gets a little annoying to work with. Also a lot of the backends aren't very performant (perhaps due to the translation/transpilation).
Starlord2048 2 days ago [-]
Wow, Ibis supports nearly 20 backends; that's impressive.
codydkdc 2 days ago [-]
without locking you into a single cloud vendor ;)
0cf8612b2e1e 2 days ago [-]
I’ll bite- what’s the pitch vs Dask/Spark/Ray/etc?

I am admittedly a tough sell when the workstation under my desk has 192GB of RAM.

orlp 2 days ago [-]
Disclaimer: I work for Polars Inc, but my opinions are my own.

If you have a very beefy desktop machine and no giant datasets, there isn't a strong reason to use Polars Cloud.

Are you a data scientist running a Polars data pipeline against a subsampled dataset in a notebook on your laptop? By changing just a couple of lines of code, you can run that same pipeline against your full dataset on a beefy cloud machine that is automatically spun up and spun down for you. If you have so much data that one machine doesn't cut it, you can start running distributed.

In a nutshell, the pitch is very similar to Dask/Ray/Spark, except that it's Polars. A lot of our users say that they came for the speed but stayed for the API, and with Polars Cloud they can use our API and semantics on the cloud. No need to translate it to Dask/Ray/Spark.
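
As a rough, hypothetical sketch of what that "couple lines of code" change could look like (based only on the ComputeContext mentioned elsewhere in this thread; the exact polars_cloud API and parameters are assumptions and may differ):

    import polars as pl
    import polars_cloud as pc

    lf = (
        pl.scan_parquet("s3://my-bucket/events/*.parquet")  # hypothetical dataset
        .group_by("user_id")
        .agg(pl.col("amount").sum())
    )

    # df = lf.collect()                           # local run against a subsample
    ctx = pc.ComputeContext(cpus=32, memory=256)  # assumed parameters
    df = lf.remote(context=ctx).collect()         # assumed remote-execution call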

dwagnerkc 1 day ago [-]
> they came for the speed but stayed for the API

This is exactly how I would describe my experience. When I talk to others about Polars now, I usually mention quickly up front that it's fast, but then mostly talk about the API; its composability, small surface area, etc. are really what make it great to work with. Having these same semantics backed by eager execution, a query-optimized lazy API, a streaming engine, a GPU engine, and now a distributed, auto-magical ephemeral-boxes-in-the-sky engine just makes it that much better of a tool.

gardnr 1 day ago [-]
Being both eager and lazy does make it sound magical.
fastasucan 2 days ago [-]
I think being able to run the same code locally and on the "cloud" is a great selling point. Developing on Spark feels hilariously ineffective.
benrutter 2 days ago [-]
Doesn't look like benchmarks are there yet, but knowing polars, I'd guess performance will be front and centre.

I think the best selling point speaks to your workstation size - just start with vanilla Polars. It'll work great for ages, and if you do need to scale, you can use Polars Cloud.

That solves what I see as one of the big issues with a lot of these types of projects, which is really poor performance at smaller sizes; in practice, you end up using completely different frameworks based on size, which is a big hassle if you ever need to rewrite in one direction.

__mharrison__ 2 days ago [-]
Yeah, you can process 99% of tabular workloads with that. I generally advise my clients to work on a single node before attempting to scale out.
film42 2 days ago [-]
I think this will be a hit with the big-name audit companies. I know some use Databricks for PySpark on the M&A side. As deals move forward and they get more data, they have to scale up their instances, which isn't cheap. If Polars enables serverless compute where you pay by the job, that could be a big win.

And sure, Databricks has an idle-shutdown feature, but suppose it takes ~6 hours to process the deal report, only the first hour needs the scaled-up power to compute one table, and the rest of the jobs need only 1/10th of the memory and cores. Polars could save these firms a lot of money.

serced 2 days ago [-]
May I ask what part in M&A needs this much data processing? I am quite familiar with the field but did not yet see such tasks.
lmeyerov 2 days ago [-]
I thought Databricks already has serverless? Or is it by the notebook, while this is by the job?
Centigonal 2 days ago [-]
Databricks supports serverless for both interactive notebooks and jobs.
tfehring 2 days ago [-]
The obvious one is that you can handle bigger workloads than you can fit in RAM on a single machine. The more important but less obvious one is that it right-sizes the resources needed for each workload, so you're not running an 8GB job on an 8TB machine, and your manually-allocated 8GB server doesn't OOM when that job grows to 10GB next year.
LaurensBER 2 days ago [-]
This is very impressive and definitely fills a huge hole in the whole data frame ecosystem.

I've been quite impressed with the Polars team, and after using pandas for years, Polars feels like a much-needed breath of fresh air. Very excited to give this a go sometime soon!

robertkoss 2 days ago [-]
robertkoss 2 days ago [-]
Love it! Competition for Databricks is always appreciated, and I think having a competitor that is not running on the JVM is amazing. Working with Polars always feels insanely lightweight compared to Spark. If you provided workflows/scheduling out of the box, I would migrate my Spark jobs today :)
jt_b 1 days ago [-]
jt_b 1 day ago [-]
Polars seems cool, but I'm not willing to invest in adopting it until geo support is more mature. I find I prefer to run most operations I'd use dataframe libraries for in local SQL via DuckDB anyway.
tfehring 2 days ago [-]
This is really cool, not sure how I missed it. I assume catalog support will be added fairly quickly. But ironically I think the biggest barrier to adoption will be the lack of an off-ramp to a FOSS solution that companies can self-host. Obviously Polars itself is FOSS, but it understandably seems like there's no way to self-host a backend to point a `pc.ComputeContext` to. That will be an especially tough selling point for companies that are already on Spark. I wonder how much they'll focus on startups vs. trying to get bigger companies to switch, and whether they'll try a Spark compatibility layer like DataFusion (https://github.com/apache/datafusion-comet).
orlp 2 days ago [-]
Disclaimer: I work for Polars Inc, but my opinions are my own.

Polars itself is FOSS and will remain FOSS.

Self-hosted/on-site Polars Cloud is something we intend to develop, as there is quite a bit of demand, but it is unlikely to be FOSS. It will most likely involve licensing of some sort. Ultimately we do have to make money, and we intend to do that through Polars Cloud, self-hosted or not (as well as other ventures such as offering training, commercial support, etc.).

tfehring 2 days ago [-]
Yep I totally get it and would probably go the same route in Polars' situation. Just sharing how some of the data teams I'm familiar with would likely be thinking about the tradeoffs.
Larrikin 2 days ago [-]
Larrikin 2 days ago [-]
As a hobbyist, I describe Polars as pandas if it had been designed for humans to use. It's great to use; I just hate running into issues when trying to use it. I wish them luck.
marquisdepolis 2 days ago [-]
This is very interesting; clearly there's a major pain point here to be addressed, especially the delta between local pandas work and distributed [PySpark] work!

Would love to test this out and do benchmarks against us / Dask / Spark / Ray, etc., which have been our primary testing ground. Full disclosure: I work at Bodo, which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.

efxhoy 2 days ago [-]
Looks great! Can I run it on my own bare metal cluster? Will I need to buy a license?
__mharrison__ 2 days ago [-]
Really excited for the Polars team. I've always been impressed by their work and responsiveness to issues I've filed in the past. The world is lifted when there is good competition like this.
whyho 2 days ago [-]
whyho 2 days ago [-]
How does this integrate with existing services like AWS Glue? I fear that despite Polars being good/better, it will lack adoption if it cannot easily be integrated.
th0ma5 2 days ago [-]
th0ma5 2 days ago [-]
I think this is the main problem with this: H2O, for example, offers Spark integration as well as its own clustering solution, but most people with this problem have their own opinionated and bespoke needs.
TheAlchemist 2 days ago [-]
TheAlchemist 2 days ago [-]
Having switched from pandas to Polars recently, I find this quite interesting, and I'd guess performance-wise it will be excellent.
melvinroest 2 days ago [-]
melvinroest 2 days ago [-]
I just got into data analysis recently (former software engineer) and tried out pandas vs. Polars. I like Polars way more because it feels like SQL, but sane, and it's faster. It's clear about what it's trying to do. I didn't really get that with pandas.
epistasis 2 days ago [-]
I've been doing data analysis for decades, and stayed on R for a long time because Pandas was so bad.

People complain about R, but compared to the multitude of import lice and unergonomic APIs in Pandas, R always felt like living in the future.

Polars is a much, much saner API, but expressions are very clunky for doing basic computation. Or at least I can't find anything less clunky than pl.col("x") or pl.lit(2) where in R it's just x or 2.

Still, I'm using Python a ton more now that polars has enough steam for others to be able to understand the code.

Centigonal 2 days ago [-]
R's data.table is still my favorite dataframe API, over pandas, Polars, and Spark DataFrames. Plotly has edged out ggplot2, but that took a long time.

IMO R is really slept on because it's limited to certain corners of academia, and that makes it seem scary and outdated to compsci folks. It's really a lovely language for data analysis.

minimaxir 2 days ago [-]
> Or at least I can't find anything less clunky than pl.col("x") or pl.lit(2) where in R it's just x or 2.

In many cases you can pass a string or numeric literal to a Polars function instead of pl.col (e.g. select()/group_by()).

Overall I agree it's less convenient than dplyr in the cases where pl.col is required, sure, but it's not terrible, and it has the benefit of making the code less ambiguous, which reduces bugs.
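
A quick example of that shorthand (illustrative column names):

    import polars as pl

    df = pl.DataFrame({"x": ["a", "a", "b"], "y": [1, 2, 3]})
    df.select("x", "y")                      # plain strings where a bare column is meant
    df.group_by("x").agg(pl.col("y").sum())  # pl.col still needed inside expressions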

epistasis 2 days ago [-]
I think compsci people can appreciate R as a language itself, because it has really beautiful language features. I think programmers hate it, because it's so different and lispy, with features that they can't really appreciate when coming from a C-style OOP mindset.
theLiminator 2 days ago [-]
I think if that's too painful, you can introduce a convention of:

    from polars import col as c, lit as l

For anything production, though, I just stick to pl.col and pl.lit as they're widely used.

minimaxir 2 days ago [-]
Coming from R, that introduces a different confusion problem, since there c() has a specific and common purpose: https://www.rdocumentation.org/packages/base/versions/3.6.2/...
epistasis 2 days ago [-]
Even then, the overhead of having an additional five characters per named variable is really unergonomic. I don't know of a way to get around it given Python's limited grammar and semantics without moving to something as Lispy as R.
orlp 1 day ago [-]
Two characters: if you do `from polars import col as c`, you can simply write `c.foo`, assuming the column name is a valid Python identifier.
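
For example (illustrative frame; this relies on the column names being valid identifiers):

    import polars as pl
    from polars import col as c

    df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    df.filter(c.foo > 1).select(c.bar)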
epistasis 1 day ago [-]
Oh that's very interesting, thanks!!
BrenBarn 2 days ago [-]
The thing with Polars is it's really hard for me to get past the annoyance of having to do `pl.col("blah")` instead of `df.blah`. I find pandas easier for quick interactive work which is basically everything I do with it.
ritchie46 1 day ago [-]
import polars.col as C

C.blah

prometheon1 1 day ago [-]
Thanks! I'm not sure if pl.col improved since the last time I looked at polars or if I was too lazy to find it, but pl.col (docs) look great!
minimaxir 2 days ago [-]
This may be a hot take, but there is now no reason to ever use pandas for new data analysis codebases. Polars is better in every way that matters.
latenightcoding 2 days ago [-]
pandas has been around for years and never tried to sell me a service.
theLiminator 2 days ago [-]
theLiminator 2 days ago [-]
Their (Polars) FOSS solution isn't at all neutered; IMO that's a bit of an unfair criticism. Yeah, they are trying to make their distributed query engine for-profit, but as a user of the single-node solution, I haven't been pressured at all to use their cloud solution.
melvinroest 2 days ago [-]
Sure, just wanted to give the perspective of a new person walking into this field. I'd agree, but I think there are a lot of data analysts that have never heard of polars.

Though, I guess they're not on this site :')

The-Ludwig 2 days ago [-]
The only thing I can think of is HDF5 support. That is currently stopping me from completely switching to Polars.
comte7092 2 days ago [-]
It’s a bit of a hot take, but not wildly outlandish either.

Pandas supports so many use cases and is still more feature-rich than Polars. But you always have the polars.DataFrame.to_pandas() function in your back pocket, so realistically you can always at least start with Polars.

babuloseo 2 days ago [-]
I applied :D just now hehehe
c7THEC2DDFVV2V 2 days ago [-]
who covers egress costs?
ritchie46 2 days ago [-]
The cluster runs in your own VPC.
otteromkram 2 days ago [-]
How is this not an advertisement? Does HN tag those or nah?
whalesalad 2 days ago [-]
Never understood these kinds of cloud tools that deal with big data. You are paying enormous ingress/egress fees to do this.
ritchie46 2 days ago [-]
ritchie46 2 days ago [-]
Disclosure: I wrote this post. The compute plane (cluster) will run in your own VPC.
tfehring 2 days ago [-]
That's almost certainly the main reason they're offering this on all 3 major public clouds from day 1.
tomnipotent 2 days ago [-]
> You are paying enormous ingress/egress fees to do this.

It looks like their offering runs on the same cloud provider as the client, so there are no bandwidth fees. Right now it looks to be AWS, but the post mentions Azure/GCP/self-hosted.

marxisttemp 2 days ago [-]
What does this project have to do with Serbia? They’re based in the Netherlands. They must have made a mistake when registering their domain name.
ritchie46 2 days ago [-]
Nothing. Polars -> pola.rs

The Polars name and a hint to the .rs file extension.

marxisttemp 1 day ago [-]
I’m aware. I personally wouldn’t want to tie my infrastructure, nor provide funding, to the government of Serbia at this particular juncture in geopolitical time, but hey, you gotta have a cutesy ccTLD hack or you aren’t webscale.