Skip to content

add nobs #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

add nobs #3

wants to merge 2 commits into from

Conversation

CarloLucibello
Copy link

@CarloLucibello CarloLucibello commented Jul 28, 2021

Over at https://github.com/JuliaML/LearnBase.jl
We are using LearnBase.getobs and StatsBase.nobs to create a generic api for datasets. It would be nice to remove the dependence from StatsBase there and rely on StatsAPI instead.

Notice that the docstring in StatsBase denotes that the method is meant to query a model instead of a dataset, so I add a docstring here to expand the interpretation

@nalimilan
Copy link
Member

Sorry for the delay. It would indeed make sense to move all empty function definitions related to statistic models from StatsBase to StatsAPI.

However, regarding the particular case of nobs, I'm kind of worried by the definitions at https://github.com/JuliaML/LearnBase.jl/blob/d31c281ac7f0a4608569c3eeab5b2c7682f3bd6d/src/observation.jl#L146-L174 StatsBase.nobs(data::AbstractArray; obsdim::Union{Int,Nothing}=nothing) and StatsBase.nobs(data::Union{Tuple, NamedTuple, AbstractDict}; obsdim ::Union{Int,Nothing} = default_obsdim(data)) are technically type piracy so generally speaking they deserve a thorough discussion to ensure that definitions are OK for all relevant packages.

What makes the problem more complex is that for Tables.jl, vectors of named tuples are tables (with one row per entry), which conflicts at least conceptually with the method defined for any AbstractArray. Even though Tables.jl doesn't define nobs and doesn't plan to currently, some types like DataFrame could decide to implement it if we extend that method beyond model objects. Also, in Tables.jl but also in other stats packages (Statistics, Distances, GLM...), observations are stored as rows, but AFAICT default_obsdim(data) defaults to 2 (observations as columns) for arrays.

Is that right? To what extent are these particular aspects important for LearnBase? It would be nice to find a common ground to avoid ecosystem fragmentation.

Cc: @quinnj @bkamins @kleinschmidt

@bkamins
Copy link

bkamins commented Sep 10, 2021

I agree that it would make sense to define nobs as a common function. However, it requires a very careful consideration to do it properly. Intuitively such a function should be coupled with two functions like (names are tentative and on purpose long to make clear what I mean):

  • isobservationcollection (testing if some object provides a set of observations)
  • getobservationsiterator (returning a generator of observations of length nobs)
    Then, when I get some object I am sure how it can be interpreted and this interpretation can be queried without contextual knowledge.

@nalimilan
Copy link
Member

#4 added all modeling functions, including nobs. But we still need to find a common definition for it.

@CarloLucibello
Copy link
Author

We stopped pirating nobs and now have a numobs in https://github.com/JuliaML/MLUtils.jl since it seemed tricky and too fundamental to find a solution satisfying every need. Custom types can overload both, or we can have numobs fall back to nobs at some point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants