Recreating Random Distributions without analysis - Science, Math and Philosophy Forum

10-15-2020 , 09:47 AM

Morphismus

BBV Poker Player of 2022

Join Date: Feb 2008 Posts: 24,268

Say you observe certain random values, e.g. in an experiment, and want to re-create that kind of randomness in a simulation of said experiment. What you can do is first find out which distribution your observations follow (say normal, exponential, Poisson, etc.), and then compute the parameters for that distribution (like μ and σ for a normal distribution) that best fit your observations. Now you can generate random numbers for your simulation that behave very much like the real observations in the experiment.
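
As a rough illustration of the fit-then-generate approach described here, the sketch below estimates μ and σ from a sample and draws new values from the fitted normal distribution. The helper name `fit_normal` and the synthetic "observed" data are assumptions made up for the example; a real analysis would first check that a normal distribution is appropriate at all.

```python
import random
import statistics


def fit_normal(observed):
    """Estimate (mu, sigma) of a normal distribution from observed values."""
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # sample standard deviation
    return mu, sigma


rng = random.Random(0)

# Stand-in for real measurements (assumption: roughly normal data).
observed = [rng.gauss(10.0, 2.0) for _ in range(5000)]

mu, sigma = fit_normal(observed)

# Generate simulated values from the fitted distribution.
simulated = [rng.gauss(mu, sigma) for _ in range(5000)]
```

The same pattern applies to other families, for example estimating the rate of an exponential distribution as 1 divided by the sample mean and drawing with `rng.expovariate`.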

There are tools that help you find such a distribution and its parametrization; so far so good. But what if you just can't get close enough that way? It's probably possible to use a combination of different distributions to get there, but I am wondering if there is a method which skips the analysis altogether and deals with the problem in sort of a numerical, brute-force manner.

That is, is there an algorithm which, given N observed values as input, can generate random values that are distributed (reasonably) similarly to the input values, whatever that distribution might be?

Last edited by Morphismus; 10-15-2020 at 10:12 AM.

10-15-2020 , 02:05 PM

stremba70

adept

Join Date: Aug 2020 Posts: 1,105

Probably not the most elegant or efficient solution, but should give something close, especially if you have a fairly large experimental data set:

1. Determine the range of your experimental data.
2. Divide that range into some number of equal subintervals. For example, if the data range from 0 to 100, perhaps you would divide it into 0-5, 5-10, ..., 95-100.
3. Determine the relative frequency with which the experimental data fall in each subinterval, e.g. 0.02 for 0-5, 0.04 for 5-10, etc.
4. Generate a random number x between 0 and 1. Use the cumulative frequencies from step 3 to determine which subinterval that number selects. For instance, 0<=x<=0.02 would correspond to the 0-5 subinterval, 0.02<x<=0.06 to the 5-10 subinterval, etc.
5. Generate a random number for your simulated data set that falls in the subinterval selected in step 4. Repeat steps 4 and 5 until you have generated the desired number of random numbers.

By playing with the size of the subintervals you should be able to approximate the experimental distribution to a reasonable level of accuracy. I have no idea how difficult this would be to implement in practice, but in theory it should work.
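
The five steps above could be sketched roughly as follows, using only Python's standard library. The function names `build_histogram` and `sample_from_histogram`, the choice of 50 bins, and the exponential stand-in data are assumptions for the example and do not come from the thread. The sketch also assumes the data are not all identical (otherwise the bin width would be zero) and draws uniformly within the selected bin.

```python
import bisect
import random


def build_histogram(data, num_bins):
    """Steps 1-3: range, equal-width bins, cumulative relative frequencies."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    cumulative = []
    running = 0
    for c in counts:
        running += c
        cumulative.append(running / len(data))
    return lo, width, cumulative


def sample_from_histogram(lo, width, cumulative, rng):
    """Steps 4-5: pick a bin by frequency, then a uniform value inside it."""
    u = rng.random()
    # First bin whose cumulative frequency exceeds u; empty bins are skipped.
    idx = min(bisect.bisect_right(cumulative, u), len(cumulative) - 1)
    left = lo + idx * width
    return rng.uniform(left, left + width)


rng = random.Random(1)

# Stand-in for experimental data (assumption: exponential-like values).
data = [rng.expovariate(1.0) for _ in range(10000)]

lo, width, cumulative = build_histogram(data, num_bins=50)
simulated = [sample_from_histogram(lo, width, cumulative, rng)
             for _ in range(10000)]
```

More bins follow the data more closely but leave fewer observations per bin, so the bin count is the main knob to tune.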

10-15-2020 , 03:29 PM

Morphismus

BBV Poker Player of 2022

Join Date: Feb 2008 Posts: 24,268

Quote:

Originally Posted by stremba70

Thanks; that's essentially based on the histogram, right? Yeah, I think that should work to a certain extent, although I'm a bit skeptical about the fixed bin widths, but I guess that could be tweaked. The thing is, I thought there was some standard statistical method for this, but I'm not able to find anything.

10-16-2020 , 06:11 AM

masque de Z

Carpal 'Tunnel

Join Date: Aug 2009 Posts: 9,961

How many data points are we talking about here? Dozens, hundreds, thousands, millions? Do you know the potential set of distribution functions the data could be coming from, or is it completely unknown? Do you start knowing basically nothing? If you have many data points, can't you plot a histogram with a sufficiently small bin size and then do a numerical Fourier series fit of the resulting step function, or get some polynomial fit or some polynomial-times-exponential fit? Then use the resulting series as your probability density function (then integrate, etc.) to properly simulate a new set of points?

Last edited by masque de Z; 10-16-2020 at 06:19 AM.
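
The "then integrate" step in the post above can be sketched as inverse transform sampling: tabulate the fitted density on a grid, integrate it numerically into a CDF, and map uniform random numbers back through that CDF. The fitting itself is not shown. The function `fitted_pdf` below is a made-up polynomial-times-exponential curve standing in for whatever fit one actually obtains, and `make_sampler`, the interval [0, 20], and the grid size are likewise assumptions for illustration.

```python
import bisect
import math
import random


def fitted_pdf(x):
    """Hypothetical fit result: (1 + 0.5 x) * exp(-x), not normalized."""
    return (1.0 + 0.5 * x) * math.exp(-x)


def make_sampler(pdf, lo, hi, grid_points=2001):
    """Tabulate pdf on [lo, hi], integrate to a CDF, return a sampler."""
    step = (hi - lo) / (grid_points - 1)
    xs = [lo + i * step for i in range(grid_points)]
    # A fitted series can dip below zero; clip so the CDF stays monotone.
    ys = [max(pdf(x), 0.0) for x in xs]
    cdf = [0.0]
    for i in range(1, grid_points):
        cdf.append(cdf[-1] + 0.5 * (ys[i - 1] + ys[i]) * step)  # trapezoid rule
    total = cdf[-1]
    cdf = [c / total for c in cdf]  # normalize so the CDF ends at 1

    def sample(rng):
        u = rng.random()
        # First grid index whose CDF value exceeds u.
        i = min(max(bisect.bisect_right(cdf, u), 1), grid_points - 1)
        span = cdf[i] - cdf[i - 1]
        frac = (u - cdf[i - 1]) / span if span > 0 else 0.0
        return xs[i - 1] + frac * step  # linear interpolation in the cell

    return sample


rng = random.Random(2)
sample = make_sampler(fitted_pdf, 0.0, 20.0)
simulated = [sample(rng) for _ in range(20000)]
```

Because the sampler normalizes the tabulated curve itself, the fitted function does not need to integrate to exactly 1.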

10-16-2020 , 06:42 AM

masque de Z

Carpal 'Tunnel

Join Date: Aug 2009 Posts: 9,961

For fun https://en.wikipedia.org/wiki/Johnso...U-distribution

10-16-2020 , 08:48 AM

Morphismus

BBV Poker Player of 2022

Join Date: Feb 2008 Posts: 24,268

Quote:

Originally Posted by masque de Z

How many data points are we talking about here? Dozens, hundreds, thousands, millions?

thousands to millions (network requests)

Quote:

Originally Posted by masque de Z

Do you know the potential set of distribution functions the data could be coming from, or is it completely unknown? Do you start knowing basically nothing?

It's very close to an exponential distribution, but not quite; it tends to produce more small values.

Quote:

Originally Posted by masque de Z

If you have many data points, can't you plot a histogram with a sufficiently small bin size and then do a numerical Fourier series fit of the resulting step function, or get some polynomial fit or some polynomial-times-exponential fit? Then use the resulting series as your probability density function (then integrate, etc.) to properly simulate a new set of points?

Isn't that essentially what stremba suggested? The thing is I already came up with something which is similar to what you suggested, and seems to work, but it feels like someone should have come up with that already, as it looks somewhat fundamental. While I have a mathematics degree, Statistics & Probability were never my strong suit, so I hoped the Statistics-savvy 2+2ers here might recognize some standard solution to this problem. Maybe it's just an application not many people need?

10-17-2020 , 02:04 PM

BrianTheMick2

Long way to go and a short time to get there.

Join Date: May 2012 Posts: 19,410

What are you going to be using this random generated stuff for?

10-17-2020 , 03:12 PM

Morphismus

BBV Poker Player of 2022

Join Date: Feb 2008 Posts: 24,268

Basically to create variations of previous real situations and then test software against them. Say the software has to handle events that occur essentially randomly, and I have data on those events from previous field usage; I can then create synthetic events in a simulation to test e.g. how the software handles n times the load with the same distribution, how it scales, etc.

10-17-2020 , 06:56 PM

BrianTheMick2

Long way to go and a short time to get there.

Join Date: May 2012 Posts: 19,410

Quote:

Originally Posted by Morphismus

Ok, cool. My kid tries to break things at work too.

10-22-2020 , 07:05 PM

Aaron W.

Carpal 'Tunnel

Join Date: Sep 2002 Posts: 30,132

Why not just use the existing data set as your distribution, as if you're doing a bootstrapping sample?

Or do that plus add some amount of random variation. It can be normally distributed noise (pick an appropriately small standard deviation) and you'll still have roughly the same overall distribution.

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
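
A minimal sketch of this resampling idea, assuming the observations fit in memory as a list: draw values from the data with replacement, and optionally add a little normal noise to each draw. The helper name `bootstrap_sample`, the `noise_sd` parameter, and the stand-in data are invented for the example.

```python
import random


def bootstrap_sample(data, n, rng, noise_sd=0.0):
    """Draw n values from data with replacement, optionally jittered."""
    draws = rng.choices(data, k=n)  # sampling with replacement
    if noise_sd > 0:
        # Normal noise smooths the sample but can push values outside the
        # observed range (e.g. below zero for durations), so keep it small.
        draws = [x + rng.gauss(0.0, noise_sd) for x in draws]
    return draws


rng = random.Random(3)

# Stand-in for observed data (assumption: exponential-like values).
data = [rng.expovariate(1.0) for _ in range(5000)]

plain = bootstrap_sample(data, 1000, rng)
noisy = bootstrap_sample(data, 5000, rng, noise_sd=0.05)
```

Without noise, every generated value is one that was actually observed; with noise, the result is a smoothed version of the empirical distribution.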

10-24-2020 , 05:26 AM

Morphismus

BBV Poker Player of 2022

Join Date: Feb 2008 Posts: 24,268

Quote:

Originally Posted by Aaron W.

Thanks, that is interesting! I wasn't even aware of resampling...
