Multithreading

11-20-2016 , 04:30 PM
I have a general question about multithreading. My intuitive picture is that it's assigning tasks to a bunch of workers so that they can work on things in parallel and thus take less time overall. For example, if there are 1000 screws to tighten, it takes much less time for a team of 100 workers to do it than it would take for just one.

But when it comes to how this is actually accomplished within the framework of a processor, what's actually happening, and how does one figure out the limits of the benefit? For example, it seems that it would be possible to try to have more workers than the CPU can actually handle, so that you don't actually benefit because of some sort of processor bottleneck. Where do the workers find the processing power to get their jobs done? And is there a way of estimating the time savings in advance?

More specific to the project I'm working on, I'm just playing around with making a program to search for solutions to a Rubik's cube. (I can already solve it readily, and I'm just playing around with the programming.) Since each set of moves can be checked independently of each other, this seems like a reasonable multithreading task. I'm doing this on a Raspberry Pi (because why not), so there are definitely hardware speed issues. Right now, I get 15,000-20,000 checks per second (in a space with 10^19 positions), so the individual tasks themselves are very short. My intuitive understanding tells me that this might be a case where the worker has finished the job before I've even assigned the task to the next one. But I don't even know if that's the right idea.
Multithreading Quote
11-20-2016 , 04:46 PM
I don't know that there's so much a way to "figure it out" as much as there is to try it.

To avoid obscuring the issue too much: you can run more than one thread on a single-CPU machine. Whether or not this will give a benefit depends on whether your task is CPU-bound or not. For example, lots of programs spend a lot of time doing stuff like talking over the network, writing to or reading from disk, or just waiting. These might benefit from very high degrees of parallelization. Other tasks might be nearly CPU-bound, and more than one thread per CPU would not yield a benefit.
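The CPU-bound vs. I/O-bound distinction can be demonstrated with a short sketch (the workloads are made up for illustration: the `sleep` stands in for network/disk waits, and the arithmetic loop stands in for real computation):

```python
import threading
import time

def cpu_bound(n=200_000):
    # Busy arithmetic: the thread needs the CPU the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound():
    # Stand-in for waiting on the network or disk.
    time.sleep(0.2)

def run_threads(target, count):
    threads = [threading.Thread(target=target) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Four sleeping threads overlap their waits: ~0.2 s total, not ~0.8 s.
# Four CPU-bound threads contend for the processor (and, in CPython, the GIL),
# so on one core they take about as long as running the work serially.
print(f"I/O-bound: {run_threads(io_bound, 4):.2f}s")
print(f"CPU-bound: {run_threads(cpu_bound, 4):.2f}s")
```

The I/O-bound case shows near-perfect overlap; the CPU-bound case shows little or none on a single core.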

There's also a cost per thread in RAM and switching overhead. These days that is perhaps not such an issue.

A rule of thumb I often heard was 2x the number of CPUs, but obviously this can't be anything more than a guess. Usually 1x the number of CPUs is a reasonable *minimum*, although even that is not a given.

How the tasks are scheduled, how memory is passed between them, and whether/how locking is done can have a big effect also. I have seen multi-threaded implementations that performed worse than their single-threaded versions because of problems like these. There are also architecture decisions: for example, in modern multiprocessor machines it's pretty common for some of the CPUs to share cache RAM. It's best to put processes or threads that communicate via RAM on cores that share a cache, since they benefit from the locality.

A few jobs ago we had an extremely parallel system (thousands of processes across hundreds of machines per cluster), and we had several people whose nearly full-time job was to look for problems with how we were apportioning and managing workers.

All this said, it's very rare to get N times the performance by multiplying the number of available CPUs or cores by N; each additional CPU tends to have a diminishing return. Maybe you're lucky to get 2x the performance out of 4 cores, for example. So in a sense it doesn't matter *that* much to get the number perfect, because being off by a little might not affect your overall timing that much.
11-20-2016 , 07:18 PM
Quote:
Originally Posted by RustyBrooks
I don't know that there's so much a way to "figure it out" as much as there is to try it.

To avoid obscuring the issue too much: you can run more than one thread on a single-CPU machine. Whether or not this will give a benefit depends on whether your task is CPU-bound or not. For example, lots of programs spend a lot of time doing stuff like talking over the network, writing to or reading from disk, or just waiting. These might benefit from very high degrees of parallelization. Other tasks might be nearly CPU-bound, and more than one thread per CPU would not yield a benefit.
Thanks for the reply. My program is definitely CPU-intensive. The steps are to perform a permutation and then check that permutation against a list of about 40,000 "known" positions. Those positions are stored in RAM, so there's not a lot of time spent reading data off of the hard drive.

As I was thinking about this some more, it occurred to me to open two separate SSH sessions to the computer and run two separate instances of the program. This probably isn't multithreading per se, but at the level of the OS, at least, it's asking the system to manage two of the same task simultaneously. Both programs are running noticeably slower.

Flying on my intuition alone: if the task involved a lot of down time, then both programs ought to have been running at roughly the same speed as when only one was logged in. But since it's CPU-intensive and trying to run two instances of it is really slowing things down, is that a bad sign for multithreading? Or am I doing something that's too different to make for a fair comparison?
11-20-2016 , 07:28 PM
You're doing both of these on the raspi? Do you know which incarnation it is? The earliest ones were single core (and really very slow)

If you're running 2 separate instances and each of them is slower than normal, but the machine has several cores, then that may be a sign that the program relies on some kind of resource that is not always available; I can't readily speculate on that. I would recommend benchmarking it and trying to answer the question "how many units can I do per minute with 1 process, how many with 2, how many with 4?" The number of units per process may decrease, but the overall throughput can still go up.

There are really lots of factors - programming language and methodology are going to matter, for example.
11-20-2016 , 07:41 PM
Quote:
Originally Posted by RustyBrooks
You're doing both of these on the raspi? Do you know which incarnation it is? The earliest ones were single core (and really very slow)
Yes. It's a Model B+ from about two years ago.

Quote:
I would recommend benchmarking it and trying to answer the question "how many units can I do per minute with 1 process, how many with 2, how many with 4?" The number of units per process may decrease, but the overall throughput can still go up.
That's a good idea. I'll try to set that up later tonight.

Quote:
There are really lots of factors - programming language and methodology are going to matter, for example.
Well, that was going to be my next line of questions if it was determined that I might actually see a realistic benefit from doing this. I know that there are some inefficiencies in my existing code, so I was going to focus on cleaning those up, first. That, and I've only seen a brief tutorial on threading and am likely going to need to spend some more time doing some reading and thinking about how to make this work right. There are only a couple steps, so it shouldn't be too complicated.

If there's anything immediately useful/important that can be said, I'm writing this in Python3.
11-20-2016 , 07:46 PM
A model 1 B+? They are slow as hell, and I think single-core. The raspi 2s are like 10x faster and the raspi 3s are about 2x faster than that. Pretty sure the 3s are quad core, not sure about the 2s.
11-20-2016 , 08:23 PM
Quote:
Originally Posted by RustyBrooks
A model 1 B+? They are slow as hell, and I think single-core.
I was already aware of the slow as hell part. I'm definitely not breaking any speed records. But that was never really the goal.

I've never worked with multi-threading before, and the thought came to me somewhat randomly as I was waiting for it to finish a test run. But maybe this is just something I'll learn at some point in the future. I've got other projects to play with and random things that I can learn about.

Thanks for the input.
11-20-2016 , 08:45 PM
It's probably something that would benefit from multi-threading, but not on a machine with just 1 CPU. If one thread already uses ~100% of a CPU there is nothing to be gained from a 2nd thread
11-20-2016 , 09:44 PM
Quote:
Originally Posted by RustyBrooks
It's probably something that would benefit from multi-threading, but not on a machine with just 1 CPU. If one thread already uses ~100% of a CPU there is nothing to be gained from a 2nd thread
Well, since it's still rolling around in my head, I might as well ask and learn something, even if I'm not going to implement it right away.

The program itself is pretty straightforward. I've got the check process (the one that gets repeated over and over) broken down into 3 basic steps:

1) Generate a permutation to apply to the unsolved cube. This is done algorithmically, so that any given permutation has a well-defined "next" permutation. This generation is therefore a sequential task and can't be multithreaded.
2) Apply the permutation to the unsolved cube. (We can make lots of copies of the cube, so the different threads won't need to be accessing the same information.)
3) Check whether the resulting cube configuration matches with something in a database of known configurations. (There's just one database, but the values are all fixed so data corruption can't happen there.)

The normal pattern is to loop around from 1-3 over and over again. There are two places where the multithreading seems to fit.

A) Multithread the calculation of the permutations
B) Multithread the checking step

I have three different conceptual pictures in my head.

P1) I have one line of workers, and each of them checks in after step 1. Each worker does steps 2 and 3 with their given permutation. (Each worker gets their own cube, twists it a few times, and then walks through the entire library to see if there's a match.)
P2) I have one line of workers, and each of them checks in after step 2. Each worker does step 3 but checks a different part of the list. (This is like giving copies of the cube to the workers, and each worker checks a different section of the library for a match.)
P3) I have two lines of workers. One checks in after step 1 and the other checks in after step 2. So one group is constantly generating arrangements of cubes to check, and the other group is running to the library to check. (One group is constantly making a stack of cubes, and then a team of people takes one cube at a time, each checking a different part of the library.)

The first and second pictures seem roughly identical conceptually: I already have the task written out and would just branch out from that point forward. The third picture seems like it would require a much more sophisticated change to the programming, because I'm now creating two distinct types of tasks and the latter is dependent upon the completion of the former. I assume that the third picture is possible, but maybe it's not? I don't even know if it's a good idea, because unless it's perfectly tuned there will always be a backlog: either the workers will be waiting for more cubes to check, or the cubes will be piling up. But again, this is just the picture in my head.
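Picture P1 can be sketched in a few lines — `apply_permutation` and `KNOWN` below are made-up stand-ins for the real cube code, and a process pool plays the role of the line of workers:

```python
from multiprocessing import Pool

# Made-up stand-ins for the real cube code:
KNOWN = {"config-17"}  # the library of known configurations (step 3)

def apply_permutation(perm):
    # Step 2 stand-in: apply permutation number `perm` to a fresh copy
    # of the unsolved cube and return the resulting configuration.
    return f"config-{perm}"

def check(perm):
    # One worker does steps 2 and 3 for its assigned permutation (picture P1).
    return perm if apply_permutation(perm) in KNOWN else None

if __name__ == "__main__":
    with Pool(4) as pool:
        # Step 1 stays sequential: the main process generates permutations in order.
        hits = [p for p in pool.imap_unordered(check, range(100)) if p is not None]
    print(hits)  # [17]
```

Each worker process gets its own copy of the cube state, so there's no shared mutable data to protect.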

In practice, is there a rule of thumb for deciding where/how to implement multi-threading? Here's what I believe about the program.

* I suspect the actual processes involved in steps 1 and 2 are slow. But since step 1 is sequential, it can't be multithreaded.
--- Step 2: Each turn of the cube involves changing either 20 or 40 variables. Each permutation will have more and more steps as the search depth gets deeper. Even if the search is only 4-5 moves deep, there are already going to be 100+ variable manipulations involved.
* The individual process of checking at step 3 is fast.
--- Step 3: If the cube is correctly solved, it will check 48 different positions. But the vast majority of the time it will check just a couple positions before finding it's not a match, and the program will immediately leave the check process and return a negative match. This part is only slow because there are tens of thousands of positions in the library.

My intuition is that the gain is likely better in the slower processes because of the overhead cost of just routing the workers. Is this a reasonable way of thinking about it?
11-20-2016 , 10:03 PM
You will get the best performance and the simplest implementation if the threads don't need to coordinate much with each other. P3 would be kind of a nightmare I think.

I don't know a lot about rubik's cube solving but it's possible your algorithm should be reconsidered for threading use. One thing that comes to mind is to consider the starting point, and make a list of every possible single move. For each of those, make a list of each possible single move. I don't know how many moves there are but now you have a list of dozens to hundreds of configurations. Parcel these out one at a time to each thread and then have that thread continue as normal.
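That parceling-out idea might look roughly like this as a sketch (`search_from` is a made-up stand-in for "continue as normal" from a given opening move, and the branching factor of 30 is illustrative):

```python
from multiprocessing import Pool

FIRST_MOVES = list(range(45))  # one entry per possible opening move

def search_from(first_move, depth=3):
    # Made-up stand-in for "continue as normal": count the positions
    # reachable from this opening move with a branching factor of 30.
    checked, width = 0, 1
    for _ in range(depth):
        width *= 30
        checked += width
    return checked

if __name__ == "__main__":
    with Pool(4) as pool:
        # Parcel the opening moves out one at a time to the pool's workers;
        # each worker searches its whole subtree independently.
        counts = pool.map(search_from, FIRST_MOVES)
    print(sum(counts))
```

The point of the structure is that the workers never need to talk to each other: each opening move defines a disjoint subtree.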

Like I said, I dunno much about the problem space so maybe that doesn't make any sense.
11-20-2016 , 11:18 PM
Quote:
Originally Posted by RustyBrooks
You will get the best performance and the simplest implementation if the threads don't need to coordinate much with each other. P3 would be kind of a nightmare I think.
Cool. At least my intuition is in the right ballpark.

Quote:
I don't know a lot about rubik's cube solving but it's possible your algorithm should be reconsidered for threading use.
It's on my "to do eventually" list. Learning about multithreading is just for educational purposes, as I don't do enough programming in my job to necessitate the development of this skill. I'm just playing around because it's interesting. It's a climbing the mountain because it's there sort of thing.

If I were to try to do this as a multithreaded project, I'd probably use my work computer and run things overnight or something like that.

Quote:
One thing that comes to mind is to consider the starting point, and make a list of every possible single move. For each of those, make a list of each possible single move. I don't know how many moves there are but now you have a list of dozens to hundreds of configurations.
That's pretty much what I did to build the list of known configurations. If you consider a "move" to be some combination of turning the cube around one of the three axes, there are 45 different moves (15 along each axis). But because I don't want to repeat two moves in a row on the same axes, there are only 45*30 = 1350 moves at the next level, then 45*30*30 = 40,500 moves three deep. The next one up is about a million positions. The total space of configurations is on the order of 43 quintillion. So any brute force method (which is what I'm doing) is necessarily going to be slow.

One improvement that's going to save me a few percent on each search: right now it checks all positions in the database, but once I know that the original position isn't in the database, I really only need to be looking at the three-deep moves. Basically, it's like a bubble around the solved position: I can't get inside the bubble without crossing the boundary, so I should really only be checking the boundary. That's an immediate 3-4% improvement right there. My current list of known positions only goes 3 deep and uses up about 7 megs (that's the size of the file when I pickle it), so it probably could handle going up another factor of 30. (I think there's only a half gig of RAM, so I couldn't go any bigger without using hard drive space, which would then introduce the time it takes to read data as a new timing variable.)
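The 3-4% estimate checks out directly from the move counts given above:

```python
# Positions in the library at each depth, per the counts above.
depth_counts = [45, 45 * 30, 45 * 30 * 30]  # 45, 1350, 40500
total = sum(depth_counts)                   # 41,895 positions checked today
frontier = depth_counts[-1]                 # the depth-3 "boundary" only
savings = 1 - frontier / total
print(f"{savings:.1%}")  # 3.3%, consistent with the 3-4% estimate
```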

What I'm actually planning to do with it when I get it to a good place is that it will only check some of the pieces to see if there are better algorithms than the ones that I use to do certain things, or maybe find new algorithms to do things in a different order or something like that. (Someone else has probably already done it, but that's not the point.)

Quote:
Parcel these out one at a time to each thread and then have that thread continue as normal.

Like I said, I dunno much about the problem space so maybe that doesn't make any sense.
What that would look like (at least how I've got things set up) would be 45 threads out of the gate. They would definitely be checking different branches, so there's not going to be much redundancy until you get several moves out. It would be interesting to implement it and just compare the results. Eventually...

For now, I'm just enjoying the process of problem-solving and picking up a few odds and ends along the way. Again, I appreciate your insights.
11-20-2016 , 11:26 PM
Oh, are you using Python? (I think so because of your mention of pickling.) Python multithreading is very problematic because of something called the "global interpreter lock," which essentially prevents 2 commands from running at the same time in different threads (this is a little simplistic, but the takeaway is that MT is not great in Python).

There are some alternative approaches that use the same or similar interfaces, such as the "multiprocessing" library (which simulates multithreading via multiple processes) and "greenlets," which simulate (sort of) multithreading within one process. Greenlets are more like a single-threaded thing with multiple tasks running at once within it: whenever one of them blocks, the other runnable ones continue. It's great for stuff where you're I/O- or time-bound.

If you have N pieces of work, you don't necessarily need N threads. A common method of splitting up work is to make some threads and give each of them access to a job queue. As long as there is a job in the queue, each worker takes one off, processes it, and maybe posts a result back to a result queue. So you can have 5 worker threads and a queue 100 jobs deep, and they just keep plugging until the queue is empty. Making 100 threads is really unlikely to get the job done faster.
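That pattern, sketched with the standard-library queue and threading modules (squaring a number stands in for the real job; given the GIL, CPU-bound Python work would want the multiprocessing equivalents, but the queue shape is the same):

```python
import queue
import threading

def worker(jobs, results):
    # Each worker keeps pulling jobs until it sees the sentinel.
    while True:
        job = jobs.get()
        if job is None:            # sentinel: no more work for this thread
            break
        results.put(job * job)     # stand-in for the real processing

jobs, results = queue.Queue(), queue.Queue()

# 5 worker threads and a queue 100 jobs deep, as described above.
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(5)]
for t in threads:
    t.start()
for n in range(100):
    jobs.put(n)
for _ in threads:
    jobs.put(None)                 # one sentinel per worker so each one exits
for t in threads:
    t.join()

print(results.qsize())  # 100
```

The sentinel-per-worker trick is one common way to shut the pool down cleanly once the queue drains.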
11-21-2016 , 02:24 PM
Quote:
Originally Posted by RustyBrooks
Oh, are you using Python? (I think so because of your mention of pickling.)
Yeah. I mentioned I was using Python3 earlier, but you probably just missed it in the wall of text that I created.

Quote:
Python multithreading is very problematic because of something called the "global interpreter lock," which essentially prevents 2 commands from running at the same time in different threads (this is a little simplistic, but the takeaway is that MT is not great in Python).
Good to know. This is literally what I'd be doing. I've got everything written as function calls.

Quote:
If you have N pieces of work, you don't necessarily need N threads. A common method of splitting up work is to make some threads and give each of them access to a job queue. As long as there is a job in the queue, each worker takes one off, processes it, and maybe posts a result back to a result queue. So you can have 5 worker threads and a queue 100 jobs deep, and they just keep plugging until the queue is empty. Making 100 threads is really unlikely to get the job done faster.
This wouldn't work for starting different threads down different search paths. The problem is that the search paths are quite deep and run until they find the solution (which could be a long time). If there are 45 starting paths but I only have 10 workers, there will basically be 35 unsearched paths, because the workers would never return home to start searching down a new path.

(Sort of... With enough moves, the search will circle back to catch those other paths. But the shortest loop to get back there that I'm aware of is 6 moves deep, which is almost 1 billion checks later.)

But it would work for sending workers to do these shorter checks because they'll be able to come back home for another assignment.
11-21-2016 , 02:28 PM
Hm, then I guess I'd recommend breaking the starting points into N packages and having each thread skip between them. This has problems too, though.

I honestly have not tried starting a ****load of threads recently - I know some modern OSes have gotten better at that, but threads used to be very expensive memory-wise. In most cases I have gotten away with a dozen or less threads.

BTW, although threading is great, for real performance clustering is kind of where it's at. The problems of communicating between threads get even worse (because the threads may now be on different machines), but the number of distinct workers you can run is limited only by your computing resources. I remember using our dev cluster at work in its off time to compute poker stuff (the dev cluster was around 100 high-performance machines connected with gigabit LAN connections).
02-07-2017 , 02:11 AM
In Mac OS X (and iOS) you can do it very simply (in Objective-C):

[NSThread detachNewThreadSelector:@selector(mySubroutine:) toTarget:myObject withObject:objectToSend];

The system will then run the "mySubroutine" method of the object myObject in a new thread, passing it objectToSend as an argument.

As said, only multicore, HyperThreading and/or multiple processor machines will really benefit.

      