Programming homework and newbie help thread

08-01-2018 , 05:53 PM
There used to be big telephone book sized manuals for the major DB vendors but now it’s all online docs. You might try looking for text books or YouTube videos. I used a kindle book called Jump Start MySQL to brush up on joins and MySQL and it was pretty good, walked through setting up the DB locally and data sets for the queries they were teaching.
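For anyone wanting to poke at joins the same way without installing a DB server, Python's bundled sqlite3 module is enough for a throwaway setup. A minimal sketch (the table and column names below are invented for the example):

```python
import sqlite3

# In-memory DB so the example needs no setup; schema and data are made up
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sessions (player_id INTEGER, profit REAL);
    INSERT INTO players VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO sessions VALUES (1, 25.0), (1, -10.0), (2, 100.0);
""")

# JOIN matches each session row to its player row; GROUP BY sums per player
query = """
    SELECT p.name, SUM(s.profit) AS total
    FROM players p
    JOIN sessions s ON s.player_id = p.id
    GROUP BY p.name
    ORDER BY p.name
"""
for name, total in conn.execute(query):
    print(name, total)
```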
08-01-2018 , 06:46 PM
Quote:
Originally Posted by well named
I feel like this should exist, but I don't actually know of any that I really like. I learned the basics a long time ago, and since then I usually just cobble together whatever refresher I need from random StackOverflow questions and db engine documentation.
Yeah, I'm basically self-taught from forums, plus fundamentals that go back to learning Pascal and, earlier, BASIC (yep, we're all old). From what I've read it seems pretty straightforward, but until I start doing stuff in practice I probably won't know for sure
08-01-2018 , 06:48 PM
postgresql.org documentation is very good.
08-01-2018 , 06:51 PM
Most DBs have decent docs, but they aren't really how-tos, especially in terms of "how should I design my database" or "how should I write clear and performant queries".

I learned from an online guide called "SQL for Cave Men" by Philip Greenspun, originally on photo.net, but that seems to be long gone. I don't know what you'd use now. The SQL for Cave Men thing started from the ground up and had you designing schema and queries for a system sort of piece by piece. It was oracle-centric but it was really extremely useful.
08-01-2018 , 10:31 PM
I'm mostly familiar with Postgres but here are a few sites I've used to learn more about SQL:

https://sqlzoo.net/
https://sqlbolt.com/
http://www.postgresqltutorial.com/
https://pgexercises.com/
http://postgresguide.com/
https://use-the-index-luke.com/

Last edited by CBorders; 08-01-2018 at 10:47 PM.
08-02-2018 , 08:27 AM
I think codecademy still has an intro sql course.

Also, Odin Project’s sql portion is pretty informative iirc
08-02-2018 , 08:54 AM
There's no magic bullet for SQL. Learning the basic joins etc is all pretty easy, learning to do more complex things is a matter of learning various idioms, windowing functions, etc, which all takes experience.
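As a small taste of the windowing functions mentioned above, here's a sketch using Python's bundled sqlite3 (window functions need SQLite 3.25+, which recent Python builds ship with; the table and data are invented):

```python
import sqlite3

# Toy data for a running-total example
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (day INTEGER, profit REAL);
    INSERT INTO sessions VALUES (1, 25.0), (2, -10.0), (3, 40.0);
""")

# A running total: the classic window-function idiom, no self-join required
query = """
    SELECT day, profit,
           SUM(profit) OVER (ORDER BY day) AS running_total
    FROM sessions
"""
for day, profit, running_total in conn.execute(query):
    print(day, profit, running_total)
```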
08-02-2018 , 11:08 PM
SQL is on its way out anyway, don't worry
08-02-2018 , 11:55 PM
Quote:
Originally Posted by jmakin
SQL is on its way out anyway, don't worry
I'm such a dinosaur I don't even know if this is a funny or not
08-03-2018 , 01:27 AM
NoSQL is a bold new paradigm, it has reached a new and permanently high plateau.
08-03-2018 , 10:31 AM
Quote:
Originally Posted by ChrisV
NoSQL is a bold new paradigm, it has reached a new and permanently high plateau.
I've used several nosql platforms, but I don't really see them replacing SQL. Is that something that people really think will happen?

We use an in-house graph database for some stuff. I understand they are planning to rip it out and replace it with postgres.
08-03-2018 , 07:32 PM
I'm probably biased because of what I do/where I work, but RDBMSs typically don't scale well for today's computing needs, which are ever-growing and frequently require queries at near real-time speeds.

In theory, it can scale, but in practice, that's entirely different. There are a lot of reasons for this but I'll probably get called out for spouting BS. You can research it, there's a lot out there.
08-03-2018 , 08:56 PM
Quote:
Originally Posted by jmakin
I'm probably biased because of what I do/where I work, but RDBMSs typically don't scale well for today's computing needs, which are ever-growing and frequently require queries at near real-time speeds.

In theory, it can scale, but in practice, that's entirely different. There are a lot of reasons for this but I'll probably get called out for spouting BS. You can research it, there's a lot out there.
A RDBMS is the right tool for some jobs and not the right tool for others, sure. You don't use an array for everything, you don't use a hashmap for everything, etc.

But I think there are a lot of things best suited to a relational database that people are using nosql databases for, because they don't know, or because they're following a trend, or because they think it doesn't matter, or because it seems harder, or because it's a prototype, etc, etc.

I think people use nosql databases for a lot more wrong reasons than right ones.

ETA: I've been using "nosql" databases for longer than I've used sql ones - at least 25 years. I've written more than one myself, I've used several home grown ones, and I've used 2 modern professional strength ones. I've also spent a lot of the last 20 years using oracle, postgres, mysql, mssqlserver, sqlite, MS access, and informix. I may have left some off.

I'm currently working on a system that has tens of thousands of concurrent users and is expected to scale to millions of concurrent users. Tables have millions of rows. Most endpoint requests do somewhere between 1 and 5 reads, and most will do at least one write. Response times are around 5ms on very modest hardware, a single database server with no replication or clustering.

ETA 2: you know what, I forgot about redis and memcache so I guess 4 nosql databases.

Last edited by RustyBrooks; 08-03-2018 at 09:16 PM.
08-03-2018 , 10:05 PM
noSQL is such a loaded term because it can encompass such a vast variety of DBs - document based, like MongoDB, column based, like Druid, or pure K/V stores - I deal mostly with the latter - and you're right, YMMV depending on what your specific use case is.

For what I'm doing though 5ms would be glacially slow. We shoot for 30-50 microseconds for most operations even under really heavy loads. Total response time would definitely be only limited by the network bandwidth. That's what we shoot for.

You could build some type of SQL-based DBMS on top of our software - but I think it would still suck.
08-03-2018 , 10:33 PM
Yeah I kind of laughed at myself internally with the 5ms number (although that is end-to-end, the entire HTTP transaction, not a single query or even all the query times together. Individual queries are under 1ms, but I don't know how much under because our reporting infrastructure reports in integer multiples of ms)

2 jobs ago I worked for a HFT company and we were given the opportunity to be market makers in currency trading. We'd get a flow of potential trades, we had to fill a specific percentage of them, and we were given 1ms to make trade decisions. When I heard this, I laughed. 1ms? Did they want us to check our work 10 times? (That job, btw, had at least one in-house nosql database - a timeseries database - and probably at least one other.)
08-31-2018 , 10:30 AM
What's up guys, I am trying to write a web scraper to extract a bunch of old blog posts from my PGC threads, so I can post them to my wordpress site.

As a step 1, I'm trying to extract a list of URLs that correspond to individual posts that I made. Example URL in this screenshot:

[screenshot not preserved]
So, the URL I want is contained within a table. I can identify that table through its second tr tag -> td -> div -> a, which has the attribute href="..../230247" (this corresponds to my user number on the forum)

I've watched the sentdex beautifulsoup tutorial on youtube twice, but I'm struggling to figure out how to navigate through and extract this information. Here's what I have so far:

Code:
import bs4 as bs
import urllib.request
import re

url = 'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

##posts = []
##for post in soup.find_all('table', id=re.compile("post")):
##    posts.append(post.name)
##print(posts)
#### This returns a list of all the tables that are posts

##for url in soup.find_all('a', href=re.compile("230247")):
##    print(url.get('href'))
#### This finds all links that correspond to my user and prints the URLs

##my_urls = soup.find('a', href=re.compile("230247"))
##
##for parent in my_urls.parents:
##    if parent is None:
##        print(parent)
##    else:
##        print(parent.name)
#### This prints the navigation path upward from the relevant a tag (doesn't work with find_all, which returns a list of bs4 elements)
Any help/pointers in the right direction greatly appreciated

Cheers
08-31-2018 , 11:06 AM
I could do that in about 5 seconds in jQuery, in fact here it is:

let links = $("table[id^='post']").has("a[href$='members/230247/']").find("tr:first-child a[id^='postcount']");

Unfortunately I don't speak Python or beautifulsoup. What I don't know how to do is that "has" part, which filters the list of tables based on whether they have a descendant element of that type. Maybe someone can translate.
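Not a definitive translation, but BeautifulSoup's .select() speaks the same CSS attribute selectors, and the .has() step can be approximated with a list comprehension whose if-clause filters on select_one(). A sketch against invented stand-in markup (two fake post tables, only the first "mine"):

```python
import bs4

# Invented stand-in for two forum post tables
html = """
<table id="post111">
  <tr><td><a id="postcount111" href="showpost.php?p=111"></a></td></tr>
  <tr><td><a href="https://forumserver.twoplustwo.com/members/230247/">me</a></td></tr>
</table>
<table id="post222">
  <tr><td><a id="postcount222" href="showpost.php?p=222"></a></td></tr>
  <tr><td><a href="https://forumserver.twoplustwo.com/members/999999/">other</a></td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes the same CSS; the if-clause plays the role of jQuery's .has()
links = [
    table.select_one("a[id^=postcount]")
    for table in soup.select("table[id^=post]")
    if table.select_one("a[href$='members/230247/']")
]
print([a["href"] for a in links])
```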
08-31-2018 , 11:19 AM
One approach that might work fairly well is to just use the find_all() method to get all the elements that match your user profile URL, then use find_parent() to get to the table, and from there you should be able to call find() on the table to get the sub-element you want.
08-31-2018 , 11:25 AM
Or find_sibling
08-31-2018 , 11:50 AM
Thanks so much for your help guys. I was just coming back here to edit my post, having made something of a breakthrough:

Code:
import bs4 as bs
import urllib.request
import re

url = 'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')


all_my_urls = soup.find('a', href=re.compile("230247"))
for parent in all_my_urls.parents:
    if parent.name == 'table':
        post_link = parent.find('a', href=re.compile("post"))
        break

print(post_link.get('href'))
>>>
https://forumserver.twoplustwo.com/s...33&postcount=1

Unfortunately this breaks when I change soup.find to soup.find_all

But I'm thinking there'll be some hacky thing where I can just get this to run multiple times slicing through the finds..
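No slicing hack should be needed: find_all() hands back a list, so the parent walk just moves inside a loop over it, and find_parent('table') can replace the manual .parents loop entirely. A sketch against invented stand-in markup:

```python
import re

import bs4 as bs

# Invented stand-in markup: two post tables, both containing my profile link
html = """
<table id="post1"><tr><td>
  <a href="showpost.php?p=111&amp;postcount=1">1</a>
  <a href="https://forumserver.twoplustwo.com/members/230247/">me</a>
</td></tr></table>
<table id="post2"><tr><td>
  <a href="showpost.php?p=222&amp;postcount=2">2</a>
  <a href="https://forumserver.twoplustwo.com/members/230247/">me</a>
</td></tr></table>
"""
soup = bs.BeautifulSoup(html, "html.parser")

post_links = []
# find_all() returns a list, so the climb to the table happens once per match
for my_url in soup.find_all("a", href=re.compile("230247")):
    table = my_url.find_parent("table")  # replaces the manual .parents loop
    link = table.find("a", href=re.compile("postcount"))
    post_links.append(link.get("href"))

print(post_links)
```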
08-31-2018 , 03:42 PM
Note: I am very much an amateur, but the code below outputs raw postcounts of all your posts on the first page of that thread.

Code:
36956233
36962774
36972076
36994108
37115053
I was confused at first, because there are 25 of your posts on the first page, not only the 5 which the script outputs. But that's for me, logged into 2p2, viewing the forum at 100ppp. Your script will not be logged in and thus only view 25ppp. So it only finds 5 of your posts.

You'll have to cycle through the pages, and then append all the raw postnumbers to the url to scrape all your posts from that thread.

Code:
#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup


url = "https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/"
user_id = "230247"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

myPosts = []

for user in soup.find_all("a", {"class" : "bigusername"}):
    if user_id in user["href"]:
        myPosts.append(user.parent["id"].split("_")[-1])

for post in myPosts:
    print(post)

If the for-loop confuses you (especially the last line), here's what it does:

The loop walks over every "a" tag of class "bigusername", binding each one in turn to the variable "user".
We check if it's you (230247)
Then, to get only the raw post numbers, we go to this tag's parent, which looks like this:
Code:
<div id="postmenu_xxxxxxxx">
We take its id tag, split it into a list separated at the underscore, and then take this list's last item ([-1]) to append to our list "myPosts".
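The split step in isolation, if it helps:

```python
# "postmenu_36956233".split("_") gives ["postmenu", "36956233"]; [-1] is the last piece
print("postmenu_36956233".split("_")[-1])  # -> 36956233
```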

A final tip: Don't hardcode "url" and "user_id". Use sys.argv instead. You never know when you might want to scrape another thread, or another user's posts.
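A minimal sketch of the sys.argv idea (the argument order and the parse_args name are just one possible choice):

```python
import sys

def parse_args(argv):
    """Pull (url, user_id) off a command line like: scrape.py <url> <user_id>."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <thread_url> <user_id>")
    return argv[1], argv[2]

# Real use would be parse_args(sys.argv); demo with a fake argv here
url, user_id = parse_args(["scrape.py", "THREAD_URL", "230247"])
print(url, user_id)
```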
08-31-2018 , 03:52 PM
Oh, and one style note: PEP 8 actually recommends snake_case for variables, so "user_id" is already the "pythonic" spelling, if that's something you care about (camelCase like "userId" is more of a Java/JS convention).
08-31-2018 , 06:27 PM
God damn, it's beautiful! Thanks very much man, I've read and understood this (I think), gonna try and recreate it from scratch just now. Definitely need to look into OOP stuff, my plan is to eventually build this out into something that can loop through multiple pages of multiple threads (and be useful for multiple users)

Feels good to be a losing 2nl player again lol
08-31-2018 , 11:23 PM
Quote:
Originally Posted by d7o1d1s0
Definitely need to look into OOP stuff
I didn't use any OOP stuff above, if that's what you think. In any event, for a simple scraper like this you won't really need classes.

Go to your python interactive shell (where you can type commands and they get executed immediately), and type the following:

Code:
import this
And read its output.

Then watch this fun and informative talk.

09-07-2018 , 12:41 PM
Thanks very much for that, guess I'll hold off on that stuff till I have a real need for it

Still working away with this, here's where I'm up to now (if this is boring/ not the intention of this thread please do tell me to GTFO)

Code:
import bs4 as bs
import urllib.request
import re


urls = [
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/',
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/200k-2014-crushing-msnl-1428843/',
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2018-pgc-back-1702907/'
]
#the urls to be scraped (not yet set up to loop through them; has to run on 1 thread at a time)

userId = '230247'
#OP's user number, could also read this from html


url = urls[0]

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
#sets up the beautifulsoup

############################################

a_list = []
menu_tds = soup.find_all('td', {"class": "vbmenu_control"})
substring = 'Page'

for td in menu_tds:
    if substring in td.text:
        a_list.append(td.text.split()[-1])
        break

length = int(a_list.pop())
##this finds the number of pages in the thread (as seen by the program, which is not logged in)

##########################################

my_posts = []

my_urls = soup.find_all('a', {"class": "bigusername"}, href=re.compile(userId))

for item in my_urls:
    my_posts.append(item.parent["id"].split("_")[-1])

for i in range(2,length+1):

    sauce = urllib.request.urlopen(url+'index'+str(i)+'.html').read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    
    my_urls = soup.find_all('a', {"class": "bigusername"}, href=re.compile(userId))

    for item in my_urls:
        my_posts.append(item.parent["id"].split("_")[-1])

my_post_urls = []

for item1 in my_posts:
    my_post_urls.append('https://forumserver.twoplustwo.com/showpost.php?p='+item1)

text = str(my_post_urls)

saveFile = open('2p2_PGC_urls_1.txt', 'a')
saveFile.write(text)
saveFile.close()
This code searches through 1 thread at a time and saves a .txt file containing a list of URLs for all the individual posts made by the specified user.

Next steps (in order of difficulty):
- Verify it's returning all my posts (102 for the first thread; I know there's a way to check this on the forum but haven't found it yet)
- Loop through all the threads in URLs rather than just 1 at a time
- Find the OP's userID automatically from the html of the thread
- Build something to access the comprehensive list of post URLs and save the content (or parts of it) to [date-the-post-was-made].txt
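For the "find the OP's userID automatically" step, one possible approach rests on an assumption about this forum's markup: that the first "bigusername" profile link on page 1 belongs to the opening post and ends in /members/&lt;number&gt;/. A sketch:

```python
import re

import bs4 as bs

def op_user_id(soup):
    """Guess the thread starter's user number from page 1.

    Assumption: the first 'bigusername' profile link on page 1 belongs
    to the opening post and its href ends in /members/<number>/.
    """
    first = soup.find("a", {"class": "bigusername"})
    if first is None:
        return None
    match = re.search(r"/members/(\d+)", first["href"])
    return match.group(1) if match else None

# Demo on an invented one-line stand-in for page-1 markup
html = '<a class="bigusername" href="https://forumserver.twoplustwo.com/members/230247/">op</a>'
print(op_user_id(bs.BeautifulSoup(html, "html.parser")))
```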

Enjoying this a little more than is really healthy