Programming homework and newbie help thread

08-01-2018 , 05:53 PM
There used to be big telephone book sized manuals for the major DB vendors but now it’s all online docs. You might try looking for text books or YouTube videos. I used a kindle book called Jump Start MySQL to brush up on joins and MySQL and it was pretty good, walked through setting up the DB locally and data sets for the queries they were teaching.
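For anyone wanting to poke at joins the same way without installing a DB server, Python's bundled sqlite3 module is enough for a throwaway setup. A minimal sketch (the table and column names below are invented for the example):

```python
import sqlite3

# In-memory DB so the example needs no setup; schema and data are made up
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sessions (player_id INTEGER, profit REAL);
    INSERT INTO players VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO sessions VALUES (1, 25.0), (1, -10.0), (2, 100.0);
""")

# JOIN matches each session row to its player row; GROUP BY sums per player
query = """
    SELECT p.name, SUM(s.profit) AS total
    FROM players p
    JOIN sessions s ON s.player_id = p.id
    GROUP BY p.name
    ORDER BY p.name
"""
for name, total in conn.execute(query):
    print(name, total)
```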
08-01-2018 , 06:46 PM
Quote:
Originally Posted by well named
I feel like this should exist, but I don't actually know of any that I really like. I learned the basics a long time ago, and since then I usually just cobble together whatever refresher I need from random StackOverflow questions and db engine documentation.
Yeah, I'm basically self-taught from forums, plus fundamentals that go back to learning Pascal and, earlier, BASIC (yep, we're all old). From what I've read it seems pretty straightforward, but until I start doing stuff in practice I probably won't know for sure
08-01-2018 , 06:48 PM
postgresql.org documentation is very good.
08-01-2018 , 06:51 PM
Most DBs have decent docs, but they aren't really how-tos, especially in terms of "how should I design my database" or "how should I write clear and performant queries".

I learned from an online guide called "SQL for Cave Men" by Philip Greenspun, originally on photo.net, but that seems to be long gone. I don't know what you'd use now. The SQL for Cave Men thing started from the ground up and had you designing schema and queries for a system sort of piece by piece. It was oracle-centric but it was really extremely useful.
08-01-2018 , 10:31 PM
I'm mostly familiar with Postgres but here are a few sites I've used to learn more about SQL:

https://sqlzoo.net/
https://sqlbolt.com/
http://www.postgresqltutorial.com/
https://pgexercises.com/
http://postgresguide.com/
https://use-the-index-luke.com/

Last edited by CBorders; 08-01-2018 at 10:47 PM.
08-02-2018 , 08:27 AM
I think codecademy still has an intro sql course.

Also, Odin Project’s sql portion is pretty informative iirc
08-02-2018 , 08:54 AM
There's no magic bullet for SQL. Learning the basic joins etc is all pretty easy, learning to do more complex things is a matter of learning various idioms, windowing functions, etc, which all takes experience.
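As a small taste of the windowing functions mentioned above, here's a sketch using Python's bundled sqlite3 (window functions need SQLite 3.25+, which recent Python builds ship with; the table and data are invented):

```python
import sqlite3

# Toy data for a running-total example
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (day INTEGER, profit REAL);
    INSERT INTO sessions VALUES (1, 25.0), (2, -10.0), (3, 40.0);
""")

# A running total: the classic window-function idiom, no self-join required
query = """
    SELECT day, profit,
           SUM(profit) OVER (ORDER BY day) AS running_total
    FROM sessions
"""
for day, profit, running_total in conn.execute(query):
    print(day, profit, running_total)
```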
08-02-2018 , 11:08 PM
SQL is on its way out anyway, don't worry
08-02-2018 , 11:55 PM
Quote:
Originally Posted by jmakin
SQL is on its way out anyway, don't worry
I'm such a dinosaur I don't even know if this is a funny or not
08-03-2018 , 01:27 AM
NoSQL is a bold new paradigm, it has reached a new and permanently high plateau.
08-03-2018 , 10:31 AM
Quote:
Originally Posted by ChrisV
NoSQL is a bold new paradigm, it has reached a new and permanently high plateau.
I've used several nosql platforms, but I don't really see them replacing SQL. Is that something that people really think will happen?

We use an in-house graph database for some stuff. I understand they are planning to rip it out and replace it with postgres.
08-03-2018 , 07:32 PM
I'm probably biased because of what I do/where I work, but RDBMSs typically don't scale well for today's computing needs, which are ever-growing and frequently require queries at near real-time speeds.

In theory, it can scale, but in practice, that's entirely different. There are a lot of reasons for this but I'll probably get called out for spouting BS. You can research it, there's a lot out there.
08-03-2018 , 08:56 PM
Quote:
Originally Posted by jmakin
I'm probably biased because of what I do/where I work, but RDBMSs typically don't scale well for today's computing needs, which are ever-growing and frequently require queries at near real-time speeds.

In theory, it can scale, but in practice, that's entirely different. There are a lot of reasons for this but I'll probably get called out for spouting BS. You can research it, there's a lot out there.
A RDBMS is the right tool for some jobs and not the right tool for others, sure. You don't use an array for everything, you don't use a hashmap for everything, etc.

But I think there are a lot of things best suited to a relational database that people are using nosql databases for, because they don't know, or because they're following a trend, or because they think it doesn't matter, or because it seems harder, or because it's a prototype, etc, etc.

I think people use nosql databases for a lot more wrong reasons than right ones.

ETA: I've been using "nosql" databases for longer than I've used sql ones - at least 25 years. I've written more than one myself, I've used several home grown ones, and I've used 2 modern professional strength ones. I've also spent a lot of the last 20 years using oracle, postgres, mysql, mssqlserver, sqlite, MS access, and informix. I may have left some off.

I'm currently working on a system that has tens of thousands of concurrent users and is expected to scale to millions of concurrent users. Tables have millions of rows. Most endpoint requests do somewhere between 1 and 5 reads, and most will do at least one write. Response times are around 5ms on very modest hardware, a single database server with no replication or clustering.

ETA 2: you know what, I forgot about redis and memcache so I guess 4 nosql databases.

Last edited by RustyBrooks; 08-03-2018 at 09:16 PM.
08-03-2018 , 10:05 PM
noSQL is such a loaded term because it can encompass such a vast variety of DBs - document based, like MongoDB, column based, like Druid, or pure K/V stores - I deal mostly with the latter - and you're right, YMMV depending on what your specific use case is.

For what I'm doing though 5ms would be glacially slow. We shoot for 30-50 microseconds for most operations even under really heavy loads. Total response time would definitely be only limited by the network bandwidth. That's what we shoot for.

You could build some type of SQL-based DBMS on top of our software - but I think it would still suck.
08-03-2018 , 10:33 PM
Yeah I kind of laughed at myself internally with the 5ms number (although that is end-to-end, the entire HTTP transaction, not a single query or even all the query times together. Individual queries are under 1ms, but I don't know how much under because our reporting infrastructure reports in integer multiples of ms)

2 jobs ago I worked for a HFT company and we were given the opportunity to be market makers in currency trading. We'd get a flow of potential trades, we had to fill a specific percentage of them, and we were given 1ms to make trade decisions. When I heard this, I laughed. 1ms? Did they want us to check our work 10 times? (That job, btw, had at least one in-house nosql database - a timeseries database - and probably at least one other.)
08-31-2018 , 10:30 AM
What's up guys, I am trying to write a web scraper to extract a bunch of old blog posts from my PGC threads, so I can post them to my wordpress site.

As a step 1, I'm trying to extract a list of URLs that correspond to individual posts that I made. Example URL in this screenshot:

[screenshot not preserved]
So, the URL I want is contained within a table. I can identify that table through its second tr tag -> td -> div -> a, which has the attribute href="..../230247" (this corresponds to my user number on the forum)

I've watched the sentdex beautifulsoup tutorial on youtube twice, but I'm struggling to figure out how to navigate through and extract this information. Here's what I have so far:

Code:
import bs4 as bs
import urllib.request
import re

url = 'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

##posts = []
##for post in soup.find_all('table', id=re.compile("post")):
##    posts.append(post.name)
##print(posts)
#### This returns a list of all the tables that are posts

##for url in soup.find_all('a', href=re.compile("230247")):
##    print(url.get('href'))
#### This finds all links that correspond to my user and prints the URLs

##my_urls = soup.find('a', href=re.compile("230247"))
##
##for parent in my_urls.parents:
##    if parent is None:
##        print(parent)
##    else:
##        print(parent.name)
#### This prints the navigation path upward from the relevant a tag (doesn't work with find_all, which returns a list of bs4 elements)
Any help/pointers in the right direction greatly appreciated

Cheers
08-31-2018 , 11:06 AM
I could do that in about 5 seconds in jQuery, in fact here it is:

let links = $("table[id^='post']").has("a[href$='members/230247/']").find("tr:first-child a[id^='postcount']");

Unfortunately I don't speak Python or beautifulsoup. What I don't know how to do is that "has" part, which filters the list of tables based on whether they have a descendant element of that type. Maybe someone can translate.
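Not a definitive translation, but BeautifulSoup's .select() speaks the same CSS attribute selectors, and the .has() step can be approximated with a list comprehension whose if-clause filters on select_one(). A sketch against invented stand-in markup (two fake post tables, only the first "mine"):

```python
import bs4

# Invented stand-in for two forum post tables
html = """
<table id="post111">
  <tr><td><a id="postcount111" href="showpost.php?p=111"></a></td></tr>
  <tr><td><a href="https://forumserver.twoplustwo.com/members/230247/">me</a></td></tr>
</table>
<table id="post222">
  <tr><td><a id="postcount222" href="showpost.php?p=222"></a></td></tr>
  <tr><td><a href="https://forumserver.twoplustwo.com/members/999999/">other</a></td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes the same CSS; the if-clause plays the role of jQuery's .has()
links = [
    table.select_one("a[id^=postcount]")
    for table in soup.select("table[id^=post]")
    if table.select_one("a[href$='members/230247/']")
]
print([a["href"] for a in links])
```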
08-31-2018 , 11:19 AM
One approach that might work fairly well is to just use the find_all() method to get all the elements that match your user profile URL, then use find_parent() to get to the table, and from there you should be able to call find() on the table to get the sub-element you want.
08-31-2018 , 11:25 AM
Or find_sibling
08-31-2018 , 11:50 AM
Thanks so much for your help guys. I was just coming back here to edit my post, having made something of a breakthrough:

Code:
import bs4 as bs
import urllib.request
import re

url = 'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')


all_my_urls = soup.find('a', href=re.compile("230247"))
for parent in all_my_urls.parents:
    if parent.name == 'table':
        post_link = parent.find('a', href=re.compile("post"))
        break

print(post_link.get('href'))
>>>
https://forumserver.twoplustwo.com/s...33&postcount=1

Unfortunately this breaks when I change soup.find to soup.find_all

But I'm thinking there'll be some hacky thing where I can just get this to run multiple times slicing through the finds..
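No slicing hack should be needed: find_all() hands back a list, so the parent walk just moves inside a loop over it, and find_parent('table') can replace the manual .parents loop entirely. A sketch against invented stand-in markup:

```python
import re

import bs4 as bs

# Invented stand-in markup: two post tables, both containing my profile link
html = """
<table id="post1"><tr><td>
  <a href="showpost.php?p=111&amp;postcount=1">1</a>
  <a href="https://forumserver.twoplustwo.com/members/230247/">me</a>
</td></tr></table>
<table id="post2"><tr><td>
  <a href="showpost.php?p=222&amp;postcount=2">2</a>
  <a href="https://forumserver.twoplustwo.com/members/230247/">me</a>
</td></tr></table>
"""
soup = bs.BeautifulSoup(html, "html.parser")

post_links = []
# find_all() returns a list, so the climb to the table happens once per match
for my_url in soup.find_all("a", href=re.compile("230247")):
    table = my_url.find_parent("table")  # replaces the manual .parents loop
    link = table.find("a", href=re.compile("postcount"))
    post_links.append(link.get("href"))

print(post_links)
```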
08-31-2018 , 03:42 PM
Note: I am very much an amateur, but the code below outputs raw postcounts of all your posts on the first page of that thread.

Code:
36956233
36962774
36972076
36994108
37115053
I was confused at first, because there are 25 of your posts on the first page, not only the 5 which the script outputs. But that's for me, logged into 2p2, viewing the forum at 100ppp. Your script will not be logged in and thus only view 25ppp. So it only finds 5 of your posts.

You'll have to cycle through the pages, and then append all the raw postnumbers to the url to scrape all your posts from that thread.

Code:
#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup


url = "https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/"
user_id = "230247"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

myPosts = []

for user in soup.find_all("a", {"class" : "bigusername"}):
    if user_id in user["href"]:
        myPosts.append(user.parent["id"].split("_")[-1])

for post in myPosts:
    print(post)

If the for-loop confuses you (especially the last line), here's what it does:

The loop walks over every "a" tag of class "bigusername", binding each one in turn to the variable "user".
We check if it's you (230247)
Then, to get only the raw post numbers, we go to this tag's parent, which looks like this:
Code:
<div id="postmenu_xxxxxxxx">
We take its id tag, split it into a list separated at the underscore, and then take this list's last item ([-1]) to append to our list "myPosts".
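The split step in isolation, if it helps:

```python
# "postmenu_36956233".split("_") gives ["postmenu", "36956233"]; [-1] is the last piece
print("postmenu_36956233".split("_")[-1])  # -> 36956233
```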

A final tip: Don't hardcode "url" and "user_id". Use sys.argv instead. You never know when you might want to scrape another thread, or another user's posts.
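A minimal sketch of the sys.argv idea (the argument order and the parse_args name are just one possible choice):

```python
import sys

def parse_args(argv):
    """Pull (url, user_id) off a command line like: scrape.py <url> <user_id>."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <thread_url> <user_id>")
    return argv[1], argv[2]

# Real use would be parse_args(sys.argv); demo with a fake argv here
url, user_id = parse_args(["scrape.py", "THREAD_URL", "230247"])
print(url, user_id)
```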
08-31-2018 , 03:52 PM
Oh, and one style note: PEP 8 actually recommends snake_case for variables, so "user_id" is already the "pythonic" spelling, if that's something you care about (camelCase like "userId" is more of a Java/JS convention).
08-31-2018 , 06:27 PM
God damn, it's beautiful! Thanks very much man, I've read and understood this (I think), gonna try and recreate it from scratch just now. Definitely need to look into OOP stuff, my plan is to eventually build this out into something that can loop through multiple pages of multiple threads (and be useful for multiple users)

Feels good to be a losing 2nl player again lol
08-31-2018 , 11:23 PM
Quote:
Originally Posted by d7o1d1s0
Definitely need to look into OOP stuff
I didn't use any OOP stuff above, if that's what you think. In any event, for a simple scraper like this you won't really need classes.

Go to your python interactive shell (where you can type commands and they get executed immediately), and type the following:

Code:
import this
And read its output.

Then watch this fun and informative talk.

09-07-2018 , 12:41 PM
Thanks very much for that, guess I'll hold off on that stuff till I have a real need for it

Still working away with this, here's where I'm up to now (if this is boring/ not the intention of this thread please do tell me to GTFO)

Code:
import bs4 as bs
import urllib.request
import re


urls = [
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2013-pgc-100k-moving-up-1295523/',
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/200k-2014-crushing-msnl-1428843/',
    'https://forumserver.twoplustwo.com/174/poker-goals-amp-challenges/d7s-2018-pgc-back-1702907/'
]
#the urls to be scraped (not yet set up to loop through them; has to run on 1 thread at a time)

userId = '230247'
#OP's user number, could also read this from html


url = urls[0]

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
#sets up the beautifulsoup

############################################

a_list = []
menu_tds = soup.find_all('td', {"class": "vbmenu_control"})
substring = 'Page'

for td in menu_tds:
    if substring in td.text:
        a_list.append(td.text.split()[-1])
        break

length = int(a_list.pop())
##this finds the number of pages in the thread (as seen by the program, which is not logged in)

##########################################

my_posts = []

my_urls = soup.find_all('a', {"class": "bigusername"}, href=re.compile(userId))

for item in my_urls:
    my_posts.append(item.parent["id"].split("_")[-1])

for i in range(2,length+1):

    sauce = urllib.request.urlopen(url+'index'+str(i)+'.html').read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    
    my_urls = soup.find_all('a', {"class": "bigusername"}, href=re.compile(userId))

    for item in my_urls:
        my_posts.append(item.parent["id"].split("_")[-1])

my_post_urls = []

for item1 in my_posts:
    my_post_urls.append('https://forumserver.twoplustwo.com/showpost.php?p='+item1)

text = str(my_post_urls)

saveFile = open('2p2_PGC_urls_1.txt', 'a')
saveFile.write(text)
saveFile.close()
This code searches through 1 thread at a time and saves a .txt file containing a list of URLs for all the individual posts made by the specified user.

Next steps (in order of difficulty):
- Verify it's returning all my posts (102 for the first thread; I know there's a way to check this on the forum but haven't found it yet)
- Loop through all the threads in URLs rather than just 1 at a time
- Find the OP's userID automatically from the html of the thread
- Build something to access the comprehensive list of post URLs and save the content (or parts of it) to [date-the-post-was-made].txt
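For the "find the OP's userID automatically" step, one possible approach rests on an assumption about this forum's markup: that the first "bigusername" profile link on page 1 belongs to the opening post and ends in /members/&lt;number&gt;/. A sketch:

```python
import re

import bs4 as bs

def op_user_id(soup):
    """Guess the thread starter's user number from page 1.

    Assumption: the first 'bigusername' profile link on page 1 belongs
    to the opening post and its href ends in /members/<number>/.
    """
    first = soup.find("a", {"class": "bigusername"})
    if first is None:
        return None
    match = re.search(r"/members/(\d+)", first["href"])
    return match.group(1) if match else None

# Demo on an invented one-line stand-in for page-1 markup
html = '<a class="bigusername" href="https://forumserver.twoplustwo.com/members/230247/">op</a>'
print(op_user_id(bs.BeautifulSoup(html, "html.parser")))
```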

Enjoying this a little more than is really healthy