python – How to do vector operations on pyspark dataframe columns?

I have a dataframe like so:

id | vector1 | id2 | vector2

where the ids are ints and the vectors are SparseVector types.

For each row, I want to add a column containing the cosine similarity of the two vectors, which would be computed as
vector1.dot(vector2) / (sqrt(vector1.dot(vector1)) * sqrt(vector2.dot(vector2)))
but I can't figure out how to put the result into a new column. I've tried making a UDF, but can't seem to get it working.
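
For what it's worth, a minimal sketch of the UDF route, assuming the vectors are pyspark.ml.linalg SparseVectors and df is the dataframe above (cosine_similarity is just a hypothetical name for the new column):

import math

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def cos_sim(v1, v2):
    # Cosine similarity per the formula above, guarding against zero vectors.
    denom = math.sqrt(v1.dot(v1)) * math.sqrt(v2.dot(v2))
    return float(v1.dot(v2) / denom) if denom else None

df = df.withColumn("cosine_similarity", cos_sim("vector1", "vector2"))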

python – Pandas add calculated row for every row in a dataframe

I have a dataframe like so:

id  variable  value
1      x        5
2      x        7

Now for every row, I want to add a calculated row. This is what I am doing as of now:

a = 2
b = 5
df2 = pd.DataFrame(columns=["id", "variable", "value"])
for index, row in df.iterrows():
    df2 = df2.append({'id': row['id'], 'variable': 'y', 'value': a * row['value'] + b}, ignore_index=True)
df = pd.concat([df, df2])
df = df.sort_values(['id', 'variable'])

And so finally I get:

id  variable  value
1      x        5
1      y        15
2      x        7
2      y        19

But surely there must be a better way to do this, ideally one that avoids the for loop and the sorting, as there are a lot of rows.
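
For reference, a vectorized sketch of the same computation, reusing the df, a, and b defined above; assign builds all the y rows at once, so the loop disappears and only one concat and one sort remain:

# Build every calculated row in one shot instead of appending row by row.
df2 = df.assign(variable='y', value=a * df['value'] + b)
df = pd.concat([df, df2]).sort_values(['id', 'variable'], ignore_index=True)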

filter – R: How to remove rows matching a condition in a grouped dataframe when the group row count is >1

I have the following sample dataset:

structure(list(Vno = c(1111, 1111, 2222, 3333, 3333, 4444, 5555, 
5555), ID = c("A001", "X011", "B002", "C003", "Y033", "D004", 
"E005", "X055"), Name = c("John", "S/O JJJ", "S/O LLL", "Jane", 
"D/O MMM", "S/O ZZZ", "Nicole", "D/O ZZZ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

Output:

> df
# A tibble: 8 x 3
    Vno ID    Name   
  <dbl> <chr> <chr>  
1  1111 A001  John   
2  1111 X011  S/O JJJ
3  2222 B002  S/O LLL
4  3333 C003  Jane   
5  3333 Y033  D/O MMM
6  4444 D004  S/O ZZZ
7  5555 E005  Nicole 
8  5555 X055  D/O ZZZ

The expected output should filter out names starting with 'S/O' or 'D/O', but only when the group_by(Vno) count is >1. However, my attempt below removes every row with 'S/O' or 'D/O', even when it is the only row in its group:

pt_byVno <- df %>%
  group_by(Vno) %>%
  filter(!grepl('S/O|D/O',Name)) %>%
  print
# A tibble: 3 x 3
    Vno ID    Name  
  <dbl> <chr> <chr> 
1  1111 A001  John  
2  3333 C003  Jane  
3  5555 E005  Nicole

The desired output should be:

# A tibble: 5 x 3
    Vno ID    Name   
  <dbl> <chr> <chr>  
1  1111 A001  John   
2  2222 B002  S/O LLL
3  3333 C003  Jane   
4  4444 D004  S/O ZZZ
5  5555 E005  Nicole 

Any help from R experts would be appreciated, thanks!
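
For reference, one way the group-size condition could be expressed in dplyr (a sketch: n() gives the current group's row count, so a matching name is only dropped when its Vno group has more than one row):

library(dplyr)

pt_byVno <- df %>%
  group_by(Vno) %>%
  # Drop S/O- or D/O-prefixed names only in groups with more than one row.
  filter(!(grepl("^(S|D)/O", Name) & n() > 1)) %>%
  ungroup()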

python – Get the last activity per user per day in a dataframe

I have many users, and each time a user uses their smartphone, the event is registered. I want to determine the last time each user used their smartphone each day.
Additionally, smartphone usage between 18:00 and 06:00 the next day should be counted as an entry on the previous day. I have created a dummy example.

I did the following:

  1. First subtract the number of hours.
  2. Sort the data frame based on user and date time.
  3. Get the last row.

Is there a more efficient approach to this? Are there other tips I can follow to improve my code?

import datetime
import pandas as pd

df_example = {'id': [1, 1, 1, 1, 1],
              'activity': [datetime.datetime(2019, 12, 1, 19, 30, 1),
                           datetime.datetime(2019, 12, 1, 20, 22, 2),
                           datetime.datetime(2019, 12, 2, 2, 13, 2),
                           datetime.datetime(2019, 12, 3, 19, 12, 2),
                           datetime.datetime(2019, 12, 3, 21, 3, 1)]}
df_example = pd.DataFrame(df_example, columns=['id', 'activity'])

# Shift timestamps back 6 hours so late-night usage counts toward the previous day.
df_example['activity'] = df_example['activity'] - datetime.timedelta(hours=6)
df_example['date'] = df_example['activity'].apply(lambda x: x.date())
df_example = df_example.sort_values(by=['id', 'activity'])
df_example.groupby(['id', 'date']).tail(1)
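
As a point of comparison, a sketch that skips the explicit sort, since taking the per-group max already yields the latest activity for each (id, day); it assumes df_example['activity'] still holds the original, unshifted timestamps:

shifted = df_example['activity'] - pd.Timedelta(hours=6)
last_per_day = (df_example.assign(date=shifted.dt.date)
                          .groupby(['id', 'date'], as_index=False)['activity'].max())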

dataframe – Breaking a while True loop after x minutes

I am trying to find a way to break out of a while True loop after x minutes; below is my code. The i() function appends the new data, which I then want to put into a dataframe.

import time
import bs4
import requests
import pandas as pd

price = []
pricetime = []


r = requests.get('https://finance.yahoo.com/quote/SPY?p=SPY&.tsrc=fin-srch')
soup = bs4.BeautifulSoup(r.text, 'lxml')
current_price = soup.find_all('div', {'class': "My(6px) Pos(r) smartphone_Mt(6px)"})[0].find('span').text

def i():
    while True:
        print(current_price + str(time.ctime()))
        price.append(current_price)
        pricetime.append(time.ctime())
        time.sleep(10)
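
One common pattern, sketched below: compute a deadline up front and test it as the loop condition (collect_for and the 5-minute default are hypothetical names, and moving the fetch inside the loop is an assumption, since a price scraped once before the loop would never change):

def collect_for(minutes=5):
    # Run until the monotonic clock passes the deadline.
    deadline = time.monotonic() + minutes * 60
    while time.monotonic() < deadline:
        r = requests.get('https://finance.yahoo.com/quote/SPY?p=SPY&.tsrc=fin-srch')
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        current_price = soup.find_all('div', {'class': "My(6px) Pos(r) smartphone_Mt(6px)"})[0].find('span').text
        print(current_price + str(time.ctime()))
        price.append(current_price)
        pricetime.append(time.ctime())
        time.sleep(10)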

python – How to transform a dataframe with Multipolygons into a geodataframe?

I have an Excel file with the following format:

[screenshot of the Excel file's layout]

I need to load it as a geodataframe, but since it is an Excel file I can only load it as a dataframe:
df = pd.read_excel('C:/...')

When I try to load it with geopandas I get an error, since it needs a GeoJSON file and I am feeding it an xls:
df = gpd.read_file("C:/...")

How can I load it so that Python recognizes it as a geodataframe?
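
A sketch of one route, assuming the multipolygons are stored in the sheet as WKT text (the column name geometry is a guess; adjust it to the sheet's actual header):

import pandas as pd
import geopandas as gpd
from shapely import wkt

df = pd.read_excel('C:/...')  # path elided as in the question

# Parse the WKT strings into shapely geometries, then promote to a GeoDataFrame.
df['geometry'] = df['geometry'].apply(wkt.loads)
gdf = gpd.GeoDataFrame(df, geometry='geometry')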

python – How to group every unique value in a pandas dataframe outside the top 5 (by size) into an 'Other' category for plotting and pivot tables?

I have a column Order with over 75 unique item values (Clothes, Appliances, Electronics, etc.) and over 1000 entries in total. When I try to plot any descriptive statistics over a period of time (importantly, not all order items are non-zero in every period), the graph becomes hard to read because some values are tiny compared to others.

For that reason, I think it would be better to show only the top 3-5 order items (by size/count) and group the rest into a category called Other, purely for plotting and groupby/pivot-table purposes.

I could map the values, but that would be time-consuming, and it would also require changing the existing values for the entries outside the top 3-5, which is not something I want to do.

How would I do that?
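
A sketch of one way to do this without altering the original column (df and Order come from the question; Order_plot is a hypothetical helper column used only for plotting and pivoting):

# Keep the 5 most frequent order items; relabel everything else as 'Other'.
top = df['Order'].value_counts().nlargest(5).index
df['Order_plot'] = df['Order'].where(df['Order'].isin(top), 'Other')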