pandas – python shift function does not work as expected

I have a data frame for the page events with the following structure:

  • sessionId – unique session id
  • event_date – date and time where the event occurred
  • pageTitle – event name (page to which users navigated)

In order to see how users navigate in the app and build a Sankey chart from it, I would like to create a new column "next_event" that records the next event in the session using the shift function, but the code below returns NaN values or mixes everything up and displays the wrong next event (I can see this from event_date).

df['event_date'] = pd.to_datetime(df['event_date'], unit='s')
df.sort_values(['sessionId', 'pageTitle', 'event_date'],
               ascending=[True, False, True], inplace=True)


grouped = df.groupby('sessionId')

def get_next_event(x): return x['pageTitle'].shift(-1)

df("next_event") = grouped.apply(
    lambda x: get_next_event(x)).reset_index(0, drop=True)

display(df.query('sessionId =="1"').sort_values('event_date'))
display(df.query('sessionId =="2"').sort_values('event_date'))

Example results:

sessionId event_date pageTitle next_event
1 2021-05-26 19:23:45.820 Search Search
1 2021-05-26 19:23:50.074 Statement Statement
1 2021-05-26 19:30:06.086 Search Search
1 2021-05-26 19:30:09.995 Statement Statement
1 2021-05-26 19:47:24.058 Statement Statement
1 2021-05-26 19:57:31.661 Aging Report NaN
1 2021-05-26 19:57:40.672 Search Aging Report
1 2021-05-26 19:57:44.160 Statement Search

And for the second session:

sessionId event_date pageTitle next_event
2 2021-07-20 15:43:35.941 Aging Report NaN
2 2021-07-20 15:44:52.739 Search Search
2 2021-07-20 15:44:56.173 Statement Statement
2 2021-07-20 16:23:02.761 Statement Statement

Please help me understand what I am doing wrong.
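A likely culprit is the sort: putting pageTitle (descending) between sessionId and event_date groups each session's rows by page name rather than by time, so shift(-1) picks up the wrong neighbour. A minimal sketch of the usual pattern, assuming the same column names:

import pandas as pd

# Sort chronologically within each session only; sorting by pageTitle
# in between would scramble the event order and corrupt next_event.
df['event_date'] = pd.to_datetime(df['event_date'], unit='s')
df.sort_values(['sessionId', 'event_date'], inplace=True)

# shift(-1) within each session: the last event of a session gets NaN.
df['next_event'] = df.groupby('sessionId')['pageTitle'].shift(-1)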

pandas – Using a conditional to automate SAP with Python

Let's see if anyone can guide this beginner here.

I am automating a process using the libraries win32com.client, sys, subprocess, time and pandas.
I have already managed to read and clean a spreadsheet, map the fields and enter them into the system (a huge step forward for me). Now I need a loop that reads only part of the table and completes the missing information.
It is like a header that sometimes has several items and sometimes only one. When there is more than one, I need to fill in the remaining ones.

If anyone can shed some light on this, I will be very grateful. Below is what I have done and what I need to do.
The database I am reading:

(image: my spreadsheet)

I have already managed to go field by field through the first row and enter it into the system I am using. Now I need to write logic that looks at the row below: if there are blank fields, it keeps filling in from the material field onward; otherwise it saves in the system and starts filling in a new header. I understand I need an "if" somewhere, but not where or how to write it.

Here is an excerpt of the code I am writing:

(image: code excerpt)
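Since the spreadsheet and code are only shown as images, here is a sketch of one common pandas approach, under the assumption that the header columns are blank (NaN) on continuation rows. The column names and the small example frame are hypothetical:

import pandas as pd

# Hypothetical data: the header field ('documento') is blank on rows
# that belong to the same document as the row above.
df = pd.DataFrame({
    'documento': ['1000', None, None, '1001'],
    'material':  ['M-01', 'M-02', 'M-03', 'M-10'],
    'qtd':       [5, 2, 7, 1],
})

# Forward-fill the header column so every item row carries its header.
df['documento'] = df['documento'].ffill()

# Process one document at a time: fill the header once, enter each item,
# then save before starting the next header.
for doc, grupo in df.groupby('documento', sort=False):
    print('new header:', doc)          # fill the SAP header fields here
    for _, item in grupo.iterrows():
        print('  item:', item['material'], item['qtd'])  # fill item fields
    print('save document')             # save in SAP here

This replaces the row-by-row "is the next row blank?" check with groupby, which hands you each header together with all of its items.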

pandas – Latitude and longitude from a DataFrame with CEPs – Python

I am trying to get the address, latitude and longitude from the CEPs (Brazilian postal codes) stored in a table in my database, but I am running into a number of difficulties.

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import pycep_correios
import pandas as pd
import psycopg2

conexao = psycopg2.connect(host='meuhost', database='table', user='user', password='password')
cursor = conexao.cursor()
cursor.execute('SELECT cep FROM tb_loja')
result = cursor.fetchall()
df = pd.DataFrame(result)
df.columns = [columns[0] for columns in cursor.description]

def get_adress(cep):
    end = pycep_correios.get_address_from_cep(cep)
    return end['logradouro']

geolocator = Nominatim(user_agent='cep_lat_long')
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df['address'] = df['cep'].apply(get_adress)
print(df)

Error:

Traceback (most recent call last):
  File "C:/Users/linol/Documents/parquet/cep_test.py", line 21, in <module>
    df['address'] = df['cep'].apply(get_adress)
  File "C:\Users\linol\Documents\parquet\venv\lib\site-packages\pandas\core\series.py", line 4213, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer
  File "C:/Users/linol/Documents/parquet/cep_test.py", line 15, in get_adress
    end = pycep_correios.get_address_from_cep(cep)
  File "C:\Users\linol\Documents\parquet\venv\lib\site-packages\pycep_correios\client.py", line 31, in get_address_from_cep
    cep = _format_cep(cep)
  File "C:\Users\linol\Documents\parquet\venv\lib\site-packages\pycep_correios\client.py", line 54, in _format_cep
    raise ValueError('CEP must be a non-empty string containing only numbers')  # noqa
ValueError: CEP must be a non-empty string containing only numbers

Process finished with exit code 1
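The traceback says that at least one value reaching get_address_from_cep is empty or contains non-digit characters (a formatted CEP like '01310-100', or a NULL coming from the database). A minimal defensive sketch, assuming a valid CEP has exactly 8 digits:

import re

import pandas as pd
import pycep_correios

def get_adress(cep):
    # Skip NULLs/NaNs coming from the database.
    if pd.isna(cep):
        return None
    # Strip everything that is not a digit, e.g. '01310-100' -> '01310100'.
    digits = re.sub(r'\D', '', str(cep))
    if len(digits) != 8:
        return None
    end = pycep_correios.get_address_from_cep(digits)
    return end['logradouro']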

pandas – Web scraping in Python: keep only data within a given latitude and longitude range

I am scraping a web page that lists earthquakes. The code I have already extracts all the data from the table, but I need to store only the rows whose latitude is within the range -21 to -27 and whose longitude is within -65 to -73. How can I do this?

So far the code extracts all the table data and generates a CSV from it.

Here is the code:

import urllib.request
from bs4 import BeautifulSoup
import csv
import pandas as pd


e = urllib.request.urlopen("http://www.sismologia.cl/ultimos_sismos.html").read()

soup = BeautifulSoup(e, 'html.parser')

# Example of how to print everything
# print(soup.prettify())

# Get the table

tabla_sismos = soup.find_all('table')[0]

# Get all the rows
rows = tabla_sismos.find_all("tr")

output_rows = []
for row in rows:
    # get all the columns
    cells = row.find_all("td")
    output_row = []
    if len(cells) > 0:
        for cell in cells:
            output_row.append(cell.text)
        output_rows.append(output_row)

dataset = pd.DataFrame(output_rows)

dataset.columns = ["Fecha Local", "Fecha UTC", "Latitud", "Longitud", "Profundidad (Km)", "Magnitud", "Referencia Geográfica"]


dataset.to_csv("Dataset.csv",  index=None)
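To keep only the rows in the target ranges, one straightforward sketch: convert the two columns to numbers (they arrive as text from the scrape) and filter with Series.between. Note that "latitude -21 to -27" means values between -27 and -21:

# Coordinates are scraped as strings; coerce them to numbers first.
dataset['Latitud'] = pd.to_numeric(dataset['Latitud'], errors='coerce')
dataset['Longitud'] = pd.to_numeric(dataset['Longitud'], errors='coerce')

# Keep only rows inside both ranges.
en_rango = dataset[dataset['Latitud'].between(-27, -21)
                   & dataset['Longitud'].between(-73, -65)]
en_rango.to_csv("Dataset.csv", index=None)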

pandas – Efficient way to filter multiple columns on multiple conditions in Python

I have a dataframe that looks something like this:

pd.DataFrame({'A Code': ['123', '234', '345', '234'],
              'B Code': ['345', '123', '234', '123'],
              'X Code': ['987', '765', '765', '876'],
              'Y Code': ['765', '876', '987', '765'],
              'H Code': ['AB', 'CD', 'EF', 'GH']})

    A Code  B Code  X Code  Y Code  H Code
0     123     345     987     765     AB
1     234     123     765     876     CD
2     345     234     765     987     EF
3     234     123     876     765     GH

And I want to find rows where A or B Code is '123' and X or Y Code is '765', or where H Code is 'EF' or 'GH'.

I’ve used

df[((df['A Code'] == '123') | (df['B Code'] == '123'))
   & ((df['X Code'] == '765') | (df['Y Code'] == '765'))
   | (df['H Code'] == 'EF')]

which works, but gets very long and messy. What’s a more efficient way to do this?
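One common way to shorten this kind of filter is to test several columns at once with .eq(...).any(axis=1) and to collapse the H Code checks into .isin(...). A sketch:

mask = ((df[['A Code', 'B Code']].eq('123').any(axis=1)
         & df[['X Code', 'Y Code']].eq('765').any(axis=1))
        | df['H Code'].isin(['EF', 'GH']))
result = df[mask]

Each .eq(...).any(axis=1) asks "does any of these columns equal the value in this row?", which scales cleanly if more code columns are added later.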

pandas – I need to format the result of my join between two dataframes where the other dataframe has more than one valid value for that index key

I have two dataframes that I am joining on the index key ('UWI'). I use the following code to join them successfully and then write the result to a CSV file to look at it.

joinedDF = df.join(df1.set_index('New_UWI'), on='New_UWI')
joinedDF.to_csv(r'Joined_Water_Analysis_WithLithium.csv', index=False)

The code works great: it creates one dataframe in which each UWI's info from df1 has been joined correctly to the same UWI in the calling dataframe df. BUT when reviewing the CSV file I realized that if there is more than one record in df1 for a particular UWI, the join is forced to replicate that UWI's record from df. So, for example, if df had two rows of valid data for a UWI:
| DF UWI | Formation |
| --- | --- |
| Well 1 | Nisku |
| Well 1 | Belly River |

And df1 had two rows of data for that same UWI:
| DF1 UWI | Lithium |
| --- | --- |
| Well 1 | 77 |
| Well 1 | 8 |

then the joined dataframe will have four rows, because the join attaches both df1 records to each Well 1 row in df. The resulting dataframe is hard to scan when deciding which Lithium value belongs to which Well 1 row in df. Believe me, it gets worse: with ten unique Well 1 records in df and two in df1, the result is 20 rows to sort through to match each Lithium sample to the appropriate Well 1 record.
| DF UWI | Formation | DF1 UWI | Lithium |
| --- | --- | --- | --- |
| Well 1 | Nisku | Well 1 | 71 |
| Well 1 | Belly River | Well 1 | 8.5 |
| Well 1 | Nisku | Well 1 | 71 |
| Well 1 | Belly River | Well 1 | 8.5 |

I would prefer that the join did not replicate the df rows, so the CSV would show only the two original Well 1 rows in the df columns. At the moment I am manually deleting the replicated cell values for Well 1 in the CSV so the result looks more like the table below. Is there a way to make the join do this formatting for me? It is a long and tiring manual process.

| DF UWI | Formation | DF1 UWI | Lithium |
| --- | --- | --- | --- |
| Well 1 | Nisku | Well 1 | 71 |
| Well 1 | Belly River | Well 1 | 8.5 |
| | | Well 1 | 71 |
| | | Well 1 | 8.5 |

To keep this example short, I am not showing that both dataframes share common columns such as Formation, Top Interval, Bottom Interval, KB elev, etc., which help us determine which Well 1 row gets each unique Lithium value from df1.
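For the cosmetic part alone, one possible sketch: after the join, blank out the df-side columns on every repeated row. df.join keeps df's original index, so the rows replicated by the one-to-many match share an index value. The frames below are simplified, hypothetical stand-ins:

import pandas as pd

df = pd.DataFrame({'UWI': ['Well 1', 'Well 1'],
                   'Formation': ['Nisku', 'Belly River']})
df1 = pd.DataFrame({'New_UWI': ['Well 1', 'Well 1'],
                    'Lithium': [71, 8.5]})

joinedDF = df.join(df1.set_index('New_UWI'), on='UWI')

# Rows replicated by the one-to-many match share an index value;
# blank the df-side columns on each repeat so every original df row
# prints only once in the CSV.
repeats = joinedDF.index.duplicated()
joinedDF.loc[repeats, ['UWI', 'Formation']] = ''

joinedDF.to_csv('Joined_Water_Analysis_WithLithium.csv', index=False)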

Using user’s input to choose which csv file out of several files the code will run on (python pandas)

Let's say that I want to calculate the mean of a certain column in a csv file using pandas. What I know is that I should use this code:

import pandas as pd
x_file = pd.read_csv("(x_file PATH)")
print(x_file("column").mean())

Now, if I have several files with exactly the same columns, and whether to take the mean of the column from x_file or y_file depends on the user's input, how can I point the x_file["column"] part at whichever file the user chooses?
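One simple sketch: map the names the user can type to file paths (the paths here are hypothetical) and read whichever one was chosen:

import pandas as pd

# Hypothetical mapping from a short name the user types to a CSV path.
files = {'x': 'x_file.csv', 'y': 'y_file.csv'}

choice = input('Which file (x/y)? ').strip().lower()
if choice in files:
    df = pd.read_csv(files[choice])
    print(df["column"].mean())
else:
    print('Unknown choice:', choice)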

pandas – If statement in Python

Hello,
In a database with daily vaccination figures, I want to check whether the sum across laboratories (Sinovac, Pfizer and Astrazeneca) differs from the reported total. If it does, I want to print the date.

I am using the following code:

Fecha = data[["Fecha"]]

for i in Fecha:
    primera = (data["1era Dosis Sinovac"] + data["1era Dosis Pfizer"] + data["1era Dosis Astrazeneca"])
    print(primera)
    if primera != data["Total Dosis 1"]:
        print(i)

And the error I get is:


  File "<ipython-input-38-f2dd61100bee>", line 4, in <module>
    if primera > data["Total Dosis 1"]:

  File "C:\Users\andrea\anaconda3\lib\site-packages\pandas\core\generic.py", line 1442, in __nonzero__
    raise ValueError(

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for your help!
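The error happens because primera != data["Total Dosis 1"] produces a whole Series of booleans, which an if statement cannot reduce to a single True/False. A sketch of the vectorized version, using the same column names:

# Compare row by row instead of testing a whole Series in an if.
primera = (data["1era Dosis Sinovac"]
           + data["1era Dosis Pfizer"]
           + data["1era Dosis Astrazeneca"])

difere = primera != data["Total Dosis 1"]
print(data.loc[difere, "Fecha"])  # the dates where the sums disagree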

python – optimize pandas code, add column with count by constraint

Could someone please help me out? I'm trying to remove the need to iterate through the dataframe, and I know it is likely very easy for someone with the knowledge.

Dataframe:

             id  racecourse     going  distance  runners  draw  draw_bias
0        253375         178  Standard       7.0       13     2       0.50
1        253375         178  Standard       7.0       13    11       0.25
2        253375         178  Standard       7.0       13    12       1.00
3        253376         178  Standard       6.0       12     2       1.00
4        253376         178  Standard       6.0       12     8       0.50
...         ...         ...       ...       ...      ...   ...        ...
378867  4802789         192  Standard       7.0       16    11       0.50
378868  4802789         192  Standard       7.0       16    16       0.10
378869  4802790         192  Standard       7.0       16     1       0.25
378870  4802790         192  Standard       7.0       16     3       0.50
378871  4802790         192  Standard       7.0       16     8       1.00

[378872 rows x 7 columns]

What I need is to add a new column with the count of unique races (id) matching the conditions defined below. This code works as expected, but it is sooo slow…

df['race_count'] = None
for i, row in df.iterrows():
    df.at[i, 'race_count'] = df.loc[(df.racecourse == row.racecourse)
                                    & (df.going == row.going)
                                    & (df.distance == row.distance)
                                    & (df.runners == row.runners), 'id'].nunique()
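A sketch of the vectorized equivalent: group by the four key columns once and broadcast the per-group unique-id count back onto every row with transform, avoiding the per-row scan entirely:

# One groupby pass: for each (racecourse, going, distance, runners)
# combination, count the unique race ids and broadcast back to each row.
df['race_count'] = (df
    .groupby(['racecourse', 'going', 'distance', 'runners'])['id']
    .transform('nunique'))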