Convert a DataFrame to a dictionary - python

I have a CSV like this:

Pais           1/22/20    2/20/20    7/2/20
Afghanistan     100         20         5
Albania         50          10         3
Algeria         30          0          0

The columns are dates and the cell values are case counts.
The goal is to convert it into a dictionary of this form:

{
"Pais1": {"time": [1/22/20, 1/23/20,...], "cases": [0, 0,...]},
"Pais2": {"time": [1/22/20, 1/23/20,...], "cases": [0, 0,...]},
...
}

I can't get it to come out exactly like that; instead I get a list like this:

[{'Afghanistan': {'time': ['1/22/20', '2/20/20', '7/2/20'],
   'cases': [100, 20, 5]}},
 {'Albania': {'time': ['1/22/20', '2/20/20', '7/2/20'],
   'cases': [50, 10, 3]}},
 {'Algeria': {'time': ['1/22/20', '2/20/20', '7/2/20'],
   'cases': [30, 0, 0]}}]
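
One possible way to get a single dictionary instead of a list is sketched below. It assumes the CSV is comma-separated and that the country column is literally named `Pais`, as in the sample; the inline `csv_text` stands in for the real file. The last lines show how an already-built list of one-key dictionaries could be merged.

```python
import io

import pandas as pd

# Inline stand-in for the real CSV file (assumed comma-separated).
csv_text = """Pais,1/22/20,2/20/20,7/2/20
Afghanistan,100,20,5
Albania,50,10,3
Algeria,30,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Use the country as the index so every remaining column is a date.
wide = df.set_index("Pais")
dates = list(wide.columns)

# One top-level key per country, each with its own "time" and "cases" lists.
result = {
    country: {"time": list(dates), "cases": row.tolist()}
    for country, row in wide.iterrows()
}

# If the list of one-key dictionaries already exists, it can be flattened.
as_list = [{country: data} for country, data in result.items()]
merged = {key: value for item in as_list for key, value in item.items()}
```

Building the dictionary directly avoids the intermediate list; the merge at the end is only useful when the list is produced by code that cannot easily be changed.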

pandas – How to iterate over the groups of a DataFrame

I would like to know whether it is possible, and how, to loop over the groups of a pandas DataFrame. I have a text file that I import with pandas:

import pandas as pd

file = "E:Test.txt"
df = pd.read_csv(file, sep=';', dtype='str')
df

This gives me the following data:

      Fecha      ID   Alum Par  Nombre  Codigo  Descrip Cantidad    Tiempo
0   09/03/2021  A935    809 00  CARMEN  660192  Hard Disk   1   2
1   09/03/2021  A935    809 00  CARMEN  660412  Floppy  25  1.5
2   09/03/2021  A935    809 00  CARMEN  660475  SSD 1   3
3   09/03/2021  A217    800 00  CONCEPCION  661070  DVD 1   15
4   09/03/2021  A217    800 00  CONCEPCION  662734  CD  3   36
5   09/03/2021  A218    801 00  ELVIRA  660192  Hard Disk   1   2
6   09/03/2021  A232    909 16  LORENZO 660343  Ata Disk    1   2
7   09/03/2021  A232    909 16  LORENZO 660475  SSD 1   3

where the first column is the index.

Since I want to group it by ID, I apply the following code:

gb = df.groupby('ID')
for k, gp in gb:
    print('key=' + str(k))
    print(gp)

This gives me the groups I'm interested in and lets me see how many rows each 'ID' has. The result is the following:

key=A217
        Fecha    ID  Alum Par    Nombre  Codigo Descrip Cantidad Tiempo
3  09/03/2021  A217   800  00  CONCEPCION  661070     DVD        1     15
4  09/03/2021  A217   800  00  CONCEPCION  662734      CD        3     36
key=A218
        Fecha    ID  Alum Par  Nombre  Codigo    Descrip Cantidad Tiempo
5  09/03/2021  A218   801  00  ELVIRA  660192  Hard Disk        1      2
key=A232
        Fecha    ID  Alum Par   Nombre  Codigo   Descrip Cantidad Tiempo
6  09/03/2021  A232   909  16  LORENZO  660343  Ata Disk        1      2
7  09/03/2021  A232   909  16  LORENZO  660475       SSD        1      3
key=A935
        Fecha    ID  Alum Par  Nombre  Codigo    Descrip Cantidad Tiempo
0  09/03/2021  A935   809  00  CARMEN  660192  Hard Disk        1      2
1  09/03/2021  A935   809  00  CARMEN  660412     Floppy       25    1.5
2  09/03/2021  A935   809  00  CARMEN  660475        SSD        1      3

My question is how I can iterate within the groups to get this information in an easier-to-read form for each group (ID):

   Summary
   The 'Alum', 'Nombre', has 'Cantidad' of 'Descrip'.
   End

There should be one line per row of the group. Is there a way to iterate within each group in pandas? Any suggestion is welcome.
Thanks in advance.
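
A minimal sketch of iterating inside each group, assuming the column names shown above (`ID`, `Alum`, `Nombre`, `Cantidad`, `Descrip`). The helper name `summarize` and the English wording of the template are illustrative choices, and the small inline frame replaces the text file.

```python
import pandas as pd

# Inline stand-in for the imported file; only the columns used in the summary.
df = pd.DataFrame(
    {
        "ID": ["A935", "A935", "A935", "A217", "A217", "A218", "A232", "A232"],
        "Alum": ["809", "809", "809", "800", "800", "801", "909", "909"],
        "Nombre": ["CARMEN", "CARMEN", "CARMEN", "CONCEPCION", "CONCEPCION",
                   "ELVIRA", "LORENZO", "LORENZO"],
        "Descrip": ["Hard Disk", "Floppy", "SSD", "DVD", "CD",
                    "Hard Disk", "Ata Disk", "SSD"],
        "Cantidad": ["1", "25", "1", "1", "3", "1", "1", "1"],
    },
    dtype=str,
)


def summarize(frame):
    """Return one summary block per ID, with one line per row of the group."""
    lines = []
    for key, group in frame.groupby("ID"):
        lines.append(f"Summary {key}")
        # Each group is itself a DataFrame, so its rows can be iterated.
        for row in group.itertuples(index=False):
            lines.append(
                f"The {row.Alum}, {row.Nombre}, has {row.Cantidad} of {row.Descrip}."
            )
        lines.append("End")
    return lines


summary_lines = summarize(df)
print("\n".join(summary_lines))
```

Each `gp` yielded by `groupby` is an ordinary DataFrame, so `itertuples`, `iterrows` or vectorized string operations all work on it; `itertuples` is used here because the column names are valid Python identifiers.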

python – Filtering a DataFrame based on two logical conditions, first one numpy array values, second one current day based

The following is just a working example.

I have a DataFrame containing a monotonically increasing function.
Some of the values are actuals, some are forecasts.

I need to pick specific actual values based on a set of milestones, and I must avoid taking forecast values from the DataFrame.

I wrote the following script.
It works, but I don't think it is very Pythonic.

I am self-taught and my working environment is Google Colab.

Expected Output

  • I would like to avoid the for loop and the if condition

  • Understand whether there is room for improvement in code quality

    # importing libraries
    import pandas as pd
    import numpy as np
    import datetime

    # working code mock-up
    th_array = np.arange(0, 11000, 1000)
    cumulated_array = np.arange(0, 5000, 185)

    # last date is 20 April 2021, written in ISO form to avoid day/month ambiguity
    df_index = pd.date_range(end="2021-04-20",
                             periods=len(cumulated_array))
    df = pd.DataFrame(data=cumulated_array, index=df_index,
                      columns=["cumulated"])

    current_day = pd.to_datetime(datetime.date.today())

    # filtering loop: for each milestone keep the first actual row above it
    rows = []
    for y in th_array:
        x = df[(df['cumulated'] > y) & (df.index < current_day)]
        if not x.empty:
            rows.append(x.iloc[0])
    # DataFrame.append was removed in pandas 2.0, so build the frame once
    df_filtered = pd.DataFrame(rows)
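
One way the loop and the `if` could be avoided is sketched below. It relies on the stated assumption that `cumulated` never decreases, so the first value above each milestone can be located with `np.searchsorted`. The helper name `first_rows_above` is hypothetical, and the current day is fixed so the result is reproducible.

```python
import numpy as np
import pandas as pd


def first_rows_above(df, thresholds, current_day, column="cumulated"):
    """For each threshold, return the first actual row whose value exceeds it.

    Assumes df[column] is sorted in non-decreasing order.
    """
    # Drop forecasts first: only rows strictly before current_day are actuals.
    actual = df[df.index < current_day]
    values = actual[column].to_numpy()
    # side="right" gives the position of the first value strictly greater
    # than each threshold.
    pos = np.searchsorted(values, thresholds, side="right")
    # A position equal to len(values) means no actual value is above it.
    pos = pos[pos < len(values)]
    return actual.iloc[pos]


th_array = np.arange(0, 11000, 1000)
cumulated_array = np.arange(0, 5000, 185)
df_index = pd.date_range(end="2021-04-20", periods=len(cumulated_array))
df = pd.DataFrame({"cumulated": cumulated_array}, index=df_index)

# Fixed date instead of datetime.date.today() so the example is deterministic.
df_filtered = first_rows_above(df, th_array, pd.Timestamp("2021-04-10"))
print(df_filtered)
```

As with the loop, two milestones that fall between the same pair of rows would select the same row twice; `drop_duplicates()` could be added if that is not wanted.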
    

beginner – Convert a matrix (“DataFrame”) to printable string

This question is about the V language.

pub fn (df DataFrame) str() string {
    // measure column widths
    mut width := []int{len: 1 + df.cols.len}
    mut str_sers := []Series{len: 1 + df.cols.len}
    mut sers := [df.index]
    sers << df.cols
    mut i := 0
    for ser in sers {
        str_sers[i] = ser.as_str()
        mut row_strs := str_sers[i].get_str().clone()
        row_strs << [ser.name]
        width[i] = maxlen(row_strs)
        i++
    }

    // columns
    pad := '                                        '
    mut row_strs := []string{len: sers.len}
    i = 0
    for ser in sers {
        w := width[i]
        row_strs[i] = pad[0..(w - ser.name.len)] + ser.name
        i++
    }
    mut s := row_strs.join('  ')

    // cell data
    l := df.len()
    if l == 0 {
        s += '\n(empty DataFrame)'
    }
    for r in 0 .. l {
        i = 0
        for ser in str_sers {
            w := width[i]
            row_strs[i] = pad[0..(w - ser.get_str()[r].len)] + ser.get_str()[r]
            i++
        }
        s += '\n' + row_strs.join('  ')
    }
    return s
}

While I understand that “V is a simple language,” I find myself repeating a lot of stuff. For instance the whole first block, “measure column widths,” could beautifully be accomplished by two simple list comprehension lines in Python. Is there some way to simplify the allocate array, initialize the index pointer, loop-assign, increase index pointer code? Any equivalent to Python’s enumerate() would go a long way…

What do I do about the pad thing? Sure, it will work in 99.9% of cases, but an arbitrary-length solution (such as ' '*n in Python) would be nice.

Performance-wise I’m not concerned for this function in particular, but I’m curious if there are any obvious blunders? It looks close to C to me, and it’s probably only cache handling that could be a problem? I’m assuming the three mutable arrays end up on the heap? If so: anything I can do about that?

I’m left with the sense that “simple language” in this case is equivalent to “hard-to-read implementation.” With just a few more features it could be both simple and easy to read? What is your impression?
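
For comparison only, here is a rough Python sketch of the same routine written the way the question describes, with comprehensions for the width measurement and `str.rjust` for arbitrary-length padding. It is not V code, and the data layout (an index name, a list of index values and a dict of columns) is a hypothetical stand-in for the `DataFrame` and `Series` types above.

```python
def frame_to_str(index_name, index, cols):
    """Render a column-oriented table as a right-aligned string.

    cols maps a column name to a list of values (hypothetical layout).
    """
    names = [index_name, *cols]
    # Convert every cell to text once, index column first.
    str_cols = [[str(v) for v in index]] + [
        [str(v) for v in col] for col in cols.values()
    ]
    # "Measure column widths" as a single comprehension.
    widths = [
        max(len(s) for s in [name, *col]) for name, col in zip(names, str_cols)
    ]

    # Header row, padded to any width without a fixed pad string.
    lines = ["  ".join(name.rjust(w) for name, w in zip(names, widths))]
    if not index:
        lines.append("(empty DataFrame)")
    # zip(*str_cols) walks the table row by row.
    for row in zip(*str_cols):
        lines.append("  ".join(cell.rjust(w) for cell, w in zip(row, widths)))
    return "\n".join(lines)


table = frame_to_str("i", [0, 1], {"a": [1, 22], "bb": ["x", "yy"]})
print(table)
```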

dataframe – python pandas, how to change value in pivot table

I'm trying to make a pivot table that shows which offer number each customer chose.

The original DataFrame is below:

   Customer  Offer #
1  Smith     2
2  Smith     24
3  Johnson   17
4  Johnson   24
5  Johnson   26

My code is:
df.pivot(index='Offer #', columns='Customer', values='Offer #')

and the result is:
         Smith  Johnson
Offer #
2        2      NaN
17       NaN    17
24       24     24
26       NaN    26

But I want to change NaN to 0 and every other value (2, 17, 24, …) to 1.
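
A possible sketch of the 0/1 table, assuming the frame has exactly the two columns shown (`Customer` and `Offer #`). It swaps `pivot` for `pd.crosstab`, which counts occurrences directly; `clip(upper=1)` is only there in case a customer appears with the same offer more than once.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Customer": ["Smith", "Smith", "Johnson", "Johnson", "Johnson"],
        "Offer #": [2, 24, 17, 24, 26],
    }
)

# crosstab counts how often each (offer, customer) pair occurs; missing
# pairs come out as 0 rather than NaN.
flags = pd.crosstab(df["Offer #"], df["Customer"]).clip(upper=1)
print(flags)
```

Starting from the existing pivot result, `pivot_result.notna().astype(int)` should give the same 0/1 values; note that `crosstab` orders the customer columns alphabetically.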

python – 3D visualization of the columns of a DataFrame

I want to make a 3D visualization of the columns beta1, beta2 and cost of the following DataFrame.

>>> df_thetas_value.head()
    beta0   beta1   beta2   cost
0   0.511275    0.404934    0.783799    2.820328e+07
1   34.486883   123.591098  143.711200  1.122274e+06
2   36.435332   163.909685  118.786188  8.688915e+05
3   40.692430   204.987832  113.643168  8.072207e+05
4   42.270578   237.838460  91.286946   6.112149e+05

So I wanted to do it as in this article:

# print(len(df_thetas_value))
world = np.zeros((len(df_thetas_value), len(df_thetas_value)))
for i, row in df_thetas_value.iterrows():
    i, j = row["beta1"], row["beta2"]
    world[i][j] = row["cost"]
world

But there is a problem: i and j are not integers in my case. So it returns:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-38-2732bcd8c872> in <module>
      3 for i, row in df_thetas_value.iterrows():
      4     i,j = row["beta1"], row["beta2"]
----> 5     world[i][j] = row["cost"]
      6 world
      7 

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I was thinking of building a dictionary of the positions {(1,1): (0.404934, 0.783799) ...

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

## Matplotlib Sample Code using 2D arrays via meshgrid
X, Y = np.meshgrid(df_thetas_value['beta1'].values, df_thetas_value['beta2'].values)
Z = df_thetas_value['cost'].values
fig = plt.figure()
ax = Axes3D(fig)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.set_zlim(-1.01, 1.01)

ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

fig.colorbar(surf, shrink=0.5, aspect=5)
plt.title('Original Code')
plt.show()

But Z does not have the right dimension: all the arguments need to be 2-dimensional. Indeed, this returns:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-b836e6304023> in <module>
     14 ax = Axes3D(fig)
     15 surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
---> 16                        linewidth=0, antialiased=False)
     17 ax.set_zlim(-1.01, 1.01)
     18 

/opt/conda/lib/python3.7/site-packages/mpl_toolkits/mplot3d/axes3d.py in plot_surface(self, X, Y, Z, norm, vmin, vmax, lightsource, *args, **kwargs)
   1554 
   1555         if Z.ndim != 2:
-> 1556             raise ValueError("Argument Z must be 2-dimensional.")
   1557         if np.any(np.isnan(Z)):
   1558             cbook._warn_external(

ValueError: Argument Z must be 2-dimensional.

I also tried:

ax.plot_trisurf(df_thetas_value['beta1'].values, df_thetas_value['beta2'].values, df_thetas_value['cost'].values, cmap=cm.jet, linewidth=0.2)
plt.show()

But nothing is displayed.

In the end I did:

from scipy.interpolate import griddata

# 2D-arrays from DataFrame
df = df_thetas_value
x1 = np.linspace(df['beta1'].min(), df['beta1'].max(), len(df['beta1'].unique()))
y1 = np.linspace(df['beta2'].min(), df['beta2'].max(), len(df['beta2'].unique()))

"""
x, y via meshgrid for vectorized evaluation of
2 scalar/vector fields over 2-D grids, given
one-dimensional coordinate arrays x1, x2,..., xn.
"""

x2, y2 = np.meshgrid(x1, y1)

# Interpolate unstructured D-dimensional data.
z2 = griddata((df['beta1'], df['beta2']), df['cost'], (x2, y2), method='cubic')

# Ready to plot
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(x2, y2, z2, rstride=1, cstride=1, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.set_zlim(-1.01, 1.01)

ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

fig.colorbar(surf, shrink=0.5, aspect=5)
plt.title('Meshgrid Created from 3 1D Arrays')

plt.show()

But only the background is shown.
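
One likely contributor, offered as an assumption rather than a confirmed diagnosis: `ax.set_zlim(-1.01, 1.01)` is carried over from the matplotlib sample, while `cost` is on the order of 1e5 to 1e7, so the surface would lie far outside the visible z range. The sketch below plots the scattered points with `plot_trisurf` on a fresh 3D axes and leaves the z limits to autoscaling. It uses only the five rows shown by `head()`.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cm

# The five rows shown by df_thetas_value.head().
df_thetas_value = pd.DataFrame(
    {
        "beta1": [0.404934, 123.591098, 163.909685, 204.987832, 237.838460],
        "beta2": [0.783799, 143.711200, 118.786188, 113.643168, 91.286946],
        "cost": [2.820328e07, 1.122274e06, 8.688915e05, 8.072207e05, 6.112149e05],
    }
)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# plot_trisurf triangulates the scattered (beta1, beta2) points itself,
# so no meshgrid or interpolation step is required.
surf = ax.plot_trisurf(
    df_thetas_value["beta1"].to_numpy(),
    df_thetas_value["beta2"].to_numpy(),
    df_thetas_value["cost"].to_numpy(),
    cmap=cm.coolwarm,
    linewidth=0.2,
)
ax.set_xlabel("beta1")
ax.set_ylabel("beta2")
ax.set_zlabel("cost")

# No set_zlim call: the limits are derived from the data.
z_low, z_high = ax.get_zlim()
# plt.show()  # uncomment in an interactive session
```

If the interpolated `plot_surface` version is preferred, removing the `set_zlim` line is worth trying first; cubic `griddata` also returns NaN outside the convex hull of the input points, which can leave holes in the surface.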

windows – how do i convert random number of vectors in a dataframe into columns?

   Date approved    Company requested 1       Company requested 2     Identifier Number
    "29/03/2021"  "1st part ABC Company 1 " "2nd part ABC company 1 " "Identifier 1 "  
    "29/32/2021"  "1st part ABC company 2 " "2nd part ABC company 2 " "Identifier 2 "  
    "29/03/2021"  "1st part ABC company 3 " "2nd part ABC company 3 " "Identifier 3 "  
    "29/03/2021"  "1st part ABC company 4 " "2nd part ABC company 4 " "Identifier 4 "  
    "29/03/2021"  "1st part ABC company 5 " "2nd part ABC company 5 " "Identifier 5 "  

All the data above were taken from a PDF; I loaded them into R and transposed them to arrive at the table above. It seems to me that they are individual vectors (correct me if I'm wrong, I'm still quite new to this language).

Anyway, I want to convert ‘Company requested 1’ and the following ‘1st part ABC company 1, 2, 3, etc.’ into a column and merge it with ‘2nd part ABC company 1’ and all the other ‘2nd part, etc.’ vectors.

I did see something along those lines at https://stackoverflow.com/questions/40921426/converting-array-to-matrix-in-r but the number of entries in my data varies from day to day. That is, I might have ‘1st part ABC company 1,2,3,4,5’ on the first day, ‘1st part ABC company 1,2,3’ on the 2nd day and ‘1st part ABC company 1,2,3,4,5,6,7,8’ on the 3rd day.

Can anyone help me out here?

collections – R Container for Multiple data.frame with a Brief Description of the Content of the data.frame

For a project I have a large dataset with a multitude of variables from different questionnaires.
Not all variables are required for all analyses.
So I created a preprocessing script in which subsets of variables (with and without abbreviations)
are created. However, it gets confusing pretty fast.
For convenience I decided to create an index_list, which holds all data.frames, as well as a data.frame called index_df, which holds the name of each data.frame and a brief description of each subversion of the dataset.

######################## Preparation / Loading #####################
# Clean Out Global Environment
rm(list=ls())

# Detach all unnecessary packages
pacman::p_unload()

# Load Required Libraries
pacman::p_load(dplyr, tidyr, gridExtra, conflicted)

# Load Data
#source("00_Preprocess.R")
#create simulation data instead
sub_data <- data.frame(x=c(2,3,5,1,6),y=c(20,30,10,302,5))
uv_newScale <- data.frame(item1=c(2,3,5,1,6),item2=c(3,5,1,3,2))

# Resolving conflicted Namespaces
conflict_prefer("filter", "dplyr")

# Creating an Index 
index_list <- list("sub_data"=sub_data,
                   "uv_newScale"=uv_newScale
                   )

index_df <- data.frame("Data.Frame"=c("sub_data",
                                        "uv_newScale"),
                       "Description"=c("Contains all sumscales + sociodemographics, names abbreviated",
                                       "Only sum scores for the UV Scale"))

I am wondering if there is a more efficient way to do so. Like saving the data.frames together with the description in one container?